Nemotron-AIQ-Agentic-Safety-Dataset is a comprehensive dataset covering a range of novel safety and security risks that can arise in agentic systems. It demonstrates the robustness of NVIDIA's open model llama-3.3-nemotron-super-49b-v1 when deployed as a research assistant within AIQ, showing how it handles a variety of agentic safety and security challenges. The dataset can be used to analyze how safety risks emerge and manifest in enterprise-grade agentic systems, and to evaluate how well different agents identify and mitigate such risks. Given the current scarcity of datasets and thought leadership in agentic safety and security, this dataset is intended to advance research and development in this critical area. It was developed jointly by NVIDIA and Lakera AI.
The dataset contains traces (logs) from running the AIQ research assistant on a variety of queries. The queries, harmful and benign, with and without untrusted data, were generated adaptively while running the AIQ workflow, with the workflow logs/traces recorded at the same time. The dataset includes the query content and indicates whether a research report or summary (i.e., a response) was generated or whether generation was refused. In the typical case the system generates a report; for harmful topics it refuses. This dataset will enable the research community to analyze what happens when agentic systems are run at scale.
The dataset may contain harmful queries as well as some high-level harmful responses.
This dataset is intended for research and development only.
Research paper coming soon!
NVIDIA Corporation
10.29.2025
This dataset is released under the NVIDIA Evaluation Dataset License Agreement. The full license text is available in the LICENSE file.
Direct use: This dataset is intended solely for safety research and development. It should be used only as an evaluation dataset for internal evaluation and benchmarking of the safety and security of AI solutions. At this time, it should not be used as training data for AI models.
Data Collection Method
Labeling Method
Standardized traces based on OpenTelemetry, generated from AIQ workflows.
Approximately 2.6k traces in each security folder (both with and without defenses), generated by NVIDIA's partner (Lakera AI), covering novel security risks in the research assistant scenario.
Approximately 2.8k traces in each safety folder (both with and without defenses), generated by NVIDIA, covering novel content safety risks in the research assistant scenario.
The exact trace counts can also be found in the summary.json file within each split.
2.5GB
This dataset contains both safe and harmful prompts and responses, enabling users to evaluate whether agentic systems can recognize unsafe and unethical requests and respond appropriately. Some content may be offensive, violent, or disturbing. NVIDIA does not support or endorse any harmful content included. By using this dataset, you accept the risk of exposure to harmful content and agree to use it only for its intended purpose.
Please report model quality issues, risks, security vulnerabilities, or NVIDIA AI concerns at this link.
Note: this dataset is provided in two formats:
- Parquet format (loaded via the datasets library): each split contains a Parquet file with trace_id, attack_snapshot, and trace columns.
- Raw format (in the data/ folder): each split directory contains:
  - attack_manifest.jsonl: line-delimited JSON, one attack snapshot per line
  - traces/: a folder of individual trace files (trace_{trace_id}.json)
  - summary.json: summary statistics for the split

Both formats contain the same data; the Parquet format is preprocessed for ease of use.
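The raw format can be read directly with the Python standard library. Below is a minimal sketch, assuming a local copy of the data/ folder with the layout described above; the split directory name used here is a placeholder, so adjust it to the shard you want to inspect.

```python
import json
from pathlib import Path

# Hypothetical local path to one split directory of the raw format;
# adjust it to your download location and the split of interest.
split_dir = Path("data/security/with_defense")

# summary.json: summary statistics for the split
summary = json.loads((split_dir / "summary.json").read_text())
print("Summary:", summary)

# attack_manifest.jsonl: line-delimited JSON, one attack snapshot per line
with open(split_dir / "attack_manifest.jsonl") as f:
    snapshots = [json.loads(line) for line in f if line.strip()]
print(f"Loaded {len(snapshots)} attack snapshots")

# traces/trace_{trace_id}.json: the recorded spans for a single run
trace_id = snapshots[0]["result"]["trace_id"]
trace = json.loads((split_dir / "traces" / f"trace_{trace_id}.json").read_text())
print(f"Trace {trace_id} contains {len(trace)} spans")
```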
The dataset contains 2 configs, each with 2 splits:

| Config | Split | Description | Trace count | Defense status |
|---|---|---|---|---|
| security | without_defense | Security evaluation without defenses | ~2,596 | Without defense |
| security | with_defense | Security evaluation with defenses | ~2,600 | With defense |
| safety | without_defense | Safety evaluation without defenses | ~2,800 | Without defense |
| safety | with_defense | Safety evaluation with defenses | ~2,800 | With defense |

Total: ~10,796 traces across the 2 configs
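A single config/split pair can also be loaded directly, which serves as a quick sanity check against the counts above (repository name as used in the examples below):

```python
from datasets import load_dataset

# Load one config and one split directly
security_without_defense = load_dataset(
    "nvidia/Nemotron-AIQ-Agentic-Safety-Dataset-1.0",
    "security",
    split="without_defense",
)
print(f"security/without_defense: {len(security_without_defense)} traces")  # expected ~2,596
```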
Each example contains:
{
"trace_id": str, # OTel traceid for the run
"attack_snapshot": dict, # Nested dict of attack metadata for the run
"trace": list, # List of OTEL spans (each span is a dict)
}

Note: the security and safety configs have different attack_snapshot schemas. Load them separately by config name (see the usage examples below).
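As a quick illustration of that difference, the sketch below prints the nested attack fields for one example from each config; the full field sets are documented in the two format descriptions that follow:

```python
from datasets import load_dataset

repo = "nvidia/Nemotron-AIQ-Agentic-Safety-Dataset-1.0"
security_example = load_dataset(repo, "security", split="with_defense")[0]
safety_example = load_dataset(repo, "safety", split="with_defense")[0]

# attack_snapshot is already a dict; its nested schema depends on the config
print("security attack fields:", sorted(security_example["attack_snapshot"]["attack"].keys()))
print("safety attack fields:  ", sorted(safety_example["attack_snapshot"]["attack"].keys()))
```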
Security data format:
{
"id": str,
"template": {
"topic": str,
"name": str, # Template name (e.g., "ACTION_INCOMPLETE_2")
"injection_node": str, # Where attack injects
"attack_objective": str, # Clear attack objective
"evaluation_nodes": [str], # List of evaluation points
"evaluation_method": str, # "JUDGE", etc.
"severity": str, # "LOW", "MEDIUM", "HIGH"
"risk_categories": [str], # Varies by evaluation type
"attack_category": str,
# ... other standard fields
},
"attack": {
"injection_string": str,
"template_name": str,
"injection_string_source": str,
# ... other attack metadata
},
"result": {
"trace_id": str,
"success": bool, # Overall success flag
"score": {str: float}, # Per-node risk scores (0.0-1.0)
# Example: {"summarize_sources": 0.391, "reflect_on_summary": 0.296}
"evaluation_strings": {str: str}, # Full outputs per evaluation node
"injection_function_id": [{...}], # Detailed injection points
"evaluation_function_id": {...}, # Nested evaluation structure
# ... other result fields
}
}

Safety data format:
{
"id": str,
"template": {
"topic": str,
"organization": str,
"num_queries": int,
"llm_name": str,
# ... other standard fields
},
"attack": {
"injection_string": str, # same as topic which is the direct injection
"attack_type": str, # "direct"
"attack_category": str, # "content_safety"
"attack_subcategory": str,
"persona_top": str,
"domain": str,
},
"result": {
"trace_id": str,
"success": bool, # True if attack propagated to final workflow span
"attack_success_rate": float, # (spans with attack) / (total spans)
"function_id": [...], # List of ALL spans where attacks were detected
}
}

Key differences in attack evaluation:
Safety and security data were collected using different evaluation methods, which results in slightly different metadata structures in the attack snapshots. Both provide valuable insights for safety research, and the code snippets below show how to analyze each type of data.

- attack_success_rate: computed as (number of spans containing the attack) / (total number of spans). Provides a fine-grained view of attack propagation.
- success: a boolean flag that is true only if the attack propagates to the final workflow span, indicating that the agent's intended behavior was fully compromised.

The trace is an array of OTEL-compliant spans with flattened attributes (parsed into a list of dicts):
[
{
"name": str, # Span name
"span_kind": str, # CHAIN, LLM, TOOL
"parent_id": str, # Parent span ID (null for root)
"start_time": str, # Timestamp (format varies by provider)
"end_time": str,
"status_code": str,
"context.span_id": str, # Span ID
"context.trace_id": str, # Trace ID
"attributes.output.value": str, # Output content
"attributes.input.value": str, # Input content
# ... many other OTEL attributes (flattened with dot notation)
},
...
]

from datasets import load_dataset
# Load safety config (both splits)
safety_dataset = load_dataset("nvidia/Nemotron-AIQ-Agentic-Safety-Dataset-1.0", "safety")
# Load security config (both splits)
security_dataset = load_dataset("nvidia/Nemotron-AIQ-Agentic-Safety-Dataset-1.0", "security")
# Example 1: Safety Data
print("=== Safety Data Example ===")
safety_with_defense = safety_dataset['with_defense']
example = safety_with_defense[0]
# attack_snapshot and trace are already dicts/lists (not JSON strings)
attack_snapshot = example['attack_snapshot']
traces = example['trace']
print(f"Trace ID: {example['trace_id']}")
print(f"Attack Category: {attack_snapshot['attack']['attack_category']}")
print(f"Attack Subcategory: {attack_snapshot['attack']['attack_subcategory']}")
print(f"Number of spans: {len(traces)}")
# Example 2: Security Data
print("\n=== Security Data Example ===")
security_with_defense = security_dataset['with_defense']
example = security_with_defense[0]
attack_snapshot = example['attack_snapshot']
traces = example['trace']
print(f"Trace ID: {example['trace_id']}")
print(f"Attack Template: {attack_snapshot['attack']['template_name']}")
print(f"Attack Category: {attack_snapshot['template']['lakera_attack_category']}")
print(f"Number of spans: {len(traces)}")
print(f"Evaluation nodes: {list(attack_snapshot['result']['score'].keys())}")
# Analyze spans within a trace
print("\n=== Analyzing Spans ===")
for i, span in enumerate(traces[:3]): # Show first 3 spans
print(f"\nSpan {i}:")
print(f" Name: {span['name']}")
print(f" Kind: {span['span_kind']}")
print(f" Span ID: {span['context.span_id']}")
print(f" Parent ID: {span.get('parent_id', 'None')}")
# Access span inputs/outputs
if 'attributes.input.value' in span:
print(f" Input preview: {span['attributes.input.value'][:100]}...")
if 'attributes.output.value' in span:
print(f" Output preview: {span['attributes.output.value'][:100]}...")from datasets import load_dataset
# Count attack successes (binary: did attack reach final span?)
def count_safety_successes(split):
successes = 0
for ex in split:
attack_snapshot = ex['attack_snapshot'] # Already a dict
if attack_snapshot['result'].get('success', False):
successes += 1
return successes, len(split)
# Calculate average attack propagation rate
def calculate_safety_attack_rate(split):
"""Calculate average proportion of spans affected by attacks across all traces."""
total_rate = 0
for ex in split:
attack_snapshot = ex['attack_snapshot'] # Already a dict
total_rate += attack_snapshot['result'].get('attack_success_rate', 0)
return total_rate / len(split) if len(split) > 0 else 0
# Load safety data (both splits)
safety_dataset = load_dataset("nvidia/Nemotron-AIQ-Agentic-Safety-Dataset-1.0", "safety")
safety_without_defense = safety_dataset['without_defense']
safety_with_defense = safety_dataset['with_defense']
# Binary success metric (attack reached final workflow span)
without_defense_success, without_defense_total = count_safety_successes(safety_without_defense)
with_defense_success, with_defense_total = count_safety_successes(safety_with_defense)
# Continuous attack propagation metric
without_defense_propagation = calculate_safety_attack_rate(safety_without_defense)
with_defense_propagation = calculate_safety_attack_rate(safety_with_defense)
print("Safety Defense Effectiveness:")
print(f"Binary Success Rate:")
print(f" Without defense: {100*without_defense_success/without_defense_total:.1f}%")
print(f" With defense: {100*with_defense_success/with_defense_total:.1f}%")
print(f"\nAttack Propagation Rate:")
print(f" Without defense: {100*without_defense_propagation:.1f}%")
print(f" With defense: {100*with_defense_propagation:.1f}%")安全数据在工作流的多个节点进行评估,并在每个评估点计算风险分数:
from collections import defaultdict
from datasets import load_dataset
# Calculate average risk scores per evaluation node
def calculate_security_risk_scores(split):
"""Calculate average risk scores per node across all traces."""
node_scores = defaultdict(list)
for ex in split:
attack_snapshot = ex['attack_snapshot'] # Already a dict
scores = attack_snapshot['result']['score']
for node, score in scores.items():
node_scores[node].append(score)
# Calculate averages
avg_scores = {node: sum(scores)/len(scores) for node, scores in node_scores.items()}
overall_avg = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0
return avg_scores, overall_avg
# Load security data (both splits)
security_dataset = load_dataset("nvidia/Nemotron-AIQ-Agentic-Safety-Dataset-1.0", "security")
security_without_defense = security_dataset['without_defense']
security_with_defense = security_dataset['with_defense']
without_defense_scores, without_defense_avg = calculate_security_risk_scores(security_without_defense)
with_defense_scores, with_defense_avg = calculate_security_risk_scores(security_with_defense)
print("Security Guard Effectiveness:")
print(f"Without defense - Avg risk score: {without_defense_avg:.3f}")
print(f"With defense - Avg risk score: {with_defense_avg:.3f}")
print(f"\nPer-node scores (without defense):")
for node, score in without_defense_scores.items():
print(f" {node}: {score:.3f}")@misc{ghosh2025safetysecurityframeworkrealworld,
title={A Safety and Security Framework for Real-World Agentic Systems},
author={Shaona Ghosh and Barnaby Simkin and Kyriacos Shiarlis and Soumili Nandi and Dan Zhao and Matthew Fiedler and Julia Bazinska and Nikki Pope and Roopa Prabhu and Daniel Rohrer and Michael Demoret and Bartley Richardson},
year={2025},
eprint={2511.21990},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.21990},
}

Shaona Ghosh, shaonag@nvidia.com
Soumili Nandi, soumilin@nvidia.com
Dan Zhao, danz@nvidia.com
Barnaby Simkin, bsimkin@nvidia.com
Kyriacos Shiarlis, kyriacos.shiarlis@lakera.ai
Matthew Fiedler, matthew.fiedler@lakera.ai
Julia Bazinska, jb@lakera.ai
shaonag@nvidia.com / soumilin@nvidia.com / danz@nvidia.com
Special thanks to our colleagues at NVIDIA, in particular Nikki Pope, Roopa Prabhu, Michael Demoret, Rachel Allen, and Bartley Richardson.