# Multi-Agent Intelligent Software Development Platform: Architecture Design Document

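## 1. Project Overview

### 1.1 Goals

Build an **experimental platform for flexibly defining and orchestrating multiple agents** on software engineering tasks. The platform supports rapid switching between, and comparison of, multiple collaborative and competitive schemes. Systematic evaluation then identifies the best-performing agent interaction scheme, which forms the core of an academic paper.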

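### 1.2 Core Research Questions (from GQM v2)

| ID | Question |
|------|------|
| Q1 | Do multiple agents improve task completion quality while maintaining cross-layer representation consistency? |
| Q2 | Is collaboration actually effective? |
| Q3 | Does the collaboration structure affect performance? |
| Q4 | How do different interaction mechanisms (cooperative/competitive) affect exploration, adaptation, conflict handling, and system stability? |
| Q5 | Does human intervention come at an acceptable cost, and does it improve system reliability? |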

---

## 2. Overall System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│              Experiment Platform (Platform Layer)               │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Scheme Config   │  │ Run Scheduling  │  │ Eval Collection │  │
│  │ Scheme DSL      │  │ Orchestrator    │  │ Metrics         │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
│           │                    │                    │           │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Agent Runtime                       │  │
│  │        Agent A   ←→   Agent B   ←→   Agent C  ...         │  │
│  │                          ↕ (Human-in-the-loop)            │  │
│  │                  Human Intervention Gate                  │  │
│  └───────────────────────────────────────────────────────────┘  │
│           │                                                     │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                 Task & Environment Layer                  │  │
│  │         SWE-bench / HumanEval / custom task sets          │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

---

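## 3. Core Module Design

### 3.1 Agent Definition Layer (Agent Definition)

Each agent is defined declaratively in a configuration file (YAML or a Python dataclass), so no changes to the core code are required: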

```yaml
agent:
  id: "dev_agent_01"
  role: developer          # roles: analyst | developer | tester | reviewer | human
  model: claude-sonnet-4-6
  system_prompt: "..."
  tools:
    - code_edit
    - file_search
    - run_tests
  memory:
    type: shared            # shared | private (shared vs. private memory)
  interaction_mode: cooperative  # cooperative | competitive
```
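
The same definition can also be expressed as a Python dataclass. The sketch below is a minimal illustration that assumes the field names from the YAML example; the `AgentConfig` class, the `agent_from_dict` helper, and the validation rules are hypothetical and not a confirmed part of the platform.

```python
from dataclasses import dataclass, field

# Allowed values are copied from the comments in the YAML example.
ROLES = {"analyst", "developer", "tester", "reviewer", "human"}
MEMORY_TYPES = {"shared", "private"}
INTERACTION_MODES = {"cooperative", "competitive"}


@dataclass
class AgentConfig:
    """Hypothetical in-memory form of one `agent:` entry."""
    id: str
    role: str
    model: str
    system_prompt: str = ""
    tools: list[str] = field(default_factory=list)
    memory_type: str = "private"
    interaction_mode: str = "cooperative"

    def __post_init__(self) -> None:
        # Fail fast on values outside the documented enumerations.
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")
        if self.memory_type not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {self.memory_type!r}")
        if self.interaction_mode not in INTERACTION_MODES:
            raise ValueError(f"unknown interaction mode: {self.interaction_mode!r}")


def agent_from_dict(raw: dict) -> AgentConfig:
    """Build an AgentConfig from the dict a YAML parser would return."""
    a = raw["agent"]
    return AgentConfig(
        id=a["id"],
        role=a["role"],
        model=a["model"],
        system_prompt=a.get("system_prompt", ""),
        tools=list(a.get("tools", [])),
        memory_type=a.get("memory", {}).get("type", "private"),
        interaction_mode=a.get("interaction_mode", "cooperative"),
    )
```

Validating in `__post_init__` means a misspelled role or mode is rejected when the configuration is loaded rather than midway through an experiment run.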

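**Predefined roles:**

| Role | Responsibility |
|------|------|
| Analyst (requirements analysis) | Parses natural-language requirements and outputs a formal specification |
| Developer | Generates or modifies code according to the specification |
| Reviewer | Reviews code quality and consistency |
| Tester | Generates test cases and runs verification |
| Human Gate | Human intervention checkpoint; can approve, reject, or modify agent output |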

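### 3.2 Scheme Orchestration Layer (Scheme Orchestrator)

The topology and interaction pattern among agents are described in configuration, and the platform executes them automatically: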

```yaml
scheme:
  name: "requirements-dev-human-test"
  topology: pipeline          # pipeline | star | debate | parallel | graph
  steps:
    - agent: analyst
      input: raw_requirement
      output: formal_spec
    - agent: developer
      input: formal_spec
      output: code_patch
    - agent: human_gate
      trigger: always          # always | on_low_confidence | on_conflict | on_failure
      actions: [approve, reject, modify]
    - agent: tester
      input: code_patch
      output: test_result
```

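**Supported topologies (corresponding to M10 Topology Efficiency):**

- `pipeline`: linear pipeline; each step's output is the next step's input
- `star`: a central coordinator dispatches tasks to multiple worker agents
- `debate`: multiple agents propose solutions to the same problem, then vote or go to arbitration
- `parallel`: multiple agents run in parallel and their outputs are merged
- `graph`: arbitrary directed graph (the most general form)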
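
One way to see how the named topologies relate to `graph` is to expand each into a directed edge list. The sketch below assumes a hypothetical `topology_edges` helper and illustrative agent names. Treating the first agent as the hub, and routing `debate` proposals to a single arbiter, are assumptions made for the sketch rather than confirmed platform behavior. A `graph` scheme would supply its edges directly, so it is not handled here.

```python
def topology_edges(topology: str, agents: list[str]) -> list[tuple[str, str]]:
    """Expand a named topology into directed (sender, receiver) edges.

    For `star`, `debate`, and `parallel`, the first agent in `agents` is
    treated as the hub (coordinator, arbiter, or merger). This convention
    is an assumption made for the sketch.
    """
    if len(agents) < 2:
        return []
    if topology == "pipeline":
        # Each agent feeds the next one in order.
        return list(zip(agents, agents[1:]))
    hub, workers = agents[0], agents[1:]
    if topology == "star":
        # Hub dispatches to each worker and receives each result.
        return [(hub, w) for w in workers] + [(w, hub) for w in workers]
    if topology in ("debate", "parallel"):
        # Workers act independently; the hub only collects their outputs.
        return [(w, hub) for w in workers]
    raise ValueError(f"unsupported topology: {topology!r}")
```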

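### 3.3 Runtime Engine (Agent Runtime)

Responsibilities:
- Agent instantiation and lifecycle management
- Message routing (inter-agent communication)
- Tool execution (running code, file operations, test invocation)
- State snapshots (to support resuming from a checkpoint)
- End-to-end log collection (consumed by the evaluation layer)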

```
Task Input
    │
    ▼
Orchestrator.run(scheme, task)
    │
    ├── spawn agents
    ├── route messages per topology
    ├── invoke tools
    ├── check human gate (if configured)
    └── collect metrics at each step
    │
    ▼
Task Output + Metrics Log
```
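
The control flow above can be sketched for the `pipeline` topology. This is a minimal illustration in which agents are plain Python callables and the metrics log is a list of dicts. The `run_pipeline` function and the `Step` type are hypothetical stand-ins for `Orchestrator.run`; tool invocation and the human gate are omitted.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One pipeline step: read `input` from state, write `output` to state."""
    agent: str
    input: str
    output: str


def run_pipeline(
    steps: list[Step],
    agents: dict[str, Callable[[str], str]],
    task: dict[str, str],
) -> tuple[dict[str, str], list[dict]]:
    state = dict(task)          # named artifacts, e.g. raw_requirement
    log: list[dict] = []
    for index, step in enumerate(steps):
        started = time.perf_counter()
        # Route the named input artifact to the agent and store its output.
        state[step.output] = agents[step.agent](state[step.input])
        log.append({
            "step": index,
            "agent_id": step.agent,
            "event": "agent_output",
            "elapsed_s": time.perf_counter() - started,
        })
    return state, log
```

Keeping artifacts in a dictionary keyed by name mirrors the `input` / `output` fields of the scheme configuration, so a step can consume any earlier artifact and not only the one produced immediately before it.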

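### 3.4 Human Intervention Mechanism (Human-in-the-Loop Gate)

This corresponds to GQM Q5 (M20-M23). Each human intervention point has a configurable trigger condition:

| Trigger | Description |
|----------|------|
| `always` | Every run passes through human review |
| `on_low_confidence` | Fires when agent confidence falls below a threshold |
| `on_conflict` | Fires when agents conflict with one another |
| `on_failure` | Fires when tests fail |

Intervention action types:
- **Approve**: accept as is
- **Reject**: send back for rework, with feedback
- **Modify**: the human edits the agent output directly
- **Redirect**: re-specify the strategy or the file scope

The platform records the timestamp, action type, and edit size of every intervention (used to compute HIF, HTC, HER, and AR).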
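
The four trigger conditions can be sketched as a single decision function. The `GateContext` type, the `should_trigger` name, and the default threshold of 0.5 are hypothetical; the document does not specify how confidence, conflict, or failure signals are represented.

```python
from dataclasses import dataclass


@dataclass
class GateContext:
    """Signals available when a human gate is reached (hypothetical shape)."""
    confidence: float = 1.0
    has_conflict: bool = False
    tests_failed: bool = False


def should_trigger(trigger: str, ctx: GateContext, threshold: float = 0.5) -> bool:
    """Decide whether the gate fires; `threshold` is an illustrative default."""
    if trigger == "always":
        return True
    if trigger == "on_low_confidence":
        return ctx.confidence < threshold
    if trigger == "on_conflict":
        return ctx.has_conflict
    if trigger == "on_failure":
        return ctx.tests_failed
    raise ValueError(f"unknown trigger: {trigger!r}")
```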

### 3.5 Evaluation Collection Layer (Metrics Collector)

Automatically computes the 23 GQM-defined metrics from run logs:

**Q1 Quality metrics:**
- M1: Task Success Rate (SWE-bench)
- M2: Code Correctness pass@k (HumanEval)
- M3: Requirement Formalization Consistency (TVR tuple-level F1 + CPR)
- M4: Verification Alignment (STC + ABC)
- M5: Implementation Alignment (Recall/Precision on changed files)

**Q2 Collaboration effectiveness:**
- M6: Milestone KPI = n_j / M
- M7: Communication Score (LLM-as-a-judge, 0-5)
- M8: Planning Score (LLM-as-a-judge, 0-5)
- M9: Collaboration Efficiency = task success / CoordinationCost

**Q3 Topology:**
- M10: Topology Efficiency (success-rate comparison across topologies)
- M11: Convergence Speed (mean number of rounds to first success)
- M12: Scalability (performance change SG across different agent counts)
- M13: Structural Coordination Cost (Redundancy + Latency + Cost)

**Q4 Interaction mechanisms:**
- M14: Competitive Success Rate
- M15: Strategy Adaptation (ESA)
- M16: Conflict Resolution Rate (CRR)
- M17: Mechanism Gap (performance difference, cooperative vs. competitive)
- M18: Stability (Var(performance))
- M19: Robustness (Drop value)

**Q5 Human intervention:**
- M20: Human Intervention Frequency (HIF)
- M21: Human Time Cost (HTC)
- M22: Human Edit Ratio (HER)
- M23: Acceptance Rate (AR + ValueRate)
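
As an illustration of M5, file-level precision and recall can be computed by comparing the set of files an agent changed with the set changed by a reference patch. The sketch assumes this set-based reading is what M5 intends; the function name and the convention of returning 0.0 for an empty set are hypothetical choices.

```python
def file_precision_recall(changed: set[str], reference: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of agent-changed files against a reference set."""
    overlap = len(changed & reference)
    precision = overlap / len(changed) if changed else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall
```

For example, if the agent changed `a.py` and `b.py` while the reference patch touched `a.py`, `c.py`, and `d.py`, one file overlaps, giving precision 1/2 and recall 1/3.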

---

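## 4. Collaboration Scheme Design

### Scheme A: Requirements → Development → Human → Test Pipeline

```
Requirement input → [Analyst] → formal spec
                  → [Developer] → code patch
                  → [Human Gate] → review/modify
                  → [Tester] → test result → loop or finish
```

Characteristics: human intervention comes right after development, providing a check at the most critical stage, where code is produced.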

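### Scheme B: Competing Developers + Reviewer Arbitration

```
Requirement input → [Analyst]
                  → [Developer×N] generate N candidate solutions in parallel
                  → [Reviewer] score and select the best, or merge
                  → [Tester] → verification
```

Supports the study of M14 (competitive success rate) and M17 (mechanism gap).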

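### Scheme C: Iterative Refinement via Debate

```
Requirement input → [Analyst]
                  → [Developer] first draft → [Reviewer] critique
                  → [Developer] revision → [Reviewer] critique again
                  → ... (multiple rounds)
                  → [Human Gate] (optional) → [Tester]
```

Supports the study of M11 (convergence speed) and M15 (strategy adaptation).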

### Scheme D: Star Topology with Central Coordination

```
Requirement input → [Coordinator]
                  ├→ [Dev-Frontend]
                  ├→ [Dev-Backend]
                  └→ [Dev-Database]
                  → Coordinator merges → [Tester]
```

Supports the study of M10 (topology efficiency) and M12 (scalability).

### Scheme E: Single-Agent Baseline

```
Requirement input → [Single LLM Agent] → code → [Tester]
```

Serves as the control group (baseline) against which all multi-agent schemes are compared.

---

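## 5. Data Input and Metrics Collection Implementation

### 5.1 Unified Task Input Format (TaskInput)

Every dataset is converted to the same structure at load time. Agents see only `TaskInput` and are unaware of which dataset a task came from: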

```python
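from dataclasses import dataclass


@dataclass
class TaskInput:
    task_id: str          # unique identifier, e.g. "HumanEval/42" or "MBPP/301"
    description: str      # natural-language problem description (fed to the agent)
    entry_point: str      # function name (used to assemble the test call)
    tests: list[str]      # list of unit-test code snippets
    source: str           # "humaneval" | "mbpp"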
```

### 5.2 Dataset Loaders

```python
# HumanEval: loaded from the official openai/human-eval format
class HumanEvalLoader:
    def load(self) -> list[TaskInput]:
        # Read HumanEval.jsonl
        # Field mapping: prompt→description, entry_point→entry_point, test→tests
        ...

# MBPP: loaded from the official google-research/mbpp format
class MBPPLoader:
    def load(self, split="test") -> list[TaskInput]:
        # Read mbpp.jsonl; split="test" selects 500 problems
        # Field mapping: text→description, test_list→tests (3 entries)
        ...
```
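
The HumanEval field mapping can be sketched as follows. The record keys (`task_id`, `prompt`, `entry_point`, `test`) follow the published HumanEval JSONL format. The `parse_humaneval` function is a hypothetical helper, `TaskInput` is repeated so the block is self-contained, and wrapping the single `test` string in a one-element list is an assumption about how `tests` is populated.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskInput:
    task_id: str
    description: str
    entry_point: str
    tests: list[str]
    source: str


def parse_humaneval(lines: Iterable[str]) -> list[TaskInput]:
    """Convert HumanEval JSONL lines into TaskInput records."""
    tasks = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        rec = json.loads(line)
        tasks.append(TaskInput(
            task_id=rec["task_id"],
            description=rec["prompt"],
            entry_point=rec["entry_point"],
            tests=[rec["test"]],   # one string holding the whole check() function
            source="humaneval",
        ))
    return tasks
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, and unit tests can pass an in-memory list of lines.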

### 5.3 Code Execution Sandbox

Agent-generated code must run in an isolated environment to avoid side effects:

```
Generated code
   │
   ▼
SandboxRunner
   ├── write to a temporary file
   ├── docker run --rm --network none --memory 256m
   │     python -c "run code + tests"
   ├── capture stdout / stderr / exit code
   └── return ExecutionResult(passed, error, runtime_ms)
```
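
The sandbox flow can be sketched with the standard `subprocess` module. The `--rm`, `--network none`, and `--memory 256m` flags come from the diagram above. The image name `python:3.11-slim`, the read-only mount at `/work`, the file name `solution.py`, and the 30-second timeout are assumptions introduced for illustration.

```python
import subprocess
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExecutionResult:
    passed: bool
    error: str
    runtime_ms: float


def docker_command(workdir: str, image: str = "python:3.11-slim") -> list[str]:
    """Build the argv for an isolated run; `image` is an assumed default."""
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no network access
        "--memory", "256m",           # memory cap from the diagram
        "-v", f"{workdir}:/work:ro",  # mount the code read-only (assumption)
        image, "python", "/work/solution.py",
    ]


def run_in_sandbox(code: str, tests: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Write code + tests to a temp file, run it in Docker, capture the result."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(code + "\n\n" + tests + "\n")
        started = time.perf_counter()
        try:
            proc = subprocess.run(
                docker_command(workdir),
                capture_output=True, text=True, timeout=timeout_s,
            )
            passed, error = proc.returncode == 0, proc.stderr
        except subprocess.TimeoutExpired:
            passed, error = False, "timeout"
        runtime_ms = (time.perf_counter() - started) * 1000.0
    return ExecutionResult(passed=passed, error=error, runtime_ms=runtime_ms)
```

Note that the timeout here stops the `docker` client process only; a container that is still running after a timeout may need explicit cleanup, which this sketch does not attempt.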

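**pass@k computation:** For each problem, generate n candidate solutions, record the number c that pass, and apply the formula below, where C(a, b) denotes the binomial coefficient: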

```
pass@k = 1 - C(n-c, k) / C(n, k)
```

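Single-agent schemes (one solution per problem) report pass@1 directly. Multi-agent and debate schemes naturally produce multiple candidates and can report pass@k (k = 1, 3, 5), provided each problem has at least k candidates (n ≥ k).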
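
The pass@k formula translates directly into Python using `math.comb`. The function name `pass_at_k` and the argument checks are illustrative. As a worked example, n = 5 candidates with c = 2 passing gives pass@1 = 1 - C(3, 1) / C(5, 1) = 1 - 3/5 = 0.4.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Python integers have arbitrary precision, so the binomial coefficients do not overflow even for large n.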

### 5.4 Run Log Structure (EventLog)

The runtime engine writes a structured event at every key point; these events are the data source for all metrics:

```jsonc
{
  "task_id": "MBPP/301",
  "scheme": "debate",
  "round": 2,
  "event": "agent_output",        // agent_output | tool_call | human_action | test_result
  "agent_id": "developer_01",
  "timestamp": 1713500000.123,
  "tokens_in": 512,
  "tokens_out": 384,
  "content": "...",               // agent output text / code
  "metadata": {
    "confidence": 0.82,           // agent self-reported confidence (drives the on_low_confidence trigger)
    "milestone": "code_generated" // milestone marker, used for the M6 KPI computation
  }
}
```
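
Writing events in this shape can be sketched with a dataclass and one JSON object per line. The `Event` class and the `append_event` function are hypothetical names; the fields mirror the example above.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import IO


@dataclass
class Event:
    task_id: str
    scheme: str
    round: int
    event: str            # agent_output | tool_call | human_action | test_result
    agent_id: str
    tokens_in: int = 0
    tokens_out: int = 0
    content: str = ""
    metadata: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def append_event(stream: IO[str], event: Event) -> None:
    """Write one event as a single JSON line (JSONL)."""
    stream.write(json.dumps(asdict(event), ensure_ascii=False) + "\n")
```

One object per line keeps the log appendable during a run and lets the metrics stage read it back line by line without loading the whole file.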

### 5.5 Metrics Collection Pipeline (MetricsCollector)

After an experiment finishes, all metrics are computed in batch from the EventLog:

```
EventLog (JSONL file)
   │
   ▼
MetricsCollector.compute(log_path, task_results)
   │
   ├── pass@k          ← aggregated from ExecutionResult
   ├── M6 KPI          ← count of milestones each agent contributed to
   ├── M7 CScore       ← sampled messages → LLM-as-a-judge score (0-5)
   ├── M8 PScore       ← same as above, for planning messages
   ├── M9 CollabEff    ← pass@1 / CoordinationCost
   │                      CoordinationCost = α·messages + β·tokens + γ·API calls + δ·time
   ├── M11 CS          ← mean round number at first pass, per problem
   ├── M12 SG          ← cross-experiment comparison (rate of change in pass@1 for 1/3/5/7 agents)
   ├── M13 Cost        ← total tokens + API calls + total duration
   ├── M18 Stability   ← Var(pass@1) across repeated runs of the same scheme
   └── M20-M23         ← HIF / HTC / HER / AR computed from human_action events
   │
   ▼
metrics.json (one per experiment, written to a SQLite summary table)
```
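
Three of the computations above (CoordinationCost, M11, and M18) can be sketched as plain functions. The default weights of 1.0, the use of a `passed` key inside `metadata` on `test_result` events, and the choice of population variance are all assumptions; the document does not fix any of them.

```python
from statistics import mean, pvariance


def coordination_cost(msgs: int, tokens: int, api_calls: int, seconds: float,
                      alpha: float = 1.0, beta: float = 1.0,
                      gamma: float = 1.0, delta: float = 1.0) -> float:
    """CoordinationCost = α·messages + β·tokens + γ·API calls + δ·time."""
    return alpha * msgs + beta * tokens + gamma * api_calls + delta * seconds


def convergence_speed(events: list[dict]) -> float:
    """M11: mean round of the first passing test_result, over solved tasks."""
    first_pass: dict[str, int] = {}
    for e in events:
        # `metadata["passed"]` is an assumed field, not shown in the log example.
        if e["event"] == "test_result" and e["metadata"].get("passed"):
            tid = e["task_id"]
            first_pass[tid] = min(e["round"], first_pass.get(tid, e["round"]))
    if not first_pass:
        raise ValueError("no task passed; convergence speed is undefined")
    return mean(list(first_pass.values()))


def stability(pass_at_1_runs: list[float]) -> float:
    """M18: variance of pass@1 across repeated runs of one scheme."""
    return pvariance(pass_at_1_runs)
```

Tasks that never pass are excluded from the M11 mean in this sketch, so M11 should be read alongside the success rate and not in isolation.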

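**LLM-as-a-judge implementation (M7/M8):**
Inter-agent messages are extracted from the EventLog and sent in batches to a judge LLM (kept separate from the LLM used in the experiment, to avoid self-evaluation bias), which returns a 0-5 score according to a scoring rubric. To control cost, each experiment scores a random 20% sample of messages.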
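
The 20% sampling step can be sketched as below. The `sample_messages` name, the fixed seed for reproducibility, and the rule of always scoring at least one message are assumptions added for illustration; the call to the judge LLM itself is not shown.

```python
import random


def sample_messages(messages: list[dict], fraction: float = 0.2, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of messages for the judge LLM."""
    if not messages:
        return []
    # Always score at least one message so small logs are not skipped.
    k = max(1, round(len(messages) * fraction))
    return random.Random(seed).sample(messages, k)
```

Seeding a local `random.Random` instance makes the sample repeatable across reruns of the metrics stage without affecting any other random state in the process.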

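### 5.6 Dataset Specifications

| Dataset | Size | Purpose | Metrics |
|--------|------|------|----------|
| HumanEval (OpenAI, 2021) | 164 problems | Function-level code generation; benchmarked against SOTA | M2 pass@k |
| MBPP (Google, 2021) | 500 problems (test split) | 3 unit tests per problem; enough problems to support ablations | M2 pass@k, M11, M12 |

> Note: other datasets (SWE-bench, etc.) will be added once confirmed by the advisor.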

---

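## 6. Technology Choices

| Module | Choice |
|------|------|
| Agent framework | In-house (modeled on AutoGen's design, to retain maximum flexibility) |
| LLM access | Anthropic Claude API (claude-sonnet-4-6 / opus-4-7) |
| Tool execution sandbox | Docker container isolation |
| Message passing | In-memory queue (single machine) / Redis (distributed extension) |
| Metrics storage | SQLite (experiment records) + JSON logs |
| Evaluation scripts | Python, interfacing with the official SWE-bench / HumanEval evaluation harnesses |
| Visualization (optional) | Gradio or a simple web UI showing agent communication flow and metrics |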