csf123321/multi-agent

csf123321 688d247888 Update docs/draft.md

2026-04-21 10:37:53 +00:00


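Multi-Agent Intelligent Software Development Platform: Architecture Design Document

1. Project Overview

1.1 Goal

Build an experimental platform for software engineering tasks in which multiple agents can be flexibly defined and orchestrated. The platform supports fast switching between, and comparison of, multiple cooperative and competitive schemes, with the aim of identifying the best-performing agent interaction scheme through systematic evaluation.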

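1.2 Core Research Questions (from GQM v2)

ID	Question
Q1	Do multiple agents improve task completion quality while maintaining cross-layer representation consistency?
Q2	Is collaboration actually effective?
Q3	Does the collaboration structure affect performance?
Q4	How do different interaction mechanisms (cooperative vs. competitive) affect exploration, adaptation, conflict handling, and system stability?
Q5	Does human intervention come at an acceptable cost while improving system reliability?

2. Overall System Architecture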

┌─────────────────────────────────────────────────────────┐
│          Experiment Platform (Platform Layer)           │
│                                                         │
│  ┌─────────────┐   ┌──────────────┐   ┌─────────────┐   │
│  │Scheme Config│   │  Scheduling  │   │ Evaluation  │   │
│  │ Scheme DSL  │   │ Orchestrator │   │   Metrics   │   │
│  └─────────────┘   └──────────────┘   └─────────────┘   │
│           │                │                 │          │
│  ┌─────────────────────────────────────────────────┐    │
│  │                  Agent Runtime                  │    │
│  │   Agent A   ←→   Agent B   ←→   Agent C   ...   │    │
│  │                    ↕ (Human-in-the-loop)        │    │
│  │               Human Intervention Gate           │    │
│  └─────────────────────────────────────────────────┘    │
│           │                                             │
│  ┌─────────────────────────────────────────────────┐    │
│  │             Task & Environment Layer            │    │
│  │    SWE-bench / HumanEval / custom task sets     │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

3. Core Module Design

3.1 Agent Definition Layer (Agent Definition)

Each agent is defined through a configuration file (YAML or a Python dataclass):

agent:
  id: "dev_agent_01"
  role: developer          # roles: analyst | developer | tester | reviewer | human
  model: claude-sonnet-4-6
  system_prompt: "..."
  tools:
    - code_edit
    - file_search
    - run_tests
  memory:
    type: shared            # shared | private (shared memory or private memory)
  interaction_mode: cooperative   # cooperative | competitive
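
Since the text mentions a Python dataclass as an alternative to YAML, the following is a minimal sketch of what that form could look like. The class name AgentConfig, the helper agent_from_dict, the flattened memory_type field, and the default values are assumptions made for illustration; only the field names and the enumerated values come from the YAML above.

```python
from dataclasses import dataclass, field

# Allowed values, taken from the comments in the YAML example.
ROLES = {"analyst", "developer", "tester", "reviewer", "human"}
MEMORY_TYPES = {"shared", "private"}
INTERACTION_MODES = {"cooperative", "competitive"}


@dataclass
class AgentConfig:
    """Hypothetical Python-side mirror of the YAML `agent:` block."""
    id: str
    role: str
    model: str
    system_prompt: str = ""
    tools: list[str] = field(default_factory=list)
    memory_type: str = "private"            # YAML: memory.type (default is an assumption)
    interaction_mode: str = "cooperative"   # default is an assumption

    def __post_init__(self) -> None:
        # Reject values outside the enumerations listed in the YAML comments.
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")
        if self.memory_type not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {self.memory_type!r}")
        if self.interaction_mode not in INTERACTION_MODES:
            raise ValueError(f"unknown interaction mode: {self.interaction_mode!r}")


def agent_from_dict(raw: dict) -> AgentConfig:
    """Build an AgentConfig from the parsed mapping found under `agent:`."""
    return AgentConfig(
        id=raw["id"],
        role=raw["role"],
        model=raw["model"],
        system_prompt=raw.get("system_prompt", ""),
        tools=list(raw.get("tools", [])),
        memory_type=raw.get("memory", {}).get("type", "private"),
        interaction_mode=raw.get("interaction_mode", "cooperative"),
    )
```

Validating in __post_init__ means a misspelled role or mode fails when the scheme is loaded rather than halfway through an experiment run.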

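Example roles:

Role	Responsibility
Analyst (requirements analysis)	Parses natural-language requirements and outputs a formal specification
Developer (development)	Generates or modifies code according to the specification
Reviewer (review)	Reviews code quality and consistency
Tester (testing)	Generates test cases and runs verification
Human Gate	Human intervention checkpoint; can approve, reject, or modify agent output

3.2 Scheme Orchestration Layer (Scheme Orchestrator)

The topology and interaction pattern among agents are described through configuration: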

scheme:
  name: "requirements-dev-human-test"
  topology: pipeline          # pipeline | star | debate | parallel
  steps:
    - agent: analyst
      input: raw_requirement
      output: formal_spec
    - agent: developer
      input: formal_spec
      output: code_patch
    - agent: human_gate
      trigger: always          # always | on_failure | on_low_confidence
      actions: [approve, reject, modify]
    - agent: tester
      input: code_patch
      output: test_result
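
To illustrate how the input and output names in the steps list could chain together, here is a minimal sketch of a pipeline runner. The function run_pipeline, the Handler type, and the rule that a step without input (such as human_gate) reads and overwrites the most recent value are assumptions for illustration, not the platform's actual Orchestrator.

```python
from typing import Callable

# Hypothetical handler type: receives the step's input value, returns its output value.
Handler = Callable[[object], object]


def run_pipeline(steps: list[dict], handlers: dict[str, Handler], initial: dict) -> dict:
    """Run steps in order, passing values by the names given in `input` / `output`."""
    context = dict(initial)            # named values produced so far
    last_key = next(iter(initial))     # name of the most recently produced value
    for step in steps:
        # A step without `input` (e.g. a gate) sees the latest value ...
        in_key = step.get("input", last_key)
        result = handlers[step["agent"]](context[in_key])
        # ... and a step without `output` overwrites that value in place.
        out_key = step.get("output", in_key)
        context[out_key] = result
        last_key = out_key
    return context
```

With the scheme above, the context would hold raw_requirement, formal_spec, code_patch (possibly modified by the gate), and test_result once the run finishes.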

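Supported topologies:

pipeline: a linear pipeline; each step's output is the next step's input
star: a central coordinator dispatches tasks to multiple worker agents
debate: multiple agents propose solutions to the same problem, followed by voting or arbitration
parallel: multiple agents run in parallel and their results are merged
graph: an arbitrary directed graph

3.3 Runtime Engine (Agent Runtime)

Responsibilities:

Agent instantiation and lifecycle management
Message routing (inter-agent communication)
Tool execution (running code, file operations, invoking tests)
End-to-end log collection (consumed by the evaluation layer)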

Task Input
    │
    ▼
Orchestrator.run(scheme, task)
    │
    ├── spawn agents
    ├── route messages per topology
    ├── invoke tools
    ├── check human gate (if configured)
    └── collect metrics at each step
    │
    ▼
Task Output + Metrics Log

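3.4 Inter-Agent Communication (MessageBus)

The topology determines the routing rules, not the communication interface. Agents only call send / receive; the Orchestrator holds the routing table and forwards messages according to the topology, so an agent is unaware of which topology it is part of.

Unified message structure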

from dataclasses import dataclass

@dataclass
class Message:
    id: str
    from_agent: str
    to_agent: str      # unicast; the Orchestrator expands a broadcast into multiple unicasts
    msg_type: str      # task | result | critique | approval | human_input
    content: str
    metadata: dict     # extra information such as round, confidence, milestone

MessageBus implementation

Each agent owns an independent inbox queue, and the Orchestrator is responsible for delivering each message to the correct queue:

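import asyncio

class MessageBus:
    def __init__(self):
        self.queues: dict[str, asyncio.Queue] = {}

    def register(self, agent_id: str) -> None:
        # Create the inbox up front; send() raises KeyError for an unregistered agent
        self.queues.setdefault(agent_id, asyncio.Queue())

    async def send(self, msg: Message) -> None:
        await self.queues[msg.to_agent].put(msg)

    async def receive(self, agent_id: str) -> Message:
        return await self.queues[agent_id].get()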
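
A short usage sketch may make the send / receive contract concrete. It restates Message and MessageBus so that it runs on its own; the register helper, the default for metadata, and the demo coroutine are assumptions added for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Message:
    id: str
    from_agent: str
    to_agent: str
    msg_type: str
    content: str
    metadata: dict = field(default_factory=dict)   # default added for brevity (assumption)


class MessageBus:
    def __init__(self) -> None:
        self.queues: dict[str, asyncio.Queue] = {}

    def register(self, agent_id: str) -> None:
        # Hypothetical helper: create the inbox before anything is sent to it.
        self.queues.setdefault(agent_id, asyncio.Queue())

    async def send(self, msg: Message) -> None:
        await self.queues[msg.to_agent].put(msg)

    async def receive(self, agent_id: str) -> Message:
        return await self.queues[agent_id].get()


async def demo() -> list[str]:
    bus = MessageBus()
    for agent_id in ("analyst", "developer"):
        bus.register(agent_id)

    async def developer() -> str:
        msg = await bus.receive("developer")   # waits until a message arrives
        return f"{msg.from_agent}->{msg.to_agent}: {msg.content}"

    consumer = asyncio.create_task(developer())
    await bus.send(Message("m1", "analyst", "developer", "task", "formal_spec v1"))
    return [await consumer]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The consumer is started before the message is sent to show that receive simply suspends until the Orchestrator (here, the demo itself) delivers something to that inbox.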

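Routing rules per topology

Topologies differ only in how the Orchestrator fills in to_agent; the message format and the queue mechanism are identical:

Topology	Routing rule
pipeline	Forward in order; to_agent is always the next node
star	The Coordinator broadcasts the task to all workers, then merges once every result has been collected
debate	In each round, every agent's output is broadcast to all other agents
parallel	The Orchestrator broadcasts to all agents at once and merges after all results have arrived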
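
The routing table can be read as a single function from (topology, sender) to a list of recipients. The sketch below is one way to express that; the function name recipients, the ORCHESTRATOR id, and the coordinator parameter are hypothetical, and the graph topology is left out because the table does not define a rule for it.

```python
from typing import Optional

ORCHESTRATOR = "orchestrator"   # hypothetical id standing for the Orchestrator itself


def recipients(topology: str, sender: str, agents: list[str],
               coordinator: Optional[str] = None) -> list[str]:
    """Return the to_agent values to fill in for one outgoing message."""
    if topology == "pipeline":
        i = agents.index(sender)
        return agents[i + 1:i + 2]                 # the next node, or [] at the end
    if topology == "star":
        if coordinator is None:
            raise ValueError("star topology needs a coordinator")
        if sender == coordinator:
            return [a for a in agents if a != coordinator]   # broadcast to workers
        return [coordinator]                       # workers report back for merging
    if topology == "debate":
        return [a for a in agents if a != sender]  # everyone else sees the output
    if topology == "parallel":
        if sender == ORCHESTRATOR:
            return list(agents)                    # fan out to every agent at once
        return [ORCHESTRATOR]                      # results return for the merge step
    raise ValueError(f"unsupported topology: {topology!r}")
```

Each returned id becomes one unicast Message, which matches the note that broadcasts are expanded into multiple unicasts.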

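Human Gate as a special agent

The Human Gate implements exactly the same receive / send interface as an ordinary agent. Internally it does not call an LLM; it blocks until a human provides input: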

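class HumanGateAgent:
    agent_id = "human_gate"

    async def run(self, msg: Message) -> Message:
        # Fields shared by both replies; to_agent is left for the Orchestrator to fill in when routing
        common = dict(id=f"{msg.id}-gate", from_agent=self.agent_id, to_agent="", metadata={})
        if not self.should_trigger(msg):          # normally evaluated by the Orchestrator at routing time; kept as a fallback
            return Message(msg_type="approval", content="auto_approved", **common)

        display(msg.content)                       # show to the human (CLI or Web UI)
        action, feedback = await self.wait_for_human_input()  # block until the human responds
        return Message(msg_type=action, content=feedback, **common)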

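The trigger condition is evaluated in the Orchestrator's routing layer: if it is not met, the Human Gate node is skipped, and only if it is met is the message delivered to the gate's inbox. This lets a Human Gate be inserted at any position in any topology without affecting the communication logic of the other agents.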
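
The routing-layer check described above could look roughly like the sketch below. The three trigger names come from the scheme configuration; the function name, the passed key, and the 0.5 confidence threshold are assumptions, since the document does not specify them.

```python
def should_route_to_gate(trigger: str, last_event: dict, threshold: float = 0.5) -> bool:
    """Decide whether the Orchestrator should deliver the message to the Human Gate."""
    if trigger == "always":
        return True
    if trigger == "on_failure":
        # Assumes the previous step recorded an explicit boolean `passed` flag.
        return last_event.get("passed") is False
    if trigger == "on_low_confidence":
        # Uses the self-reported confidence stored in the event metadata.
        confidence = last_event.get("metadata", {}).get("confidence")
        return confidence is not None and confidence < threshold
    raise ValueError(f"unknown trigger: {trigger!r}")
```

When the function returns False, the Orchestrator would route the message straight to the node after the gate.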

3.5 Evaluation Collection Layer (Metrics Collector)

The 23 metrics defined by GQM are computed automatically from the run logs:

Q1 Quality metrics:

M1: Task Success Rate (SWE-bench)
M2: Code Correctness pass@k (HumanEval)
M3: Requirement Formalization Consistency (TVR tuple-level F1 + CPR)
M4: Verification Alignment (STC + ABC)?
M5: Implementation Alignment (Recall/Precision on changed files)

Q2 Collaboration effectiveness:

M6: Milestone KPI = n_j / M
M7: Communication Score (LLM-as-a-judge, 0-5)
M8: Planning Score (LLM-as-a-judge, 0-5)
M9: Collaboration Efficiency = task success / CoordinationCost

Q3 Topology structure:

M10: Topology Efficiency (success-rate comparison across topologies)
M11: Convergence Speed (mean number of rounds to first success)
M12: Scalability (performance change SG across different agent counts)
M13: Structural Coordination Cost (Redundancy + Latency + Cost)

Q4 Interaction mechanisms:

M14: Competitive Success Rate
M15: Strategy Adaptation (ESA)
M16: Conflict Resolution Rate (CRR)
M17: Mechanism Gap (performance difference between cooperation and competition)
M18: Stability (Var(performance))
M19: Robustness (Drop value)

Q5 Human intervention:

M20: Human Intervention Frequency (HIF)
M21: Human Time Cost (HTC)
M22: Human Edit Ratio (HER)
M23: Acceptance Rate (AR + ValueRate)
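4. Collaboration Scheme Design

Scheme A: Requirements → Development → Human → Testing Pipeline

Requirement input → [Analyst] → formal spec
                  → [Developer] → code patch
                  → [Human Gate] → review/modify
                  → [Tester] → test result → loop or finish

Characteristic: human intervention comes right after development, so the most critical artifact, the generated code, is checked by a person before testing.

Scheme B: Competing Developers + Reviewer Arbitration

Requirement input → [Analyst]
                  → [Developer×N] generate N candidate solutions in parallel
                  → [Reviewer] scores and selects the best, or merges
                  → [Tester] → verification

Supports the study of M14 (competitive success rate) and M17 (mechanism gap).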
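
The fan-out and selection steps of Scheme B can be sketched with asyncio.gather. The Developer and Scorer types and the compete function are hypothetical; the sketch covers only the "score and select" branch, not the merge option.

```python
import asyncio
from typing import Awaitable, Callable

Developer = Callable[[str], Awaitable[str]]   # formal spec -> candidate patch
Scorer = Callable[[str], float]               # candidate patch -> reviewer score


async def compete(spec: str, developers: list[Developer], score: Scorer) -> str:
    """Run all developers concurrently and return the highest-scoring candidate."""
    # gather preserves input order, so candidates[i] belongs to developers[i].
    candidates = await asyncio.gather(*(dev(spec) for dev in developers))
    return max(candidates, key=score)
```

Because all candidates are available at the selection point, the same run can also feed pass@k reporting for the competitive scheme.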

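Scheme C: Iterative Refinement via Debate

Requirement input → [Analyst]
                  → [Developer] first draft → [Reviewer] critique
                  → [Developer] revision → [Reviewer] critique again
                  → ... (multiple rounds)
                  → [Human Gate] (optional) → [Tester]

Supports the study of M11 (convergence speed) and M15 (strategy adaptation).

Scheme D: Star Topology with Central Coordination

Requirement input → [Coordinator]
                  ├→ [Dev-Frontend]
                  ├→ [Dev-Backend]
                  └→ [Dev-Database]
                  → Coordinator merges → [Tester]

Supports the study of M10 (topology efficiency) and M12 (scalability).

Scheme E: Single-Agent Baseline

Requirement input → [Single LLM Agent] → code → [Tester]

Serves as the control group (baseline) against which every multi-agent scheme is compared.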

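5. Data Input and Metrics Collection Implementation

5.1 Unified Task Input Format (TaskInput)

Each dataset is converted into the same structure at load time. Agents see only TaskInput and are unaware of which dataset a task came from: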

from dataclasses import dataclass

@dataclass
class TaskInput:
    task_id: str          # unique identifier, e.g. "HumanEval/42" or "MBPP/301"
    description: str      # natural-language problem description (the agent's input)
    source: str           # "humaneval" | "mbpp"

5.2 Dataset Loaders

# HumanEval: load from the official openai/human-eval format
class HumanEvalLoader:
    def load(self) -> list[TaskInput]:
        # Read HumanEval.jsonl
        # Field mapping: prompt→description, entry_point→entry_point, test→tests
        ...

# MBPP: load from the official google-research/mbpp format
class MBPPLoader:
    def load(self, split="test") -> list[TaskInput]:
        # Read mbpp.jsonl; split=test selects 500 problems
        # Field mapping: text→description, test_list→tests (3 per problem)
        ...
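
As a hedged illustration of the field mappings in the loader comments, the sketch below parses JSONL lines into TaskInput objects. Because TaskInput has no entry_point or tests field, the sketch keeps those in a separate dictionary keyed by task_id; that split, the function names, and the ID range used for the MBPP test split (task IDs 11 to 510, as described in the MBPP README) are assumptions of this example rather than the document's design.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskInput:
    task_id: str
    description: str
    source: str


def load_humaneval(lines: Iterable[str]) -> tuple[list[TaskInput], dict[str, dict]]:
    """Map HumanEval rows: prompt -> description; entry_point and test are kept aside."""
    tasks, hidden = [], {}
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        tasks.append(TaskInput(row["task_id"], row["prompt"], "humaneval"))
        hidden[row["task_id"]] = {"entry_point": row["entry_point"], "tests": row["test"]}
    return tasks, hidden


def load_mbpp(lines: Iterable[str], first_id: int = 11,
              last_id: int = 510) -> tuple[list[TaskInput], dict[str, dict]]:
    """Map MBPP rows: text -> description; test_list is kept aside."""
    tasks, hidden = [], {}
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        if not first_id <= row["task_id"] <= last_id:
            continue                       # outside the assumed test split
        task_id = f"MBPP/{row['task_id']}"
        tasks.append(TaskInput(task_id, row["text"], "mbpp"))
        hidden[task_id] = {"tests": row["test_list"]}
    return tasks, hidden
```

A file can be passed directly, for example `with open("HumanEval.jsonl") as f: tasks, hidden = load_humaneval(f)`. Keeping the tests out of TaskInput is one way to make sure agents never see them.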

5.3 Code Execution Sandbox

Agent-generated code must run in an isolated environment to avoid side effects:

Generated code
   │
   ▼
SandboxRunner
   ├── write to a temporary file
   ├── docker run --rm --network none --memory 256m
   │     python -c "<code + tests>"
   ├── capture stdout / stderr / exit code
   └── return ExecutionResult(passed, error, runtime_ms)
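
A minimal sketch of the SandboxRunner flow follows. The docker flags are the ones shown in the diagram; the image name python:3.12-slim, the read-only bind mount, the file name solution.py, and the 30 second timeout are assumptions. The sketch runs the mounted file instead of passing the code through python -c, which avoids shell quoting issues.

```python
import subprocess
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExecutionResult:
    passed: bool
    error: str
    runtime_ms: float


def docker_command(workdir: str, image: str = "python:3.12-slim") -> list[str]:
    """Build the docker invocation; the image and mount point are assumptions."""
    return [
        "docker", "run", "--rm", "--network", "none", "--memory", "256m",
        "-v", f"{workdir}:/work:ro",          # expose the temp dir read-only
        image, "python", "/work/solution.py",
    ]


def run_in_sandbox(code: str, tests: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Write code plus tests to a temp file, run it in a container, capture the outcome."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        start = time.perf_counter()
        try:
            proc = subprocess.run(docker_command(workdir), capture_output=True,
                                  text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Only the docker client is stopped here; container cleanup is not handled.
            return ExecutionResult(False, "timeout", timeout_s * 1000)
        runtime_ms = (time.perf_counter() - start) * 1000
        return ExecutionResult(proc.returncode == 0, proc.stderr, runtime_ms)
```

A non-zero exit code, for example from a failing assert in the tests, maps to passed=False with the traceback in error.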

pass@k computation: for each problem, generate n candidate solutions (n ≥ k), record the number c that pass, and compute:

pass@k = 1 - C(n-c, k) / C(n, k)

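Single-agent schemes, which output one solution per problem, report pass@1 directly; multi-agent and debate schemes naturally produce several candidates and can report pass@k (k = 1, 3, 5).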
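
The formula translates directly into Python with math.comb. The function names below are illustrative; averaging the per-problem values over the dataset is the usual way the benchmark-level figure is reported.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n-c, k) / C(n, k) for one problem with n samples and c passes."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, with n = 5 candidates of which c = 2 pass, pass@1 is 1 - 3/5 = 0.4 and pass@3 is 1 - 1/10 = 0.9.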

5.4 Run Log Structure (EventLog)

The runtime engine writes a structured event at every key point; these events are the data source for all metrics:

{
  "task_id": "MBPP/301",
  "scheme": "debate",
  "round": 2,
  "event": "agent_output",        // agent_output | tool_call | human_action | test_result
  "agent_id": "developer_01",
  "timestamp": 1713500000.123,
  "tokens_in": 512,
  "tokens_out": 384,
  "content": "...",               // agent output text / code
  "metadata": {
    "confidence": 0.82,           // agent's self-reported confidence (used by the on_low_confidence trigger)
    "milestone": "code_generated" // milestone marker, used to compute the M6 KPI
  }
}

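5.5 Metrics Collection Pipeline (MetricsCollector)

After an experiment finishes, all metrics are computed in batch from the EventLog:

EventLog (JSONL file)
   │
   ▼
MetricsCollector.compute(log_path, task_results)
   │
   ├── pass@k          ← aggregated from ExecutionResult
   ├── M6 KPI          ← count of milestones each agent took part in
   ├── M7 CScore       ← sampled messages → LLM-as-a-judge score (0-5)
   ├── M8 PScore       ← same as M7, applied to planning messages
   ├── M9 CollabEff    ← pass@1 / CoordinationCost
   │                      CoordinationCost = α·messages + β·tokens + γ·API calls + δ·time
   ├── M11 CS          ← mean round number at which each task first passes
   ├── M12 SG          ← cross-experiment comparison (rate of change of pass@1 for 1/3/5/7 agents)
   ├── M13 Cost        ← total tokens + API calls + total duration
   ├── M18 Stability   ← Var(pass@1) over repeated runs of the same scheme
   └── M20-M23         ← HIF / HTC / HER / AR computed from human_action events
   │
   ▼
metrics.json (one per experiment, written to the SQLite summary table)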
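
The arithmetic behind M9, M11, and M18 is simple enough to sketch directly. The weights α, β, γ, δ are not specified in this document, so the defaults of 1.0 below are placeholders; using population variance for M18 is likewise an assumption, and the function names are illustrative.

```python
from statistics import mean, pvariance


def coordination_cost(messages: int, tokens: int, api_calls: int, seconds: float,
                      alpha: float = 1.0, beta: float = 1.0,
                      gamma: float = 1.0, delta: float = 1.0) -> float:
    """CoordinationCost = α·messages + β·tokens + γ·API calls + δ·time."""
    return alpha * messages + beta * tokens + gamma * api_calls + delta * seconds


def collaboration_efficiency(pass_at_1: float, cost: float) -> float:
    """M9 sketch: pass@1 divided by CoordinationCost (0.0 if no cost was recorded)."""
    return pass_at_1 / cost if cost > 0 else 0.0


def convergence_speed(first_pass_rounds: list[int]) -> float:
    """M11 sketch: mean round number at which each solved task first passed."""
    return mean(first_pass_rounds)


def stability(pass_at_1_runs: list[float]) -> float:
    """M18 sketch: variance of pass@1 over repeated runs of the same scheme."""
    return pvariance(pass_at_1_runs)
```

Because tokens are typically orders of magnitude larger than message counts, the weights would need to be chosen so that no single term dominates the cost.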

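LLM-as-a-judge implementation (M7/M8): messages exchanged between agents are extracted from the EventLog and sent in batches to a judge LLM, which is kept separate from the LLMs used in the experiment to avoid self-evaluation bias. The judge returns a score from 0 to 5 according to the scoring rubric. To control cost, a random 20% sample of the messages in each experiment is scored.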
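
The 20% sampling step could be made reproducible with a seeded generator, as in the sketch below. The function name, the fixed seed, and the rule of scoring at least one message are assumptions for illustration.

```python
import random


def sample_for_judging(messages: list, fraction: float = 0.2, seed: int = 0) -> list:
    """Pick a reproducible random subset of messages to send to the judge LLM."""
    if not messages:
        return []
    k = max(1, round(len(messages) * fraction))   # always score at least one message
    # A dedicated Random instance keeps sampling independent of global random state.
    return random.Random(seed).sample(messages, k)
```

Recording the seed alongside the experiment makes it possible to re-score exactly the same messages with a different rubric or judge model.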

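5.6 Dataset Specifications

Dataset	Size	Purpose	Metrics
HumanEval (OpenAI, 2021)	164 problems	Function-level code generation, comparable against SOTA	M2 pass@k, M1
MBPP (Google, 2021)	500 problems (test split)	3 unit tests per problem; enough problems to support ablations	M2 pass@k, M1

6. Technology Choices

Module	Choice
Agent framework	In-house (modeled on AutoGen's design, retaining maximum flexibility)
LLM invocation	TBD
Tool execution sandbox	Docker container isolation
Message passing	In-memory queues (single machine)
Metrics storage	SQLite (experiment records) + JSON log
Evaluation scripts	Python, integrated with the official SWE-bench / HumanEval evaluation interfaces
Visualization (optional)	CLI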
