Optimize the expression

2026-05-11 11:28:24 +08:00
parent e56a256e79
commit 9ffd808628
1 changed files with 9 additions and 11 deletions
@@ -33,9 +33,9 @@ Agentic AI---where large language models are embedded in autonomous loops capabl
 AI tools have been part of the software engineer's toolkit for years---code completion, static analysis, defect prediction---but they have always operated in a supporting role. The developer decides; the tool assists. What has changed recently is the emergence of systems where that division no longer holds so cleanly. Under the label of \emph{agentic AI}, large language models (LLMs) are now embedded in execution loops that let them perceive their environment, make plans, call external tools, and update their behaviour in response to feedback, all without a human directing each step \cite{abuali2025agentic, wang2024survey}.
-For software system design, this is not just an incremental improvement. Classical architectures assume a clear boundary between human intent and machine execution. Agentic systems complicate that picture: a single agent, or a group of them, can in principle draft requirements, generate and test code, run a code review, and refactor a module---cycling through these phases without waiting for a human to issue each command \cite{jin2024llmagents}. How to structure such systems, how to get them to plan reliably, how to coordinate multiple agents, and how to measure any of this is now an open engineering and research problem.
+For software system design, this shift is not merely incremental---it represents a structural reorientation of the human--machine relationship that classical software architectures did not anticipate. Those architectures draw a sharp boundary between human intent and machine execution: the engineer specifies; the tool executes within tightly scoped preconditions. Agentic systems dissolve that boundary: a single agent, or a coordinated ensemble, can in principle traverse the entire software development lifecycle autonomously---eliciting and formalising requirements, synthesising and compiling code, executing regression suites, and performing static analysis---cycling through these phases in a planning-execution-reflection loop without requiring a human to issue each intermediate command \cite{jin2024llmagents, wang2025aiagenticprogrammingsurvey}. How to architect such systems for reliability, how to coordinate specialised agents without incurring prohibitive inter-agent communication overhead, and how to evaluate their outputs against standards that extend beyond task completion rate are now simultaneously open engineering and research problems.
-This survey reviews recent literature on agentic AI system design across four areas: foundational architectures and taxonomies; multi-agent frameworks and coordination; applications to concrete software engineering tasks; and the internal planning, reasoning, and tool-use mechanisms that agents depend on. A critical analysis of limitations and an outline of future directions follow.
+This survey interrogates how agentic AI systems are designed, evaluated, and coordinated, tracing four mutually reinforcing threads through the literature: foundational taxonomies that partition the design space between symbolic and neural paradigms; coordination mechanisms that emerge when multiple specialised agents are composed into teams; the empirical record of deploying agents across the software engineering lifecycle from requirements elicitation to post-deployment maintenance; and the internal planning, reasoning, and tool-use loops that determine whether an agent can sustain coherent behaviour over extended task horizons. A critical examination of persistent limitations---including hallucination propagation in multi-step execution, evaluation fragmentation across the lifecycle, and the near-absence of governance frameworks in published architectures---and a structured analysis of promising future directions conclude the review.
 The 13 primary papers span 2023--2026, sourced from IEEE Xplore, the ACM Digital Library, and arXiv. All are peer-reviewed conference or journal papers, or preprints with documented subsequent journal acceptance.
@@ -48,11 +48,7 @@ A systematic search was conducted across IEEE Xplore, the ACM Digital Library, a
 \emph{AI agent planning reasoning tool use};
 \emph{autonomous software development benchmark}.
-\textbf{Inclusion criteria:} (i) published or submitted after January 2023; (ii) directly addresses the architecture, capabilities, or evaluation of agentic AI systems in a software design or software engineering context; (iii) available as a full paper.
+Papers were retained when they satisfied three jointly necessary conditions: publication or submission no earlier than January 2023, substantive engagement with the architecture, capabilities, or evaluation of agentic AI systems within a software design or engineering context, and availability as a complete, citable document. The recency threshold reflects the rapid architectural evolution of transformer-based agent frameworks following the widespread deployment of instruction-tuned LLMs at scale---a development that renders most pre-2023 literature structurally distinct in its foundational assumptions about what agents can perceive, plan, and execute \cite{arunkumar2026architectures}. Excluded were studies whose scope was confined to narrow natural language processing tasks without software engineering application, as well as papers whose primary contribution was a novel pre-training methodology rather than an agentic system design; this boundary proved consequential in practice, as the pre-training and agent-deployment literatures have largely evolved in parallel with limited cross-citation. The initial search returned over 200 candidates; after de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.
 \textbf{Exclusion criteria:} (i) work focused exclusively on narrow NLP tasks without a software engineering application; (ii) papers whose primary contribution is a new LLM pre-training method rather than an agentic system design.
 The initial search returned over 200 candidates. After de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.
 \textbf{Use of AI-assisted tools.} DeepSeek was used as a supplementary aid for literature organisation and error checking in accordance with the course guidelines. All paper selection, critical analysis, and editorial judgement are the author's own.
@@ -155,7 +151,9 @@ Compared to where AI-assisted software engineering stood five years ago, the pro
 \subsection{Comparing Approaches}
-The single-agent versus multi-agent debate is not settled, and the disagreement is partly a measurement artefact. Masterman et al.\ \cite{masterman2024landscape} show that a single agent with strong reflection is competitive with multi-agent systems on a range of benchmarks and considerably easier to debug. He et al.\ \cite{ishibashi2024multiagent} and Rajendran et al.\ \cite{ieee2025multiagent} push back, arguing that specialisation in multi-agent systems produces better results on complex long-horizon tasks. Both positions are defensible given the benchmarks each paper uses; papers advocating multi-agent systems consistently evaluate on more complex tasks. Until there is a benchmark that varies task complexity as a controlled dimension, the debate will continue to generate more heat than light.
+The debate over single-agent versus multi-agent architectures remains unresolved, with the divergence stemming as much from methodological asymmetry as from genuine differences in architectural capability. Masterman et al.\ \cite{masterman2024landscape} advance the case for single-agent sufficiency: their evaluation demonstrates that an agent equipped with a complete planning-execution-reflection loop achieves performance competitive with multi-agent ensembles while incurring substantially lower coordination overhead. Their key empirical observation is that omitting the reflection phase produces characteristic brittleness, causing agents to commit to subtly wrong plans without course-correcting; this suggests that architectural completeness within a single agent may substitute for specialisation distributed across an agent team. Park et al.\ \cite{park2023generative} reinforce this interpretation through their 25-agent simulation: coherent long-horizon behaviour emerges only when memory retrieval, reflection, and planning are instantiated jointly, with any two-component subset producing noticeably degraded outcomes. This non-additive interaction pattern runs counter to the assumption that each mechanism contributes independently.
 Against this, He, Treude, and Lo \cite{ishibashi2024multiagent} argue that for tasks requiring concurrent exploration of disjoint state spaces, the sequential planning bottleneck inherent to single-agent designs becomes the binding constraint regardless of how refined each architectural component is. Rajendran et al.\ \cite{ieee2025multiagent} operationalise this advantage through an auction-based task allocation protocol in which competing agent bids surface decomposition conflicts before they propagate through the execution graph---a coordination mechanism without a natural single-agent analogue, and one whose benefit is most visible precisely on the compositionally complex tasks that single-agent evaluations tend to exclude. The experimental record is therefore difficult to reconcile on a common footing: multi-agent papers systematically evaluate on tasks of greater compositional depth, confounding architectural comparison with task difficulty. What the available data do consistently demonstrate, as SALLMA's infrastructure analysis makes explicit \cite{sallma2025}, is that coordination overhead scales superlinearly with agent count---an empirical ceiling on the multi-agent advantage that becomes binding faster than the optimistic framing of distributed architectures typically acknowledges, and one that no current framework has credibly resolved.
 \section{Future Directions}
@@ -171,11 +169,11 @@ The single-agent versus multi-agent debate is not settled, and the disagreement
 \section{Conclusion}
-The 13 papers reviewed here cover roughly three years of a field that has been moving quickly. The picture that emerges is genuinely mixed. On the positive side: open-source frameworks \cite{derouiche2025frameworks} are actively deployed, SWE-bench \cite{jimenez2024swebench} provides a shared empirical reference point, and the architectural vocabulary for memory, planning, and multi-agent coordination is now stable enough for careful comparison. That is more than could be said in 2022.
+The 13 papers reviewed here span roughly three years of a field advancing rapidly, and the picture that emerges is genuinely mixed. Open-source frameworks \cite{derouiche2025frameworks} that implement the architectural patterns described in this survey are actively deployed in production settings, and SWE-bench \cite{jimenez2024swebench} supplies a shared empirical reference point against which the community has begun to converge. Only since the instruction-tuning era has it become feasible to treat \emph{hallucination propagation}, \emph{context window saturation}, and \emph{prompt injection} as first-class engineering parameters to be managed rather than properties that categorically preclude deployment. The architectural vocabulary for memory, planning, and multi-agent coordination, which together constitute what practitioners now call the agent loop, is stable enough for principled comparison. Natural language processing (NLP) research alone could not have delivered this stability without the complementary advances in agent scaffolding documented across this literature.
-On the other hand, the gaps are not minor. Hallucination in agentic loops is harder to catch and recover from than in standalone LLM usage. Evaluation practices outside patch generation remain fragmented, which means many performance claims in the literature rest on shaky ground. Security and governance are essentially absent from the architectural proposals, which will matter increasingly as these systems acquire more capabilities and broader access. And the kind of long-horizon, project-scale autonomy that would constitute a genuine shift in how software is built has not been demonstrated convincingly.
+On the other hand, the limitations identified in this survey are not peripheral details. Hallucination in agentic execution loops is qualitatively more dangerous than in single-turn generation: a tool-call chain that fails at step twenty-eight may leave a codebase in a partially modified state that requires forensic inspection to diagnose, with no straightforward rollback if the agent did not maintain a structured execution log. Nowhere in the surveyed architectural proposals do security and governance appear as first-class design concerns, despite documented demonstrations of adversarial prompt injection against deployed agentic systems \cite{abuali2025agentic}. Evaluation practices outside patch generation remain fragmented, meaning most cross-paper performance comparisons rest on incommensurable baselines, and the long-horizon project-scale autonomy that would represent a genuine shift in software development practice has not been convincingly demonstrated at scale.
-For practitioners adopting agentic AI today, the implication is not to wait, but to be deliberate: design for human oversight, invest in evaluation infrastructure, and treat the agent as an architectural component with the same quality requirements---reliability, security, maintainability---as anything else in the system \cite{sallma2025}. The technology is real; the engineering discipline around it is still catching up.
+For practitioners adopting agentic AI today, the implication is not to wait, but to proceed with deliberate architecture: designing for human oversight at each agent decision boundary, investing in evaluation infrastructure that spans the full software engineering lifecycle, and treating the agent as a system component subject to the same quality attributes (reliability, security, maintainability) that govern every other element of the stack \cite{sallma2025}. The trajectory from sub-2\% to over 50\% on SWE-bench across two years is a credible signal that further progress is achievable. The architectural vocabulary surveyed here supplies a sufficient foundation for principled system design, provided one resists conflating benchmark performance with operational dependability. The field's remaining requirements (formal threat models, lifecycle-spanning evaluation standards, and deployable governance frameworks) are not peripheral research interests but engineering preconditions for the dependable deployment that its own architectural ambitions implicitly demand.
 \bibliographystyle{IEEEtran}
 \bibliography{references}