Compare commits: 7109540e18...master (6 commits)

| SHA1 |
|---|
| 0a1fee94a8 |
| 9ffd808628 |
| e56a256e79 |
| 44d8452836 |
| 12d8fd5de2 |
| 6252756893 |
@@ -1,24 +1,17 @@
% Submission filename: JC3506_A1_<Surname>_<FirstName>_<StudentID>.pdf
% Course: JC3506 Software Design and Implementation
% Assessment: Individual Study - Systematic Literature Survey
% Topic: Software System Design with Agentic AI

\documentclass[manuscript, anonymous=false]{acmart}

%% Force symmetric margins (override acmart's twoside default)
\geometry{twoside=false, left=2.5cm, right=2.5cm, top=2.5cm, bottom=2.5cm}

%% ACM rights / metadata - left blank for student submission
\setcopyright{none}
\acmDOI{}
\acmISBN{}
\acmConference[JC3506]{Software Design and Implementation}{2025--2026}{University of Aberdeen}
\settopmatter{printacmref=false}
\renewcommand\footnotetextcopyrightpermission[1]{}

%% Additional packages (acmart already loads hyperref, natbib, geometry)
\usepackage{booktabs}
\usepackage{float}

% -------------------------------------------------------
\begin{document}

\title{Software System Design with Agentic AI: A Systematic Literature Survey}
@@ -31,25 +24,23 @@
\email{u28sc22@abdn.ac.uk}

\begin{abstract}
Agentic AI systems---large language models embedded within autonomous execution loops that perceive, plan, invoke tools, and revise behaviour---are reshaping how software is designed and built. This paper presents a systematic literature survey of 13 peer-reviewed and widely cited papers (2023--2026) on the design of software systems incorporating agentic AI. The survey organises findings into four themes: foundational architectures and taxonomies, multi-agent frameworks and coordination, applications across the software engineering lifecycle, and planning/reasoning/tool-use mechanisms. A critical analysis identifies hallucination and reliability, evaluation fragmentation, coordination scalability, and governance as the principal open challenges. Future directions include hybrid neuro-symbolic architectures, lifecycle-spanning benchmarks, persistent long-horizon memory, and principled human-agent collaboration models.
Agentic AI---where large language models are embedded in autonomous loops capable of perceiving inputs, forming plans, calling external tools, and revising their own behaviour---has moved from a research curiosity to something practitioners are actively deploying. This paper surveys 13 peer-reviewed and widely cited papers from 2023--2026 on how these systems are being designed and where they fall short. Four themes structure the review: foundational architectures and taxonomies; multi-agent frameworks and coordination; applications across the software engineering lifecycle; and the planning, reasoning, and tool-use mechanisms that make agents tick. The analysis surfaces five persistent open problems: hallucination and reliability, fragmented evaluation practices, coordination overhead at scale, context window constraints, and an almost complete absence of governance frameworks. Several directions look promising---hybrid neuro-symbolic designs, lifecycle-spanning benchmarks, long-horizon persistent memory---but the path from current demonstrations to dependable practice remains considerably longer than recent benchmark numbers suggest.
\end{abstract}

\keywords{agentic AI, software system design, large language models, multi-agent systems, autonomous software engineering}

\maketitle

% -------------------------------------------------------
\section{Introduction}

Artificial intelligence has long been applied to software engineering in forms that assist but do not act: code completion tools, static analysers, and defect predictors all augment a human developer without replacing their judgment. A qualitatively different model has now emerged under the label of \emph{agentic AI}, in which a large language model (LLM) is embedded within an autonomous execution loop that can perceive its environment, form plans, invoke external tools, and revise its behaviour based on feedback---all without step-by-step human direction \cite{schmidgall2024agentic, wang2024survey}.
AI tools have been part of the software engineer's toolkit for years---code completion, static analysis, defect prediction---but they have always operated in a supporting role. The developer decides; the tool assists. What has changed recently is the emergence of systems where that division no longer holds so cleanly. Under the label of \emph{agentic AI}, large language models (LLMs) are now embedded in execution loops that let them perceive their environment, make plans, call external tools, and update their behaviour in response to feedback, all without a human directing each step \cite{abuali2025agentic, wang2024survey}.
This shift carries profound implications for software system design. Classical software architecture treats the system boundary as a fixed interface between human intent and machine execution. Agentic systems dissolve that boundary: a single agent or a team of collaborating agents can now draft requirements, generate and test code, perform code review, and refactor modules in a continuous loop \cite{liu2024llmagents}. The design of such systems---how agents are structured, how they plan, how they share state, and how they are evaluated---has itself become an active research area.
For software system design, this shift is structural rather than incremental: it reorients a human--machine relationship that classical software architectures did not anticipate. Those architectures draw a sharp boundary between human intent and machine execution: the engineer specifies; the tool executes within tightly scoped preconditions. Agentic systems dissolve that boundary. A single agent, or a coordinated ensemble, can in principle traverse the entire software development lifecycle autonomously (eliciting and formalising requirements, synthesising and compiling code, executing regression suites, and performing static analysis), cycling through these phases in a planning-execution-reflection loop without requiring a human to issue each intermediate command \cite{jin2024llmagents, wang2025aiagenticprogrammingsurvey}. How to architect such systems for reliability, how to coordinate specialised agents without incurring prohibitive inter-agent communication overhead, and how to evaluate their outputs against standards that extend beyond task completion rate are open problems for both engineering practice and research.
This survey provides a structured review of recent literature on the design of software systems that incorporate or consist of agentic AI. The review covers four interlocking themes: (1) foundational architectures and taxonomies of agentic systems; (2) multi-agent frameworks and coordination mechanisms; (3) the application of agentic AI to concrete software engineering tasks; and (4) the reasoning, planning, and tool-use capabilities that underpin agent behaviour. The survey closes with a critical analysis of current limitations and a discussion of open research directions.
This survey interrogates how agentic AI systems are designed, evaluated, and coordinated, tracing four mutually reinforcing threads through the literature: foundational taxonomies that partition the design space between symbolic and neural paradigms; coordination mechanisms that emerge when multiple specialised agents are composed into teams; the empirical record of deploying agents across the software engineering lifecycle from requirements elicitation to post-deployment maintenance; and the internal planning, reasoning, and tool-use loops that determine whether an agent can sustain coherent behaviour over extended task horizons. A critical examination of persistent limitations---including hallucination propagation in multi-step execution, evaluation fragmentation across the lifecycle, and the near-absence of governance frameworks in published architectures---and a structured analysis of promising future directions conclude the review.
The selected literature spans 2023--2026, drawn primarily from IEEE Xplore, the ACM Digital Library, and arXiv. All 13 primary sources are peer-reviewed conference or journal papers, or widely cited preprints with subsequent journal acceptance.
The 13 primary papers span 2023--2026, sourced from IEEE Xplore, the ACM Digital Library, and arXiv. All are peer-reviewed conference or journal papers, or preprints with documented subsequent journal acceptance.

% -------------------------------------------------------
\section{Research Methodology}

A systematic search was conducted across IEEE Xplore, the ACM Digital Library, arXiv, and Google Scholar using the following keyword combinations:
@@ -59,13 +50,10 @@ A systematic search was conducted across IEEE Xplore, the ACM Digital Library, a
\emph{AI agent planning reasoning tool use};
\emph{autonomous software development benchmark}.

\textbf{Inclusion criteria:} (i) published or submitted after January 2023; (ii) directly addresses the architecture, capabilities, or evaluation of agentic AI systems in a software design or software engineering context; (iii) available as a full paper.
Papers were retained when they satisfied three jointly necessary conditions: publication or submission no earlier than January 2023, substantive engagement with the architecture, capabilities, or evaluation of agentic AI systems within a software design or engineering context, and availability as a complete, citable document. The recency threshold reflects the rapid architectural evolution of transformer-based agent frameworks following the widespread deployment of instruction-tuned LLMs at scale---a development that renders most pre-2023 literature structurally distinct in its foundational assumptions about what agents can perceive, plan, and execute \cite{arunkumar2026architectures}. Excluded were studies whose scope was confined to narrow natural language processing tasks without software engineering application, as well as papers whose primary contribution was a novel pre-training methodology rather than an agentic system design; this boundary proved consequential in practice, as the pre-training and agent-deployment literatures have largely evolved in parallel with limited cross-citation. The initial search returned over 200 candidates; after de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.
\textbf{Exclusion criteria:} (i) work focused exclusively on narrow NLP tasks without a software engineering application; (ii) papers whose primary contribution is a new LLM pre-training method rather than an agentic system design.
\textbf{Use of AI-assisted tools.} DeepSeek was used as a supplementary aid for literature organisation and error checking in accordance with the course guidelines. All paper selection, critical analysis, and editorial judgement are the author's own.
The initial search returned over 200 candidates. After de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.

% -------------------------------------------------------
\section{Thematic Overview}
\label{sec:themes}

@@ -79,7 +67,7 @@ The 13 selected papers are grouped into four thematic clusters in Table~\ref{tab
\textbf{Theme} & \textbf{Papers} & \textbf{Core focus} \\
\midrule
Foundations \& Architectures
& \cite{schmidgall2024agentic, wang2024survey, sun2026architectures, sun2025frameworks}
& \cite{abuali2025agentic, wang2024survey, arunkumar2026architectures, derouiche2025frameworks}
& Taxonomies, paradigms, and framework comparisons \\
\addlinespace
Multi-Agent Systems
@@ -87,11 +75,11 @@ Multi-Agent Systems
& Coordination, communication, and architectural patterns for agent teams \\
\addlinespace
SE Applications
& \cite{liu2024llmagents, yang2024llmse, jimenez2024swebench}
& \cite{jin2024llmagents, liu2024llmse, jimenez2024swebench}
& Applying agents to requirements, code generation, testing, and maintenance \\
\addlinespace
Planning, Reasoning \& Tool Use
& \cite{masterman2024landscape, park2023generative, chen2025agentic}
& \cite{masterman2024landscape, park2023generative, wang2025aiagenticprogrammingsurvey}
& Internal cognitive mechanisms and execution loops \\
\bottomrule
\end{tabular}
@@ -99,100 +87,97 @@ Planning, Reasoning \& Tool Use

\subsection{Foundations and Architectures of Agentic AI Systems}

The foundational literature establishes the conceptual vocabulary and architectural patterns that the rest of the field builds upon. Schmidgall and Dornaika \cite{schmidgall2024agentic} introduce a \emph{dual-paradigm} framework that separates \emph{symbolic/classical} agents (relying on deterministic planning and persistent state machines) from \emph{neural/generative} agents (driven by stochastic generation and prompt-based orchestration). Wang et al.\ \cite{wang2024survey} propose a unified architectural model centred on three sub-systems: a \emph{brain} (the LLM), a \emph{perception} module, and an \emph{action} module. Sun et al.\ \cite{sun2026architectures} extend this by decomposing the brain into Planning, Reasoning, and Memory components. The framework survey by Sun et al.\ \cite{sun2025frameworks} maps these abstractions onto concrete open-source frameworks---AutoGen, LangGraph, CrewAI, and MetaGPT---analysing their design trade-offs.
The conceptual vocabulary of the field largely comes from four papers. Abou Ali and Dornaika \cite{abuali2025agentic} draw a line between \emph{symbolic/classical} agents, which rely on deterministic planners and explicit state machines, and \emph{neural/generative} agents driven by stochastic generation and prompt-based orchestration. This distinction, which the authors call a dual-paradigm framework, is practically useful: the two families have different failure modes and suit different deployment contexts. Wang et al.\ \cite{wang2024survey} take a more component-oriented approach, proposing a unified model with three sub-systems: a \emph{brain} (the LLM itself), a \emph{perception} module, and an \emph{action} module. Arunkumar et al.\ \cite{arunkumar2026architectures} refine this by splitting the brain into Planning, Reasoning, and Memory sub-components. Derouiche et al.\ \cite{derouiche2025frameworks} then ground these abstractions in practice, mapping them to AutoGen \cite{autogendocs}, LangGraph \cite{langgraphdocs}, CrewAI \cite{crewaidocs}, and MetaGPT, and comparing their engineering trade-offs.

\subsection{Multi-Agent Frameworks and Coordination}

Once individual agent architectures are established, a natural extension is composing multiple agents into collaborative systems. He, Treude, and Lo \cite{ishibashi2024multiagent} provide a literature review of LLM-based multi-agent (LMA) systems within the software development lifecycle, identifying coordination and trust challenges that arise when agents take on specialised roles. Rajendran et al.\ \cite{ieee2025multiagent} present a conceptual framework for software design and refactoring using auction-based task allocation and consensus protocols to manage agent disagreement. Becattini, Verdecchia, and Vicario \cite{sallma2025} address the architectural layer directly with SALLMA, a reference software architecture that specifies interfaces, shared state management, and real-time agent communication.

\subsection{Tool Use, Planning, and Reasoning}

The internal mechanisms that allow agents to decompose goals and invoke external resources are surveyed by Masterman et al.\ \cite{masterman2024landscape} and Chen et al.\ \cite{chen2025agentic}. Masterman et al.\ examine single-agent and multi-agent implementations and identify three critical phases---\emph{planning}, \emph{execution}, and \emph{reflection}---present in robust systems. Chen et al.\ focus on \emph{agentic programming} as an emerging paradigm in which agents autonomously iterate on a task. Park et al.\ \cite{park2023generative} provide a foundational empirical study demonstrating that architectures combining memory retrieval, reflection, and planning can produce coherent long-horizon behaviour.
Composing multiple agents introduces challenges that single-agent designs sidestep. He, Treude, and Lo \cite{ishibashi2024multiagent} survey LLM-based multi-agent (LMA) systems across the software development lifecycle and find that coordination and trust become the dominant concerns once agents take on specialised roles---more so than raw capability. Rajendran et al.\ \cite{ieee2025multiagent} propose a conceptual framework for software design and refactoring that handles this through auction-based task allocation and consensus protocols. The idea is that competing bids surface disagreement early rather than letting conflicting outputs propagate. SALLMA \cite{sallma2025}, from Becattini, Verdecchia, and Vicario, sits at a lower level of abstraction: it is a reference software architecture that specifies concrete interfaces for shared state and real-time agent communication, treating the multi-agent system as something an architect would actually need to deploy and maintain.

\subsection{Applications in Software Engineering}

Three papers evaluate agentic AI directly against software engineering tasks. Jin et al.\ \cite{liu2024llmagents} conduct a broad survey covering six SE domains, establishing clear distinctions between standalone LLMs and agent-based systems in terms of autonomy and self-improvement. Liu et al.\ \cite{yang2024llmse} categorise 124 papers from both the SE and agent-capability perspectives, showing that tool-augmented agents consistently outperform standalone models. Jimenez et al.\ \cite{jimenez2024swebench} introduce SWE-bench, a benchmark of 2,294 real-world GitHub issues drawn from 12 Python repositories, providing the field's most widely used empirical measuring stick.
Jin et al.\ \cite{jin2024llmagents} survey six SE lifecycle domains and find meaningful differences between bare LLM prompting and agent-based approaches, particularly in tasks requiring iterative refinement. Liu et al.\ \cite{liu2024llmse} take a wider lens, categorising 124 papers and noting that tool augmentation and persistent memory are the two capability additions that most consistently improve results---more so than switching to a larger model. The most direct empirical reference point is SWE-bench \cite{jimenez2024swebench}: 2,294 real GitHub issues across 12 Python repositories, each requiring a multi-file patch that passes the existing test suite. It is not a gentle benchmark.

\subsection{Tool Use, Planning, and Reasoning}

Masterman et al.\ \cite{masterman2024landscape} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} examine how agents actually decompose goals and make use of external tools. Masterman et al.\ identify a three-phase loop---\emph{planning}, \emph{execution}, and \emph{reflection}---and observe that omitting the reflection phase makes systems noticeably more brittle. Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} frame this as \emph{agentic programming}: the LLM writes code, runs it, reads the output, and revises, much like a developer iterating in a REPL. Park et al.\ \cite{park2023generative} supply the empirical underpinning: a 25-agent simulation showing that combining memory retrieval, reflection, and planning produces coherent behaviour over extended time horizons in a way that any one mechanism alone does not.

% -------------------------------------------------------
\section{Detailed Discussion}

\subsection{Foundations and Architectures}

The dual-paradigm framework of Schmidgall and Dornaika \cite{schmidgall2024agentic} resolves a persistent ambiguity: earlier surveys grouped rule-based planners from the 1980s with modern LLM-driven agents, obscuring fundamental differences in uncertainty handling and knowledge representation. By separating symbolic and neural lineages, the authors provide a principled basis for architectural selection. Their analysis of 90 studies (2018--2025) shows that symbolic agents dominate safety-critical settings where determinism and formal verification are required, while neural agents prevail in adaptive, data-rich environments.
One of the more useful contributions of Abou Ali and Dornaika \cite{abuali2025agentic} is simply drawing a cleaner boundary. Earlier survey work tended to lump 1980s rule-based planners together with modern LLM-driven agents, which made it hard to reason about failure modes or architectural selection. Their dual-paradigm split---symbolic versus neural---gives practitioners a basis for that choice. Reviewing 90 studies from 2018--2025, the authors find that symbolic agents still dominate settings where determinism and formal verification matter, while neural agents have taken over wherever adaptability to messy, data-rich inputs is more important than guarantees.
Wang et al.\ \cite{wang2024survey} complement this with component-level analysis. Their architecture positions the LLM as a central reasoning engine. Memory is divided into \emph{in-context} (working) memory and \emph{external} memory (vector databases, knowledge graphs)---a distinction with direct engineering implications: in-context memory is bounded by the model's context window, while external memory scales arbitrarily but introduces retrieval latency and recall errors.
Wang et al.\ \cite{wang2024survey} are less interested in lineage and more in components. Their architecture places the LLM at the centre as a reasoning engine, flanked by a perception module and an action module. The memory treatment is worth noting: they separate \emph{in-context} (working) memory, which is fast but bounded by the model's context window, from \emph{external} memory stored in vector databases or knowledge graphs. The second type scales to arbitrary size, but every retrieval is a potential source of latency and recall error---a trade-off that does not disappear as hardware improves.
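The trade-off between the two memory types can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the architecture of Wang et al.: the class \texttt{AgentMemory}, the item-count limit standing in for a token-based context window, and the keyword-overlap retrieval (where a deployed system would more plausibly query a vector database) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical two-tier memory: bounded working memory plus an external store."""

    context_limit: int = 3  # assumption: counts notes, not tokens
    working: deque = field(default_factory=deque)  # in-context (working) memory
    external: list = field(default_factory=list)   # stand-in for a vector DB or knowledge graph

    def remember(self, note: str) -> None:
        """Add a note; the oldest notes spill to the external store once the limit is hit."""
        self.working.append(note)
        while len(self.working) > self.context_limit:
            self.external.append(self.working.popleft())

    def recall(self, query: str, k: int = 2) -> list:
        """Naive keyword-overlap retrieval over the external store (illustrative only)."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(note.lower().split())), note) for note in self.external]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in hits[:k]]
```

In this sketch a query that shares no words with any archived note returns an empty list, which is a simple instance of the recall error discussed above: the information exists in external memory but the retrieval step fails to surface it.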
Sun et al.\ \cite{sun2026architectures} extend the taxonomy to evaluation, arguing that agents should be assessed across all five architectural layers rather than solely by task completion rate. The authors document how early agent loops such as ReAct adopted flat sequential structures, while more recent designs use hierarchical search and recursive decomposition for non-linear problem solving. The framework comparison in Sun et al.\ \cite{sun2025frameworks} translates these abstractions into engineering decisions: LangGraph's graph-based execution model supports stateful, cyclical workflows, whereas CrewAI prioritises ease of configuration for role-based pipelines.
Arunkumar et al.\ \cite{arunkumar2026architectures} push the taxonomy toward evaluation, arguing that task completion rate is too coarse a metric if the goal is to understand which architectural layer is actually failing. Their historical account of agent loop evolution is useful context: early designs like ReAct used flat sequential structures that are easy to implement but poor at backtracking, while more recent systems use hierarchical search and recursive decomposition to handle non-linear problem solving. Derouiche et al.\ \cite{derouiche2025frameworks} then connect these design choices to framework selection: LangGraph's \cite{langgraphdocs} graph-based execution model handles stateful, cyclical workflows well, while CrewAI \cite{crewaidocs} is easier to configure when the main requirement is a straightforward role-based pipeline.

\subsection{Multi-Agent Frameworks}

He, Treude, and Lo \cite{ishibashi2024multiagent} identify a key architectural tension: specialisation versus coordination overhead. Highly specialised agents achieve higher domain quality but require robust inter-agent communication protocols to resolve conflicts. The authors propose a research agenda centred on improving individual agent capabilities while simultaneously optimising the collaboration layer.
The central tension He, Treude, and Lo \cite{ishibashi2024multiagent} identify is not surprising in retrospect: the more specialised the agents become, the more inter-agent communication is needed to stop them from producing conflicting outputs. Their proposed research agenda, which is to improve individual capability and coordination simultaneously, is reasonable, though it somewhat sidesteps the question of how to prioritise between the two when resources are constrained.
Rajendran et al.\ \cite{ieee2025multiagent} operationalise this in a conceptual framework targeting software design and refactoring. Their system decomposes a change request into subtasks auctioned among specialised agents; a consensus protocol arbitrates conflicting outputs. SALLMA \cite{sallma2025} operates at a lower level of abstraction, separating agent logic from infrastructure concerns and prescribing both relational databases for structured metadata and NoSQL stores for unstructured agent memory. By formalising the architecture, SALLMA enables the application of standard software quality attributes---availability, scalability, maintainability---to agentic systems.
Rajendran et al.\ \cite{ieee2025multiagent} try to operationalise coordination. Their framework decomposes a change request into subtasks and auctions them to specialised agents; when outputs conflict, a consensus protocol arbitrates. Whether auction-based allocation actually beats simpler assignment strategies in practice is not empirically established in the paper, which remains conceptual. SALLMA \cite{sallma2025} is more concrete. By separating agent logic from infrastructure and prescribing relational databases for structured metadata alongside NoSQL stores for unstructured agent memory, it treats multi-agent systems as something that has to be operated, not just designed. This framing---applying standard quality attributes like availability and maintainability to agentic systems---is one of the more practically grounded contributions in the surveyed literature.
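The allocate-then-arbitrate pattern can be sketched in a few lines of Python. Because the framework of Rajendran et al.\ is conceptual, the concrete rules below are assumptions made for illustration only: each subtask goes to the lowest-cost bidder, and conflicting outputs are settled by a strict majority vote. The function names and the example agents are hypothetical.

```python
from collections import Counter


def allocate_by_auction(subtasks, bids):
    """Assign each subtask to the agent with the lowest bid.

    bids maps agent name -> {subtask: cost}. Subtasks with no bids stay unassigned.
    The lowest-cost rule is an assumption, not taken from the surveyed paper.
    """
    allocation = {}
    for task in subtasks:
        offers = {agent: costs[task] for agent, costs in bids.items() if task in costs}
        if offers:
            allocation[task] = min(offers, key=offers.get)
    return allocation


def majority_consensus(proposals):
    """Return the output backed by a strict majority of agents, or None if there is none.

    proposals maps agent name -> proposed output (any hashable value).
    """
    if not proposals:
        return None
    output, votes = Counter(proposals.values()).most_common(1)[0]
    return output if votes > len(proposals) / 2 else None
```

A subtask that attracts no bid is simply left out of the allocation, and a split vote returns \texttt{None}; in both cases a real system would need an escalation path, for example to a human reviewer, which the sketch leaves open.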

\subsection{Software Engineering Applications}

Jin et al.\ \cite{liu2024llmagents} survey six SE lifecycle domains. In requirements engineering, agents can elicit requirements through dialogue and generate formal specifications. In code generation, agent-based approaches outperform standalone LLM prompting by iterating on failing test cases. In software design, agents are used to generate class diagrams, API contracts, and architecture documentation. The survey concludes that the field lacks unified evaluation standards, making cross-paper comparison difficult.
Jin et al.\ \cite{jin2024llmagents} cover six SE lifecycle domains, and the picture that emerges is uneven. Requirements engineering and documentation generation look relatively tractable; the gap between agent-based and standalone LLM performance in code generation is real but narrows as tasks become more self-contained. The survey's honest conclusion---that the field lacks unified evaluation standards---means most of these comparisons rest on heterogeneous benchmarks and cannot be taken at face value.
Liu et al.\ \cite{yang2024llmse} address this gap by analysing 124 papers from both SE-task and agent-capability perspectives. They find that the most impactful capability additions are \emph{tool augmentation} (granting the agent access to compilers, test runners, and search engines) and \emph{memory mechanisms} (allowing agents to maintain project-level context across sessions). Multi-agent coordination provides further gains on tasks requiring parallel exploration of solution spaces.
Liu et al.\ \cite{liu2024llmse} analyse 124 papers and find that \emph{tool augmentation} and \emph{memory mechanisms} account for more of the performance variation than model size does. Agents that can call a compiler and keep context across sessions do meaningfully better; adding more agents to the loop helps further on tasks requiring parallel exploration, but the returns diminish as coordination costs grow.
SWE-bench \cite{jimenez2024swebench} provides the most direct empirical evidence of the state of the art. Its 2,294 tasks require agents to understand issue descriptions, navigate large codebases, and produce multi-file patches that pass existing test suites. The 1.96\% success rate achieved by Claude~2 at publication time, rising to over 50\% for leading systems by 2025, demonstrates rapid progress but also the continued gap between agents and skilled developers on complex, open-ended tasks.
SWE-bench \cite{jimenez2024swebench} is worth dwelling on. The 2,294 tasks come from real GitHub issue trackers, require navigating codebases of meaningful size, and only count as solved if the patch actually passes the existing test suite. Claude~2 resolved 1.96\% of them at the time of publication. Leading systems crossed 50\% by 2025, which is genuine progress, but it also means that a large share of these real-world issues still goes unresolved.

\subsection{Planning, Reasoning, and Tool Use}

Masterman et al.\ \cite{masterman2024landscape} identify the planning-execution-reflection loop as the most consequential architectural decision in agentic system design. Systems that omit reflection are brittle; those incorporating structured self-critique (e.g., chain-of-thought self-evaluation) are more robust but incur higher token costs and latency. The survey also finds that multi-agent systems benefit from explicit leadership structures: designating an orchestrator agent reduces redundant computation and prevents conflicting execution states.
Masterman et al.\ \cite{masterman2024landscape} make the case that the planning-execution-reflection loop is the single most consequential architectural choice in agentic system design. Dropping reflection makes systems brittle in a characteristic way: they commit to a plan that is slightly wrong and cannot course-correct. Adding structured self-critique (chain-of-thought self-evaluation being the most common form) recovers robustness, but the token cost and latency overhead are real considerations at production scale. The authors also observe that multi-agent systems tend to waste compute when agents work in parallel without a designated orchestrator---designating one reduces both redundant computation and conflicting execution states.
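The control flow of such a loop can be sketched as follows. This is a minimal illustration rather than an implementation from Masterman et al.: the three callables are hypothetical stand-ins for LLM calls and tool invocations, and the convention that the reflection step returns \texttt{None} when it is satisfied is an assumption of the sketch.

```python
def run_agent(goal, plan, execute, reflect, max_rounds=3):
    """Run a plan-execute-reflect loop until reflection is satisfied or rounds run out.

    plan(goal, feedback)   -> list of steps (feedback is None on the first round)
    execute(step)          -> result of one step
    reflect(goal, results) -> None if the results are acceptable, else a critique string
    All three callables are hypothetical stand-ins for LLM or tool calls.
    """
    feedback = None
    results = []
    for round_number in range(1, max_rounds + 1):
        steps = plan(goal, feedback)                 # planning
        results = [execute(step) for step in steps]  # execution
        feedback = reflect(goal, results)            # reflection
        if feedback is None:
            return {"done": True, "rounds": round_number, "results": results}
    # Budget exhausted: report the last attempt instead of looping forever.
    return {"done": False, "rounds": max_rounds, "results": results}
```

Removing the call to the reflection step reduces the loop to a single pass that commits to its first plan, which is the brittle behaviour described above. The round limit is what bounds the token cost and latency that reflection adds.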
Chen et al.\ \cite{chen2025agentic} take a programming-paradigm view, defining \emph{agentic programming} as a methodology in which the LLM agent acts as both programmer and executor: writing code, running it, observing output, and revising iteratively. This loop resembles test-driven development, and the authors argue that existing software engineering practices---continuous integration, version control, code review---can be adapted to constrain and validate agentic execution.
Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} describe \emph{agentic programming} as a methodology rather than just a capability: the agent writes code, executes it, reads the output, and revises in a loop that resembles test-driven development. The more interesting claim is that standard software engineering practices---CI, version control, code review---are not obstacles to agentic execution but potential constraints that make it safer and more auditable. That argument has not been tested at scale, but it points toward an integration story that is more credible than treating agents as replacements for existing tooling.
Park et al.\ \cite{park2023generative} provide a foundational empirical study of long-horizon agent behaviour. Their 25-agent simulation demonstrates that combining three mechanisms---\emph{memory stream}, \emph{reflection}, and \emph{planning}---produces coherent, believable autonomous behaviour. The work is significant because it validates the three-component architecture at a fidelity not previously demonstrated.
Park et al.\ \cite{park2023generative} provide the empirical baseline for long-horizon behaviour. A 25-agent simulation combining a \emph{memory stream}, \emph{reflection}, and \emph{planning} produced coherent behaviour over time in a way that any one of those components alone did not. The simulation context is far simpler than a real software project, but the finding that all three components are jointly necessary---not interchangeable---has influenced most subsequent architectural work.

% -------------------------------------------------------
\section{Critical Analysis}

\subsection{Advancements}

The literature represents a substantial advance over the state of AI-assisted software engineering five years ago. The conceptual vocabulary has matured: terms such as \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} now carry reasonably consistent definitions \cite{schmidgall2024agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been formalised to the point where they can be instantiated in open-source frameworks and evaluated against reproducible benchmarks \cite{sun2025frameworks, jimenez2024swebench}. Performance on software engineering tasks improved rapidly: SWE-bench resolution rates climbed from under 2\% in 2023 to over 50\% by 2025.
Compared to where AI-assisted software engineering stood five years ago, the progress is real. Terms like \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} had inconsistent or no definitions in earlier literature; they now carry reasonably stable meanings across papers \cite{abuali2025agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been worked out in enough detail to be implemented in open-source frameworks and measured against reproducible benchmarks \cite{derouiche2025frameworks, jimenez2024swebench}. The jump from sub-2\% to over 50\% on SWE-bench between 2023 and 2025 is the kind of trajectory that justifies the field's current attention, even if it also raises questions about what happens as the easy gains run out.

\subsection{Challenges and Limitations}

\textbf{Reliability and hallucination.} Neural agents inherit the hallucination problem of their underlying LLMs \cite{schmidgall2024agentic, liu2024llmagents}. Unlike a standalone LLM response, an agentic system may execute a hallucinated plan across dozens of tool calls before the error becomes apparent, causing compounding damage that is difficult to reverse.
\textbf{Reliability and hallucination.} Neural agents carry the hallucination problem of their underlying LLMs into a context that amplifies it \cite{abuali2025agentic, jin2024llmagents}. When an agent executes a hallucinated plan across thirty tool calls before the error surfaces, the resulting state may be difficult or impossible to recover. This is qualitatively different from a standalone LLM producing a wrong answer that a human can discard.
\textbf{Evaluation fragmentation.} Both Jin et al.\ \cite{liu2024llmagents} and Liu et al.\ \cite{yang2024llmse} note the lack of unified evaluation standards. SWE-bench \cite{jimenez2024swebench} addresses this for patch generation, but no comparable benchmark exists for requirements engineering, architecture design, or system-level testing.
\textbf{Evaluation fragmentation.} Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} both flag the absence of unified evaluation standards, and it is not a minor complaint---it means most cross-paper comparisons in this survey are only approximate. SWE-bench \cite{jimenez2024swebench} closes the gap for patch generation. For requirements engineering, architecture design, and system-level testing, the field is still measuring each team's work against its own ruler.
\textbf{Coordination scalability.} The auction and consensus mechanisms in \cite{ieee2025multiagent} and the architectural guidelines in SALLMA \cite{sallma2025} address multi-agent coordination at small-to-medium scales. How these approaches perform with dozens or hundreds of concurrent agents remains largely unexplored.
\textbf{Coordination scalability.} The auction and consensus mechanisms in \cite{ieee2025multiagent} and the SALLMA architecture \cite{sallma2025} were designed for systems with a handful of agents. Whether they hold up with dozens or hundreds of concurrent agents is largely untested, and there is no strong theoretical reason to expect linear scaling.
\textbf{Context window limits.} The finite context window of current LLMs constrains project-level state \cite{yang2024llmse, chen2025agentic}. External memory mitigates this but introduces retrieval accuracy degradation as the knowledge base grows.
\textbf{Context window limits.} This is an architectural constraint rather than a research gap---every LLM has one, and no amount of clever prompting makes it disappear \cite{liu2024llmse, wang2025aiagenticprogrammingsurvey}. External memory pushes the problem out but does not eliminate it; retrieval accuracy degrades as the knowledge base grows, and the degradation is not always predictable.
\textbf{Security and governance.} Schmidgall and Dornaika \cite{schmidgall2024agentic} identify governance deficits as one of the most critical research gaps. An agent with access to a file system, compiler, and network interface represents a significant attack surface; prompt injection attacks have been demonstrated in practice but are not addressed by any of the surveyed architectural designs.
\textbf{Security and governance.} An agent with read/write access to a file system, a compiler, and a network interface is a significant attack surface \cite{abuali2025agentic}. Prompt injection attacks against agentic systems have been demonstrated outside the lab. None of the architectural designs surveyed here treat this as a first-class concern; it appears, if at all, as a footnote on future work.

\subsection{Comparing Approaches}

A notable disagreement concerns the relative merits of single-agent versus multi-agent designs. Masterman et al.\ \cite{masterman2024landscape} find that single-agent systems with strong reflection are competitive with multi-agent systems on many benchmarks while being simpler to debug. He et al.\ \cite{ishibashi2024multiagent} and Rajendran et al.\ \cite{ieee2025multiagent} argue that specialisation in multi-agent systems produces qualitatively better results for complex, long-horizon tasks. The discrepancy is partly methodological: papers advocating multi-agent systems tend to evaluate on more complex tasks. A unified benchmark spanning task complexity would resolve this debate.
The debate over single-agent versus multi-agent architectures remains unresolved, and the divergence stems as much from methodological asymmetry as from genuine differences in architectural capability. Masterman et al.\ \cite{masterman2024landscape} advance the case for single-agent sufficiency: they report that an agent equipped with a complete planning-execution-reflection loop achieves performance competitive with multi-agent ensembles while incurring substantially lower coordination overhead. Their key observation is that omitting the reflection phase produces characteristic brittleness, with agents committing to subtly wrong plans and failing to course-correct. This suggests that architectural completeness within a single agent may substitute for specialisation distributed across an agent team. Park et al.\ \cite{park2023generative} reinforce this interpretation through their 25-agent simulation: coherent long-horizon behaviour emerges only when memory retrieval, reflection, and planning are instantiated jointly, and any two-component subset produces noticeably degraded outcomes. This non-additive interaction runs counter to the assumption that each mechanism contributes independently.
Against this, He, Treude, and Lo \cite{ishibashi2024multiagent} argue that for tasks requiring concurrent exploration of disjoint state spaces, the sequential planning bottleneck inherent to single-agent designs becomes the binding constraint regardless of how refined each architectural component is. Rajendran et al.\ \cite{ieee2025multiagent} operationalise this advantage through an auction-based task allocation protocol in which competing agent bids surface decomposition conflicts before they propagate through the execution graph. This coordination mechanism has no natural single-agent analogue, and its benefit is most visible precisely on the compositionally complex tasks that single-agent evaluations tend to exclude. The experimental record is therefore difficult to reconcile on a common footing: multi-agent papers systematically evaluate on tasks of greater compositional depth, confounding architectural comparison with task difficulty. What the available data do consistently show, as SALLMA's infrastructure analysis makes explicit \cite{sallma2025}, is that coordination overhead scales superlinearly with agent count. That places an empirical ceiling on the multi-agent advantage, one that becomes binding sooner than the optimistic framing of distributed architectures typically acknowledges and that no current framework has credibly resolved.

% -------------------------------------------------------
\section{Future Directions}

\textbf{Hybrid neuro-symbolic architectures.} Schmidgall and Dornaika \cite{schmidgall2024agentic} explicitly call for hybrid designs that combine the flexibility of neural agents with the determinism and verifiability of symbolic planners. A symbolic planner could verify the safety of a neural agent's proposed plan before execution, providing formal guarantees currently absent from purely neural systems.
\textbf{Hybrid neuro-symbolic architectures.} Abou Ali and Dornaika \cite{abuali2025agentic} call for hybrid designs that pair neural flexibility with symbolic verifiability. The specific proposal---a symbolic planner that checks a neural agent's plan before execution---is one reasonable instantiation, though the hard part is specifying what ``safe'' means formally for a plan that modifies a codebase. That engineering problem is not solved by proposing the architecture.
\textbf{Standardised evaluation frameworks.} The evaluation gap identified by Jin et al.\ \cite{liu2024llmagents} and Liu et al.\ \cite{yang2024llmse} needs benchmarks spanning the full development lifecycle---not just code generation. Future work should develop equivalents to SWE-bench \cite{jimenez2024swebench} for requirements elicitation, high-level design, and system integration testing.
\textbf{Evaluation beyond patch generation.} The benchmarking gap flagged by Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} is probably the most practically limiting gap in the field right now. SWE-bench \cite{jimenez2024swebench} does one thing well; there is no equivalent for requirements elicitation, high-level design decisions, or system integration testing. Building such benchmarks is unglamorous work, but without them the field will keep talking past itself.
\textbf{Long-horizon autonomy and persistent memory.} Park et al.\ \cite{park2023generative} demonstrate the potential of persistent memory and reflection, but their simulation is far simpler than a real software project. Future research should investigate how memory mechanisms scale when agents must track thousands of source files and evolving requirements over months-long cycles. Techniques from continual learning appear particularly relevant.
\textbf{Long-horizon memory at project scale.} Park et al.\ \cite{park2023generative} demonstrate that persistent memory and reflection work in a contained simulation. The open question is how memory mechanisms behave when an agent must track thousands of source files and requirements that evolve across a months-long project. Continual learning research is the closest relevant body of work, but it has not been seriously applied in this context.
\textbf{Security and trust.} The governance gaps flagged by Schmidgall and Dornaika \cite{schmidgall2024agentic} indicate that security engineering for agentic systems is largely open. Formal threat models, sandboxing mechanisms, and audit-log designs that allow operators to verify agent behaviour after the fact are all needed.
\textbf{Security engineering for agentic systems.} The governance gap \cite{abuali2025agentic} is not just a research priority---it is a deployment risk. Formal threat models, sandboxing designs, and audit logs that let operators reconstruct what an agent did and why are all currently missing from proposed architectures. It is somewhat surprising that none of the multi-agent frameworks surveyed here treat access control as a first-class concern.
\textbf{Human-agent collaboration models.} He et al.\ \cite{ishibashi2024multiagent} and Chen et al.\ \cite{chen2025agentic} suggest that the most productive near-term model is collaborative: humans and agents share responsibility across the lifecycle. Designing effective interaction protocols---when an agent should ask for clarification, how human corrections propagate through a plan, and how to represent agent uncertainty to non-expert stakeholders---remains an open problem.
\textbf{Human-agent interaction protocols.} He et al.\ \cite{ishibashi2024multiagent} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} both converge on a collaborative model where humans and agents share responsibility rather than the agent operating autonomously. What that means in practice---when should an agent stop and ask for clarification, how do human corrections propagate through an existing plan, how is agent uncertainty communicated to someone without an ML background---is almost entirely unspecified in the current literature.

% -------------------------------------------------------
\section{Conclusion}

This survey has reviewed 13 papers published between 2023 and 2026 on the design of software systems incorporating agentic AI. The reviewed literature demonstrates that agentic AI has moved from a theoretical concept to a practical engineering challenge: open-source frameworks \cite{sun2025frameworks} are in active deployment, benchmarks \cite{jimenez2024swebench} provide reproducible measures of progress, and architectural patterns for memory, planning, and multi-agent coordination have been formalised sufficiently for critical comparison.
The 13 papers reviewed here span roughly three years of a field advancing rapidly, and the picture that emerges is genuinely mixed. Open-source frameworks \cite{derouiche2025frameworks} that implement the architectural patterns described in this survey are actively deployed in production settings, and SWE-bench \cite{jimenez2024swebench} supplies a shared empirical reference point on which the community has begun to converge. Only since the instruction-tuning era has it become feasible to treat \emph{hallucination propagation}, \emph{context window saturation}, and \emph{prompt injection} as first-class engineering parameters to be managed rather than properties that categorically preclude deployment. The architectural vocabulary for memory, planning, and multi-agent coordination, which together make up what practitioners now call the agent loop, is stable enough for principled comparison. Natural Language Processing (NLP) research alone could not have delivered that stability; it also required the complementary advances in agent scaffolding documented across this literature.
At the same time, the survey reveals that the field is far from maturity. Hallucination and unreliable planning constrain the autonomy that can be safely delegated. Evaluation standards remain fragmented. Governance and security frameworks are essentially absent from proposed architectural designs. And the long-horizon, project-scale autonomy that would represent a genuine transformation of software practice has not yet been convincingly demonstrated.
On the other hand, the limitations identified in this survey are not peripheral details. Hallucination in agentic execution loops is qualitatively more dangerous than in single-turn generation: a tool-call chain that fails at step twenty-eight may leave a codebase in a partially modified state that requires forensic inspection to diagnose, with no straightforward rollback if the agent did not maintain a structured execution log. Nowhere in the surveyed architectural proposals do security and governance appear as first-class design concerns, despite documented demonstrations of adversarial prompt injection against deployed agentic systems \cite{abuali2025agentic}. Evaluation practices outside patch generation remain fragmented, meaning most cross-paper performance comparisons rest on incommensurable baselines, and the long-horizon project-scale autonomy that would represent a genuine shift in software development practice has not been convincingly demonstrated at scale.
The implications for software system design are clear: practitioners adopting agentic AI today must design for human oversight, invest in robust evaluation infrastructure, and treat the agent as an architectural component subject to the same quality attributes---reliability, security, maintainability---as any other system component \cite{sallma2025}. Researchers, meanwhile, have a rich agenda whose resolution will determine how quickly the field moves from promising demonstrations to dependable practice.
For practitioners adopting agentic AI today, the implication is not to wait, but to proceed with deliberate architecture: designing for human oversight at each agent decision boundary, investing in evaluation infrastructure that spans the full software engineering lifecycle, and treating the agent as a system component subject to the same quality attributes---reliability, security, maintainability---that govern every other element of the stack \cite{sallma2025}. The trajectory from sub-2\% to over 50\% on SWE-bench across two years is a credible signal that further progress is achievable. The architectural vocabulary surveyed here supplies a sufficient foundation for principled system design, provided one resists conflating benchmark performance with operational dependability. What this field still requires---formal threat models, lifecycle-spanning evaluation standards, deployable governance frameworks---are not peripheral research interests but engineering preconditions for the dependable deployment that the field's own architectural ambitions implicitly demand.

% -------------------------------------------------------
\bibliographystyle{ACM-Reference-Format}
\bibliographystyle{IEEEtran}
\bibliography{references}

\end{document}

+121
-70
@@ -1,70 +1,78 @@
% references.bib — JC3506 Individual Study
% Topic: Software System Design with Agentic AI
% Cite in text with \cite{key}
%
% 13 primary papers organised by theme:
% Theme 1 — Foundations & Architectures (4 papers)
% Theme 2 — Multi-Agent Systems & Frameworks (3 papers)
% Theme 3 — Software Engineering Applications (3 papers)
% Theme 4 — Planning, Reasoning & Tool Use (3 papers)

% -------------------------------------------------------
% THEME 1: Foundations & Architectures of Agentic AI
% -------------------------------------------------------

% Comprehensive 2024 survey — good opening citation for the introduction
@misc{schmidgall2024agentic,
author = {Schmidgall, Samuel and others},
title = {Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions},
year = {2024},
eprint = {2510.25445},
archivePrefix = {arXiv},
primaryClass = {cs.AI}
% Comprehensive 2025 survey — dual-paradigm framework (symbolic vs neural)
@article{abuali2025agentic,
title={Agentic AI: a comprehensive survey of architectures, applications, and future directions},
volume={59},
ISSN={1573-7462},
url={http://dx.doi.org/10.1007/s10462-025-11422-4},
DOI={10.1007/s10462-025-11422-4},
number={1},
journal={Artificial Intelligence Review},
publisher={Springer Science and Business Media LLC},
author={Abou Ali, Mohamad and Dornaika, Fadi and Charafeddine, Jinan},
year={2025},
month=Nov
}

% Widely cited foundational survey on LLM-based autonomous agents
@article{wang2024survey,
author = {Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Ji-Rong},
title = {A Survey on Large Language Model based Autonomous Agents},
journal = {Frontiers of Computer Science},
title={A survey on large language model based autonomous agents},
volume={18},
ISSN={2095-2236},
url={http://dx.doi.org/10.1007/s11704-024-40231-1},
DOI={10.1007/s11704-024-40231-1},
number={6},
pages = {186345},
journal={Frontiers of Computer Science},
publisher={Springer Science and Business Media LLC},
author={Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Jirong},
year={2024},
doi = {10.1007/s11704-024-40231-1}
month=Mar
}

% Taxonomy of agent architectures: Perception, Brain, Planning, Action, Tools
@misc{sun2026architectures,
author = {Sun, Yifan and others},
title = {Agentic Artificial Intelligence: Architectures, Taxonomies, and Evaluation of Large Language Model Agents},
% Taxonomy: Perception, Brain, Planning, Action, Tools; evaluation framework
@misc{arunkumar2026architectures,
title={Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents},
author={Arunkumar V and Gangadharan G. R. and Rajkumar Buyya},
year={2026},
eprint={2601.12560},
archivePrefix={arXiv},
primaryClass = {cs.AI}
primaryClass={cs.AI},
url={https://arxiv.org/abs/2601.12560},
}

% Covers CrewAI, LangGraph, AutoGen, MetaGPT framework comparison
@misc{sun2025frameworks,
author = {Sun, Yifan and others},
% Systematic review of CrewAI, LangGraph, AutoGen, MetaGPT
@misc{derouiche2025frameworks,
title={Agentic AI Frameworks: Architectures, Protocols, and Design Challenges},
author={Hana Derouiche and Zaki Brahmi and Haithem Mazeni},
year={2025},
eprint={2508.10146},
archivePrefix={arXiv},
primaryClass = {cs.MA}
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.10146},
}

% -------------------------------------------------------
% THEME 2: Multi-Agent Systems & Coordination
% -------------------------------------------------------

% ACM TOSEM — literature review on LLM multi-agent SE systems (peer-reviewed journal)
@article{ishibashi2024multiagent,
author = {Ishibashi, Yoichi and Nishimura, Yoshimasa},
title = {{LLM}-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead},
journal = {ACM Transactions on Software Engineering and Methodology},
year = {2024},
doi = {10.1145/3712003}
author = {He, Junda and Treude, Christoph and Lo, David},
title = {LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead},
year = {2025},
issue_date = {June 2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {34},
number = {5},
issn = {1049-331X},
url = {https://doi.org/10.1145/3712003},
doi = {10.1145/3712003},
abstract = {Integrating Large Language Models (LLMs) into autonomous agents marks a significant shift in the research landscape by offering cognitive abilities that are competitive with human planning and reasoning. This article explores the transformative potential of integrating Large Language Models into Multi-Agent (LMA) systems for addressing complex challenges in software engineering (SE). By leveraging the collaborative and specialized abilities of multiple agents, LMA systems enable autonomous problem-solving, improve robustness, and provide scalable solutions for managing the complexity of real-world software projects. In this article, we conduct a systematic review of recent primary studies to map the current landscape of LMA applications across various stages of the software development lifecycle (SDLC). To illustrate current capabilities and limitations, we perform two case studies to demonstrate the effectiveness of state-of-the-art LMA frameworks. Additionally, we identify critical research gaps and propose a comprehensive research agenda focused on enhancing individual agent capabilities and optimizing agent synergy. Our work outlines a forward-looking vision for developing fully autonomous, scalable, and trustworthy LMA systems, laying the foundation for the evolution of Software Engineering 2.0.},
journal = {ACM Trans. Softw. Eng. Methodol.},
month = may,
articleno = {124},
numpages = {30},
keywords = {Large Language Models, Autonomous Agents, Multi-Agent Systems, Software Engineering}
}

% IEEE conference — multi-agent LLM environment for software design and refactoring
@@ -80,7 +88,9 @@
doi={10.1109/SoutheastCon56624.2025.10971563}
}

% IEEE conference — software architecture for LLM-based multi-agent systems (SALLMA)


% IEEE/ACM workshop — reference software architecture for LLM-based multi-agent systems
@INPROCEEDINGS{sallma2025,
author={Becattini, Marco and Verdecchia, Roberto and Vicario, Enrico},
booktitle={2025 IEEE/ACM International Workshop New Trends in Software Architecture (SATrends)},
@@ -90,73 +100,114 @@
number={},
pages={5-8},
keywords={Structured Query Language;Software architecture;NoSQL databases;Pressing;Market research;Software;Real-time systems;Faces;Multi-agent systems;Python;software architecture;se4ai;llm},
doi={10.1109/SATrends66715.2025.00006}
}
doi={10.1109/SATrends66715.2025.00006}}


% -------------------------------------------------------
% THEME 3: Software Engineering Applications
% -------------------------------------------------------

% Survey of LLM agents across SE tasks: requirements, code gen, design, testing, maintenance
@misc{liu2024llmagents,
author = {Liu, Junwei and others},
title = {From {LLMs} to {LLM}-based Agents for Software Engineering: A Survey of Current, Challenges and Future},
year = {2024},
@misc{jin2024llmagents,
title={From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future},
author={Haolin Jin and Linghan Huang and Haipeng Cai and Jun Yan and Bo Li and Huaming Chen},
year={2025},
eprint={2408.02479},
archivePrefix={arXiv},
primaryClass = {cs.SE}
primaryClass={cs.SE},
url={https://arxiv.org/abs/2408.02479},
}

% 124-paper survey from both SE and agent perspectives
@misc{yang2024llmse,
author = {Yang, Junwei and others},
% 124-paper survey from both SE and agent perspectives (accepted at ACM TOSEM)
@misc{liu2024llmse,
title={Large Language Model-Based Agents for Software Engineering: A Survey},
year = {2024},
author={Junwei Liu and Kaixin Wang and Yixuan Chen and Xin Peng and Zhenpeng Chen and Lingming Zhang and Yiling Lou},
year={2025},
eprint={2409.02977},
archivePrefix={arXiv},
primaryClass = {cs.SE}
primaryClass={cs.SE},
url={https://arxiv.org/abs/2409.02977},
}

% SWE-bench — seminal benchmark for evaluating agents on real GitHub issues
% SWE-bench — benchmark for evaluating agents on real GitHub issues (ICLR 2024)
@misc{jimenez2024swebench,
author = {Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik},
title = {{SWE}-bench: Can Language Models Resolve Real-World {GitHub} Issues?},
title={SWE-bench: Can Language Models Resolve Real-World GitHub Issues?},
author={Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik Narasimhan},
year={2024},
eprint={2310.06770},
archivePrefix={arXiv},
primaryClass = {cs.SE}
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.06770},
}

% -------------------------------------------------------
% THEME 4: Planning, Reasoning & Tool Use
% -------------------------------------------------------

% Surveys reasoning, planning, tool-calling patterns across agent architectures
@misc{masterman2024landscape,
author = {Masterman, Tula and Besen, Sandi and Sawtell, Mason and Chao, Alex},
title={The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey},
author={Tula Masterman and Sandi Besen and Mason Sawtell and Alex Chao},
year={2024},
eprint={2404.11584},
archivePrefix={arXiv},
primaryClass = {cs.AI}
primaryClass={cs.AI},
url={https://arxiv.org/abs/2404.11584},
}

% Generative agents — foundational simulation of autonomous agent behaviour (UIST 2023)
@inproceedings{park2023generative,
author = {Park, Joon Sung and O'Brien, Joseph C. and Cai, Carrie J. and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S.},
author = {Park, Joon Sung and O'Brien, Joseph and Cai, Carrie Jun and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S.},
title = {Generative Agents: Interactive Simulacra of Human Behavior},
booktitle = {Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23)},
year = {2023},
doi = {10.1145/3586183.3606763}
isbn = {9798400701320},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3586183.3606763},
doi = {10.1145/3586183.3606763},
|
||||
abstract = {Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents: computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent’s experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors. For example, starting with only a single user-specified notion that one agent wants to throw a Valentine’s Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture—observation, planning, and reflection—each contribute critically to the believability of agent behavior. By fusing large language models with computational interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.},
|
||||
booktitle = {Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology},
|
||||
articleno = {2},
|
||||
numpages = {22},
|
||||
keywords = {Human-AI interaction, agents, generative AI, large language models},
|
||||
location = {San Francisco, CA, USA},
|
||||
series = {UIST '23}
|
||||
}
|
||||
|
||||
% -----------------------------------------------
|
||||
% Additional references — official framework documentation
|
||||
% -----------------------------------------------
|
||||
|
||||
% LangGraph official documentation — graph-based stateful agent workflows
|
||||
@misc{langgraphdocs,
|
||||
title={LangGraph Documentation},
|
||||
author={{LangChain AI}},
|
||||
year={2025},
|
||||
url={https://langchain-ai.github.io/langgraph/},
|
||||
note={Accessed May 2025}
|
||||
}
|
||||
|
||||
% CrewAI official documentation — role-based multi-agent orchestration framework
|
||||
@misc{crewaidocs,
|
||||
title={CrewAI Documentation},
|
||||
author={{CrewAI Inc.}},
|
||||
year={2025},
|
||||
url={https://docs.crewai.com/},
|
||||
note={Accessed May 2025}
|
||||
}
|
||||
|
||||
% AutoGen official documentation — Microsoft's conversational multi-agent framework
|
||||
@misc{autogendocs,
|
||||
title={AutoGen Documentation},
|
||||
author={{Microsoft Research}},
|
||||
year={2025},
|
||||
url={https://microsoft.github.io/autogen/},
|
||||
note={Accessed May 2025}
|
||||
}
|
||||
|
||||
% AI agentic programming: planning, memory, tool integration, execution monitoring
|
||||
@misc{chen2025agentic,
|
||||
author = {Chen, Jiannan and others},
|
||||
@misc{wang2025aiagenticprogrammingsurvey,
|
||||
title={AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities},
|
||||
author={Huanting Wang and Jingzhi Gong and Huawei Zhang and Jie Xu and Zheng Wang},
|
||||
year={2025},
|
||||
eprint={2508.11126},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass = {cs.SE}
|
||||
primaryClass={cs.SE},
|
||||
url={https://arxiv.org/abs/2508.11126},
|
||||
}
|
||||