% Submission filename: JC3506_A1_<Surname>_<FirstName>_<StudentID>.pdf
% Course: JC3506 Software Design and Implementation
% Assessment: Individual Study — Systematic Literature Survey
% Topic: Software System Design with Agentic AI

\documentclass[manuscript, anonymous=false]{acmart}

%% Force symmetric margins (override acmart's twoside default)
\geometry{twoside=false, left=2.5cm, right=2.5cm, top=2.5cm, bottom=2.5cm}

%% ACM rights / metadata — left blank for student submission
\setcopyright{none}
\acmDOI{}
\acmISBN{}
\acmConference[JC3506]{Software Design and Implementation}{2025--2026}{University of Aberdeen}

%% Additional packages (acmart already loads hyperref, natbib, geometry)
\usepackage{booktabs}
\usepackage{float}

% -------------------------------------------------------
\begin{document}

\title{Software System Design with Agentic AI: A Systematic Literature Survey}

\author{SiFan Chen}
\affiliation{%
\institution{University of Aberdeen}
\country{United Kingdom}
}
\email{u28sc22@abdn.ac.uk}
\begin{abstract}
Agentic AI systems---large language models embedded within autonomous execution loops that perceive, plan, invoke tools, and revise behaviour---are reshaping how software is designed and built. This paper presents a systematic literature survey of 13 peer-reviewed or widely cited papers (2023--2026) on the design of software systems incorporating agentic AI. The survey organises findings into four themes: foundational architectures and taxonomies, multi-agent frameworks and coordination, applications across the software engineering lifecycle, and planning/reasoning/tool-use mechanisms. A critical analysis identifies hallucination and reliability, evaluation fragmentation, coordination scalability, and governance as the principal open challenges. Future directions include hybrid neuro-symbolic architectures, lifecycle-spanning benchmarks, persistent long-horizon memory, and principled human-agent collaboration models.
\end{abstract}

\keywords{agentic AI, software system design, large language models, multi-agent systems, autonomous software engineering}

\maketitle
% -------------------------------------------------------
\section{Introduction}

Artificial intelligence has long been applied to software engineering in forms that assist but do not act: code completion tools, static analysers, and defect predictors all augment a human developer without replacing their judgment. A qualitatively different model has now emerged under the label of \emph{agentic AI}, in which a large language model (LLM) is embedded within an autonomous execution loop that can perceive its environment, form plans, invoke external tools, and revise its behaviour based on feedback---all without step-by-step human direction \cite{schmidgall2024agentic, wang2024survey}.

This shift carries profound implications for software system design. Classical software architecture treats the system boundary as a fixed interface between human intent and machine execution. Agentic systems dissolve that boundary: a single agent or a team of collaborating agents can now draft requirements, generate and test code, perform code review, and refactor modules in a continuous loop \cite{liu2024llmagents}. The design of such systems---how agents are structured, how they plan, how they share state, and how they are evaluated---has itself become an active research area.

This survey provides a structured review of recent literature on the design of software systems that incorporate or consist of agentic AI. The review covers four interlocking themes: (1) foundational architectures and taxonomies of agentic systems; (2) multi-agent frameworks and coordination mechanisms; (3) the application of agentic AI to concrete software engineering tasks; and (4) the reasoning, planning, and tool-use capabilities that underpin agent behaviour. The survey closes with a critical analysis of current limitations and a discussion of open research directions.

The selected literature spans 2023--2026, drawn primarily from IEEE Xplore, the ACM Digital Library, and arXiv. All 13 primary sources are peer-reviewed conference or journal papers, or widely cited preprints with subsequent journal acceptance.
% -------------------------------------------------------
\section{Research Methodology}

A systematic search was conducted across IEEE Xplore, the ACM Digital Library, arXiv, and Google Scholar using the following keyword combinations:
\emph{agentic AI software system design};
\emph{LLM-based autonomous agents software engineering};
\emph{multi-agent systems LLM software architecture};
\emph{AI agent planning reasoning tool use};
\emph{autonomous software development benchmark}.

\textbf{Inclusion criteria:} (i) published or submitted after January 2023; (ii) directly addresses the architecture, capabilities, or evaluation of agentic AI systems in a software design or software engineering context; (iii) available as a full paper.

\textbf{Exclusion criteria:} (i) work focused exclusively on narrow NLP tasks without a software engineering application; (ii) papers whose primary contribution is a new LLM pre-training method rather than an agentic system design.

The initial search returned over 200 candidates. After de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.
% -------------------------------------------------------
\section{Thematic Overview}
\label{sec:themes}

The 13 selected papers are grouped into four thematic clusters in Table~\ref{tab:themes}. Several papers contribute to more than one theme; each is assigned to its dominant focus.

\begin{table*}[t]
\caption{Thematic classification of surveyed papers.}
\label{tab:themes}
\begin{tabular}{@{}lp{3.5cm}p{7cm}@{}}
\toprule
\textbf{Theme} & \textbf{Papers} & \textbf{Core focus} \\
\midrule
Foundations \& Architectures
& \cite{schmidgall2024agentic, wang2024survey, sun2026architectures, sun2025frameworks}
& Taxonomies, paradigms, and framework comparisons \\
\addlinespace
Multi-Agent Systems
& \cite{ishibashi2024multiagent, ieee2025multiagent, sallma2025}
& Coordination, communication, and architectural patterns for agent teams \\
\addlinespace
SE Applications
& \cite{liu2024llmagents, yang2024llmse, jimenez2024swebench}
& Applying agents to requirements, code generation, testing, and maintenance \\
\addlinespace
Planning, Reasoning \& Tool Use
& \cite{masterman2024landscape, park2023generative, chen2025agentic}
& Internal cognitive mechanisms and execution loops \\
\bottomrule
\end{tabular}
\end{table*}
\subsection{Foundations and Architectures of Agentic AI Systems}

The foundational literature establishes the conceptual vocabulary and architectural patterns that the rest of the field builds upon. Schmidgall and Dornaika \cite{schmidgall2024agentic} introduce a \emph{dual-paradigm} framework that separates \emph{symbolic/classical} agents (relying on deterministic planning and persistent state machines) from \emph{neural/generative} agents (driven by stochastic generation and prompt-based orchestration). Wang et al.\ \cite{wang2024survey} propose a unified architectural model centred on three sub-systems: a \emph{brain} (the LLM), a \emph{perception} module, and an \emph{action} module. Sun et al.\ \cite{sun2026architectures} extend this by decomposing the brain into Planning, Reasoning, and Memory components. The framework survey by Sun et al.\ \cite{sun2025frameworks} maps these abstractions onto concrete open-source frameworks---AutoGen, LangGraph, CrewAI, and MetaGPT---analysing their design trade-offs.

\subsection{Multi-Agent Frameworks and Coordination}

Once individual agent architectures are established, a natural extension is composing multiple agents into collaborative systems. He, Treude, and Lo \cite{ishibashi2024multiagent} provide a literature review of LLM-based multi-agent (LMA) systems within the software development lifecycle, identifying coordination and trust challenges that arise when agents take on specialised roles. Rajendran et al.\ \cite{ieee2025multiagent} present a conceptual framework for software design and refactoring using auction-based task allocation and consensus protocols to manage agent disagreement. Becattini, Verdecchia, and Vicario \cite{sallma2025} address the architectural layer directly with SALLMA, a reference software architecture that specifies interfaces, shared state management, and real-time agent communication.

\subsection{Applications in Software Engineering}

Three papers evaluate agentic AI directly against software engineering tasks. Jin et al.\ \cite{liu2024llmagents} conduct a broad survey covering six SE domains, establishing clear distinctions between standalone LLMs and agent-based systems in terms of autonomy and self-improvement. Liu et al.\ \cite{yang2024llmse} categorise 124 papers from both the SE and agent-capability perspectives, showing that tool-augmented agents consistently outperform standalone models. Jimenez et al.\ \cite{jimenez2024swebench} introduce SWE-bench, a benchmark of 2,294 real-world GitHub issues drawn from 12 Python repositories, providing the field's most widely used empirical measuring stick.

\subsection{Planning, Reasoning, and Tool Use}

The internal mechanisms that allow agents to decompose goals and invoke external resources are surveyed by Masterman et al.\ \cite{masterman2024landscape} and Chen et al.\ \cite{chen2025agentic}. Masterman et al.\ examine single-agent and multi-agent implementations and identify three critical phases---\emph{planning}, \emph{execution}, and \emph{reflection}---present in robust systems. Chen et al.\ focus on \emph{agentic programming} as an emerging paradigm in which agents autonomously iterate on a task. Park et al.\ \cite{park2023generative} provide a foundational empirical study demonstrating that architectures combining memory retrieval, reflection, and planning can produce coherent long-horizon behaviour.
% -------------------------------------------------------
\section{Detailed Discussion}

\subsection{Foundations and Architectures}

The dual-paradigm framework of Schmidgall and Dornaika \cite{schmidgall2024agentic} resolves a persistent ambiguity: earlier surveys grouped rule-based planners from the 1980s with modern LLM-driven agents, obscuring fundamental differences in uncertainty handling and knowledge representation. By separating symbolic and neural lineages, the authors provide a principled basis for architectural selection. Their analysis of 90 studies (2018--2025) shows that symbolic agents dominate safety-critical settings where determinism and formal verification are required, while neural agents prevail in adaptive, data-rich environments.

Wang et al.\ \cite{wang2024survey} complement this with component-level analysis. Their architecture positions the LLM as a central reasoning engine. Memory is divided into \emph{in-context} (working) memory and \emph{external} memory (vector databases, knowledge graphs)---a distinction with direct engineering implications: in-context memory is bounded by the model's context window, while external memory scales arbitrarily but introduces retrieval latency and recall errors.

Sun et al.\ \cite{sun2026architectures} extend the taxonomy to evaluation, arguing that agents should be assessed across all five architectural layers rather than solely by task completion rate. The authors document how early agent loops such as ReAct adopted flat sequential structures, while more recent designs use hierarchical search and recursive decomposition for non-linear problem solving. The framework comparison in Sun et al.\ \cite{sun2025frameworks} translates these abstractions into engineering decisions: LangGraph's graph-based execution model supports stateful, cyclical workflows, whereas CrewAI prioritises ease of configuration for role-based pipelines.
\subsection{Multi-Agent Frameworks}

He, Treude, and Lo \cite{ishibashi2024multiagent} identify a key architectural tension: specialisation versus coordination overhead. Highly specialised agents achieve higher domain quality but require robust inter-agent communication protocols to resolve conflicts. The authors propose a research agenda centred on improving individual agent capabilities while simultaneously optimising the collaboration layer.

Rajendran et al.\ \cite{ieee2025multiagent} operationalise this in a conceptual framework targeting software design and refactoring. Their system decomposes a change request into subtasks auctioned among specialised agents; a consensus protocol arbitrates conflicting outputs. SALLMA \cite{sallma2025} operates at a lower level of abstraction, separating agent logic from infrastructure concerns and prescribing both relational databases for structured metadata and NoSQL stores for unstructured agent memory. By formalising the architecture, SALLMA enables the application of standard software quality attributes---availability, scalability, maintainability---to agentic systems.
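
The auction step can be stated compactly. The notation here is an illustrative assumption introduced for this survey and is not taken from Rajendran et al.\ \cite{ieee2025multiagent}: let $A$ denote the set of specialised agents, $t$ a subtask, and $b_a(t)$ a hypothetical bid expressing how well agent $a$ judges itself suited to $t$. Under a simple highest-bid rule, the subtask is assigned to
\[
  a^{*}(t) = \arg\max_{a \in A} b_a(t),
\]
with the consensus protocol invoked when the outputs returned by different winning agents conflict. Other allocation rules are possible; this sketch is meant only to make the mechanism concrete.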
\subsection{Software Engineering Applications}

Jin et al.\ \cite{liu2024llmagents} survey six SE lifecycle domains. In requirements engineering, agents can elicit requirements through dialogue and generate formal specifications. In code generation, agent-based approaches outperform standalone LLM prompting by iterating on failing test cases. In software design, agents are used to generate class diagrams, API contracts, and architecture documentation. The survey concludes that the field lacks unified evaluation standards, making cross-paper comparison difficult.

Liu et al.\ \cite{yang2024llmse} address this gap by analysing 124 papers from both SE-task and agent-capability perspectives. They find that the most impactful capability additions are \emph{tool augmentation} (granting the agent access to compilers, test runners, and search engines) and \emph{memory mechanisms} (allowing agents to maintain project-level context across sessions). Multi-agent coordination provides further gains on tasks requiring parallel exploration of solution spaces.

SWE-bench \cite{jimenez2024swebench} provides the most direct empirical evidence of the state of the art. Its 2,294 tasks require agents to understand issue descriptions, navigate large codebases, and produce multi-file patches that pass existing test suites. Claude~2 resolved only 1.96\% of tasks at publication time, whereas leading systems exceeded 50\% by 2025. This demonstrates rapid progress, but also a continuing gap between agents and skilled developers on complex, open-ended tasks.
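
To give a sense of scale, the SWE-bench percentages can be converted into approximate task counts. This is an illustrative calculation only: it assumes that both rates are measured against the full 2,294-task set, whereas later results are often reported on subsets of the benchmark, so the second figure should be read as a rough indication rather than a reported result:
\[
  0.0196 \times 2294 \approx 45, \qquad 0.50 \times 2294 = 1147.
\]
Under that assumption, the change corresponds to moving from a few dozen resolved issues to more than a thousand.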
\subsection{Planning, Reasoning, and Tool Use}

Masterman et al.\ \cite{masterman2024landscape} identify the planning-execution-reflection loop as the most consequential architectural decision in agentic system design. Systems that omit reflection are brittle; those incorporating structured self-critique (e.g., chain-of-thought self-evaluation) are more robust but incur higher token costs and latency. The survey also finds that multi-agent systems benefit from explicit leadership structures: designating an orchestrator agent reduces redundant computation and prevents conflicting execution states.

Chen et al.\ \cite{chen2025agentic} take a programming-paradigm view, defining \emph{agentic programming} as a methodology in which the LLM agent acts as both programmer and executor: writing code, running it, observing output, and revising iteratively. This loop resembles test-driven development, and the authors argue that existing software engineering practices---continuous integration, version control, code review---can be adapted to constrain and validate agentic execution.

Park et al.\ \cite{park2023generative} provide a foundational empirical study of long-horizon agent behaviour. Their 25-agent simulation demonstrates that combining three mechanisms---\emph{memory stream}, \emph{reflection}, and \emph{planning}---produces coherent, believable autonomous behaviour. The work is significant because it validates the three-component architecture at a fidelity not previously demonstrated.
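
The retrieval step that connects the memory stream to reflection and planning can be summarised as a weighted score. The form below follows the description in Park et al.\ \cite{park2023generative} as this survey understands it; the symbols are notation introduced here for illustration. For a stored memory $m$ and the agent's current query $q$,
\[
  \mathrm{score}(m, q) = \alpha_{\mathrm{rec}}\,\mathrm{recency}(m) + \alpha_{\mathrm{imp}}\,\mathrm{importance}(m) + \alpha_{\mathrm{rel}}\,\mathrm{relevance}(m, q),
\]
where each term is normalised to $[0,1]$ and the highest-scoring memories that fit within the context window are passed to the LLM. The weights $\alpha$ are tunable; setting them all to 1 treats the three signals equally.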
% -------------------------------------------------------
\section{Critical Analysis}

\subsection{Advancements}

The literature represents a substantial advance over the state of AI-assisted software engineering five years ago. The conceptual vocabulary has matured: terms such as \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} now carry reasonably consistent definitions \cite{schmidgall2024agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been formalised to the point where they can be instantiated in open-source frameworks and evaluated against reproducible benchmarks \cite{sun2025frameworks, jimenez2024swebench}. Performance on software engineering tasks has improved rapidly: SWE-bench resolution rates climbed from under 2\% in 2023 to over 50\% by 2025.

\subsection{Challenges and Limitations}

\textbf{Reliability and hallucination.} Neural agents inherit the hallucination problem of their underlying LLMs \cite{schmidgall2024agentic, liu2024llmagents}. Unlike a standalone LLM response, an agentic system may execute a hallucinated plan across dozens of tool calls before the error becomes apparent, causing compounding damage that is difficult to reverse.

\textbf{Evaluation fragmentation.} Both Jin et al.\ \cite{liu2024llmagents} and Liu et al.\ \cite{yang2024llmse} note the lack of unified evaluation standards. SWE-bench \cite{jimenez2024swebench} addresses this for patch generation, but no comparable benchmark exists for requirements engineering, architecture design, or system-level testing.

\textbf{Coordination scalability.} The auction and consensus mechanisms in \cite{ieee2025multiagent} and the architectural guidelines in SALLMA \cite{sallma2025} address multi-agent coordination at small-to-medium scales. How these approaches perform with dozens or hundreds of concurrent agents remains largely unexplored.

\textbf{Context window limits.} The finite context window of current LLMs constrains project-level state \cite{yang2024llmse, chen2025agentic}. External memory mitigates this, but retrieval accuracy degrades as the knowledge base grows.

\textbf{Security and governance.} Schmidgall and Dornaika \cite{schmidgall2024agentic} identify governance deficits as one of the most critical research gaps. An agent with access to a file system, compiler, and network interface represents a significant attack surface; prompt injection attacks have been demonstrated in practice but are not addressed by any of the surveyed architectural designs.
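
Two of these challenges admit simple back-of-envelope illustrations. Both rest on simplifying assumptions made here for illustration and are not results from the surveyed papers. For reliability, assume that each of the $n$ tool calls in a plan is correct independently with probability $p$ (real agent errors are correlated, so this is only a sketch). The probability that the whole plan executes cleanly is then
\[
  P_{\mathrm{clean}} = p^{n},
\]
for example $0.98^{50} \approx 0.36$, so a 2\% per-step error rate leaves only a little over a one-in-three chance that a 50-step plan completes without error. For coordination scalability, assume a fully connected topology in which any pair of $k$ agents may need to exchange messages. The number of communication channels is then
\[
  \frac{k(k-1)}{2},
\]
which gives 45 channels for ten agents and 4{,}950 for one hundred. Orchestrator-based designs avoid this growth by routing communication through a single coordinating agent.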
\subsection{Comparing Approaches}

A notable disagreement concerns the relative merits of single-agent versus multi-agent designs. Masterman et al.\ \cite{masterman2024landscape} find that single-agent systems with strong reflection are competitive with multi-agent systems on many benchmarks while being simpler to debug. He et al.\ \cite{ishibashi2024multiagent} and Rajendran et al.\ \cite{ieee2025multiagent} argue that specialisation in multi-agent systems produces qualitatively better results for complex, long-horizon tasks. The discrepancy is partly methodological: papers advocating multi-agent systems tend to evaluate on more complex tasks than those used in single-agent studies. A unified benchmark spanning a range of task complexity would help to resolve this debate.
% -------------------------------------------------------
\section{Future Directions}

\textbf{Hybrid neuro-symbolic architectures.} Schmidgall and Dornaika \cite{schmidgall2024agentic} explicitly call for hybrid designs that combine the flexibility of neural agents with the determinism and verifiability of symbolic planners. A symbolic planner could verify the safety of a neural agent's proposed plan before execution, providing formal guarantees currently absent from purely neural systems.

\textbf{Standardised evaluation frameworks.} The evaluation gap identified by Jin et al.\ \cite{liu2024llmagents} and Liu et al.\ \cite{yang2024llmse} needs benchmarks spanning the full development lifecycle---not just code generation. Future work should develop equivalents to SWE-bench \cite{jimenez2024swebench} for requirements elicitation, high-level design, and system integration testing.

\textbf{Long-horizon autonomy and persistent memory.} Park et al.\ \cite{park2023generative} demonstrate the potential of persistent memory and reflection, but their simulation is far simpler than a real software project. Future research should investigate how memory mechanisms scale when agents must track thousands of source files and evolving requirements over months-long cycles. Techniques from continual learning appear particularly relevant.

\textbf{Security and trust.} The governance gaps flagged by Schmidgall and Dornaika \cite{schmidgall2024agentic} indicate that security engineering for agentic systems is largely open. Formal threat models, sandboxing mechanisms, and audit-log designs that allow operators to verify agent behaviour after the fact are all needed.

\textbf{Human-agent collaboration models.} He et al.\ \cite{ishibashi2024multiagent} and Chen et al.\ \cite{chen2025agentic} suggest that the most productive near-term model is collaborative: humans and agents share responsibility across the lifecycle. Designing effective interaction protocols---when an agent should ask for clarification, how human corrections propagate through a plan, and how to represent agent uncertainty to non-expert stakeholders---remains an open problem.
% -------------------------------------------------------
\section{Conclusion}

This survey has reviewed 13 papers published between 2023 and 2026 on the design of software systems incorporating agentic AI. The reviewed literature demonstrates that agentic AI has moved from a theoretical concept to a practical engineering challenge: open-source frameworks \cite{sun2025frameworks} are in active deployment, benchmarks \cite{jimenez2024swebench} provide reproducible measures of progress, and architectural patterns for memory, planning, and multi-agent coordination have been formalised sufficiently for critical comparison.

At the same time, the survey reveals that the field is far from maturity. Hallucination and unreliable planning constrain the autonomy that can be safely delegated. Evaluation standards remain fragmented. Governance and security frameworks are essentially absent from proposed architectural designs. And the long-horizon, project-scale autonomy that would represent a genuine transformation of software practice has not yet been convincingly demonstrated.

The implications for software system design are clear: practitioners adopting agentic AI today must design for human oversight, invest in robust evaluation infrastructure, and treat the agent as an architectural component subject to the same quality attributes---reliability, security, maintainability---as any other system component \cite{sallma2025}. Researchers, meanwhile, have a rich agenda whose resolution will determine how quickly the field moves from promising demonstrations to dependable practice.

% -------------------------------------------------------
\bibliographystyle{ACM-Reference-Format}
\bibliography{references}

\end{document}