
\documentclass[manuscript, anonymous=false]{acmart}

\geometry{twoside=false, left=2.5cm, right=2.5cm, top=2.5cm, bottom=2.5cm}

\setcopyright{none}
\acmDOI{}
\acmISBN{}
\acmConference[JC3506]{Software Design and Implementation}{2025--2026}{University of Aberdeen}
\settopmatter{printacmref=false}
\renewcommand\footnotetextcopyrightpermission[1]{}

\usepackage{booktabs}
\usepackage{float}

\begin{document}

\title{Software System Design with Agentic AI: A Systematic Literature Survey}

\author{SiFan Chen}
\affiliation{%
  \institution{University of Aberdeen}
  \country{United Kingdom}
}
\email{u28sc22@abdn.ac.uk}

\begin{abstract}
Agentic AI---where large language models are embedded in autonomous loops capable of perceiving inputs, forming plans, calling external tools, and revising their own behaviour---has moved from a research curiosity to something practitioners are actively deploying. This paper surveys 13 peer-reviewed and widely cited papers from 2023--2026 on how these systems are being designed and where they fall short. Four themes structure the review: foundational architectures and taxonomies; multi-agent frameworks and coordination; applications across the software engineering lifecycle; and the planning, reasoning, and tool-use mechanisms that make agents tick. The analysis surfaces five persistent open problems: hallucination and reliability, fragmented evaluation practices, coordination overhead at scale, context window constraints, and an almost complete absence of governance frameworks. Several directions look promising---hybrid neuro-symbolic designs, lifecycle-spanning benchmarks, long-horizon persistent memory---but the path from current demonstrations to dependable practice remains considerably longer than recent benchmark numbers suggest.
\end{abstract}

\keywords{agentic AI, software system design, large language models, multi-agent systems, autonomous software engineering}

\maketitle

\section{Introduction}

AI tools have been part of the software engineer's toolkit for years---code completion, static analysis, defect prediction---but they have always operated in a supporting role. The developer decides; the tool assists. What has changed recently is the emergence of systems where that division no longer holds so cleanly. Under the label of \emph{agentic AI}, large language models (LLMs) are now embedded in execution loops that let them perceive their environment, make plans, call external tools, and update their behaviour in response to feedback, all without a human directing each step \cite{abuali2025agentic, wang2024survey}.

For software system design, this shift is structural rather than incremental: it reorients a human--machine relationship that classical software architectures did not anticipate. Those architectures draw a sharp boundary between human intent and machine execution: the engineer specifies; the tool executes within tightly scoped preconditions. Agentic systems dissolve that boundary. A single agent, or a coordinated ensemble, can in principle traverse the entire software development lifecycle autonomously (eliciting and formalising requirements, synthesising and compiling code, executing regression suites, and performing static analysis), cycling through these phases in a planning-execution-reflection loop without a human issuing each intermediate command \cite{jin2024llmagents, wang2025aiagenticprogrammingsurvey}. How to architect such systems for reliability, how to coordinate specialised agents without prohibitive communication overhead, and how to evaluate their outputs against standards that extend beyond task completion rate are open problems for both engineering practice and research, and they are among the more pressing questions currently facing the field.

This survey interrogates how agentic AI systems are designed, evaluated, and coordinated, tracing four mutually reinforcing threads through the literature: foundational taxonomies that partition the design space between symbolic and neural paradigms; coordination mechanisms that emerge when multiple specialised agents are composed into teams; the empirical record of deploying agents across the software engineering lifecycle from requirements elicitation to post-deployment maintenance; and the internal planning, reasoning, and tool-use loops that determine whether an agent can sustain coherent behaviour over extended task horizons. A critical examination of persistent limitations---including hallucination propagation in multi-step execution, evaluation fragmentation across the lifecycle, and the near-absence of governance frameworks in published architectures---and a structured analysis of promising future directions conclude the review.

The 13 primary papers span 2023--2026, sourced from IEEE Xplore, the ACM Digital Library, and arXiv. All are peer-reviewed conference or journal papers, or preprints with documented subsequent journal acceptance.

\section{Research Methodology}

A systematic search was conducted across IEEE Xplore, the ACM Digital Library, arXiv, and Google Scholar using the following keyword combinations:
\emph{agentic AI software system design};
\emph{LLM-based autonomous agents software engineering};
\emph{multi-agent systems LLM software architecture};
\emph{AI agent planning reasoning tool use};
\emph{autonomous software development benchmark}.

Papers were retained when they satisfied three jointly necessary conditions: publication or submission no earlier than January 2023, substantive engagement with the architecture, capabilities, or evaluation of agentic AI systems within a software design or engineering context, and availability as a complete, citable document. The recency threshold reflects the rapid architectural evolution of transformer-based agent frameworks following the widespread deployment of instruction-tuned LLMs at scale---a development that renders most pre-2023 literature structurally distinct in its foundational assumptions about what agents can perceive, plan, and execute \cite{arunkumar2026architectures}. Excluded were studies whose scope was confined to narrow natural language processing tasks without software engineering application, as well as papers whose primary contribution was a novel pre-training methodology rather than an agentic system design; this boundary proved consequential in practice, as the pre-training and agent-deployment literatures have largely evolved in parallel with limited cross-citation. The initial search returned over 200 candidates; after de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.

\textbf{Use of AI-assisted tools.} DeepSeek was used as a supplementary aid for literature organisation and error checking in accordance with the course guidelines. All paper selection, critical analysis, and editorial judgement are the author's own.

\section{Thematic Overview}
\label{sec:themes}

The 13 selected papers are grouped into four thematic clusters in Table~\ref{tab:themes}. Several papers contribute to more than one theme; each is assigned to its dominant focus.

\begin{table*}[t]
\caption{Thematic classification of surveyed papers.}
\label{tab:themes}
\begin{tabular}{@{}lp{3.5cm}p{7cm}@{}}
\toprule
\textbf{Theme} & \textbf{Papers} & \textbf{Core focus} \\
\midrule
Foundations \& Architectures
  & \cite{abuali2025agentic, wang2024survey, arunkumar2026architectures, derouiche2025frameworks}
  & Taxonomies, paradigms, and framework comparisons \\
\addlinespace
Multi-Agent Systems
  & \cite{ishibashi2024multiagent, ieee2025multiagent, sallma2025}
  & Coordination, communication, and architectural patterns for agent teams \\
\addlinespace
SE Applications
  & \cite{jin2024llmagents, liu2024llmse, jimenez2024swebench}
  & Applying agents to requirements, code generation, testing, and maintenance \\
\addlinespace
Planning, Reasoning \& Tool Use
  & \cite{masterman2024landscape, park2023generative, wang2025aiagenticprogrammingsurvey}
  & Internal cognitive mechanisms and execution loops \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{Foundations and Architectures of Agentic AI Systems}

The conceptual vocabulary of the field largely comes from four papers. Abou Ali and Dornaika \cite{abuali2025agentic} draw a line between \emph{symbolic/classical} agents---those relying on deterministic planners and explicit state machines---and \emph{neural/generative} agents driven by stochastic generation and prompt-based orchestration. This distinction, which the authors call a dual-paradigm framework, turns out to be practically useful: the two families have different failure modes and suit different deployment contexts. Wang et al.\ \cite{wang2024survey} take a more component-oriented approach, proposing a unified model with three sub-systems: a \emph{brain} (the LLM itself), a \emph{perception} module, and an \emph{action} module. Arunkumar et al.\ \cite{arunkumar2026architectures} refine this by splitting the brain into Planning, Reasoning, and Memory sub-components. Derouiche et al.\ \cite{derouiche2025frameworks} then ground these abstractions in practice, mapping them to AutoGen \cite{autogendocs}, LangGraph \cite{langgraphdocs}, CrewAI \cite{crewaidocs}, and MetaGPT and comparing their engineering trade-offs.

\subsection{Multi-Agent Frameworks and Coordination}

Composing multiple agents introduces challenges that single-agent designs sidestep. He, Treude, and Lo \cite{ishibashi2024multiagent} survey LLM-based multi-agent (LMA) systems across the software development lifecycle and find that coordination and trust become the dominant concerns once agents take on specialised roles---more so than raw capability. Rajendran et al.\ \cite{ieee2025multiagent} propose a conceptual framework for software design and refactoring that handles this through auction-based task allocation and consensus protocols. The idea is that competing bids surface disagreement early rather than letting conflicting outputs propagate. SALLMA \cite{sallma2025}, from Becattini, Verdecchia, and Vicario, sits at a lower level of abstraction: it is a reference software architecture that specifies concrete interfaces for shared state and real-time agent communication, treating the multi-agent system as something an architect would actually need to deploy and maintain.

\subsection{Applications in Software Engineering}

Jin et al.\ \cite{jin2024llmagents} survey six SE lifecycle domains and find meaningful differences between bare LLM prompting and agent-based approaches, particularly in tasks requiring iterative refinement. Liu et al.\ \cite{liu2024llmse} take a wider lens, categorising 124 papers and noting that tool augmentation and persistent memory are the two capability additions that most consistently improve results, more so than switching to a larger model. The most direct empirical reference point is SWE-bench \cite{jimenez2024swebench}: 2,294 real GitHub issues across 12 Python repositories, each requiring a patch, often spanning several functions or files, that passes the repository's tests. It is not a gentle benchmark.

\subsection{Planning, Reasoning, and Tool Use}

Masterman et al.\ \cite{masterman2024landscape} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} examine how agents actually decompose goals and make use of external tools. Masterman et al.\ identify a three-phase loop---\emph{planning}, \emph{execution}, and \emph{reflection}---and observe that omitting the reflection phase makes systems noticeably more brittle. Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} frame this as \emph{agentic programming}: the LLM writes code, runs it, reads the output, and revises, much like a developer iterating in a REPL. Park et al.\ \cite{park2023generative} supply the empirical underpinning: a 25-agent simulation showing that combining memory retrieval, reflection, and planning produces coherent behaviour over extended time horizons in a way that any one mechanism alone does not.

\section{Detailed Discussion}

\subsection{Foundations and Architectures}

One of the more useful contributions of Abou Ali and Dornaika \cite{abuali2025agentic} is simply drawing a cleaner boundary. Earlier survey work tended to lump 1980s rule-based planners together with modern LLM-driven agents, which made it hard to reason about failure modes or architectural selection. Their dual-paradigm split---symbolic versus neural---gives practitioners a basis for that choice. Reviewing 90 studies from 2018--2025, the authors find that symbolic agents still dominate settings where determinism and formal verification matter, while neural agents have taken over wherever adaptability to messy, data-rich inputs is more important than guarantees.

Wang et al.\ \cite{wang2024survey} are less interested in lineage and more in components. Their architecture places the LLM at the centre as a reasoning engine, flanked by a perception module and an action module. The memory treatment is worth noting: they separate \emph{in-context} (working) memory, which is fast but bounded by the model's context window, from \emph{external} memory stored in vector databases or knowledge graphs. The second type scales to arbitrary size, but every retrieval is a potential source of latency and recall error---a trade-off that does not disappear as hardware improves.
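
The two memory tiers can be illustrated with a minimal Python sketch. It is not the implementation described by Wang et al.; the class name \texttt{AgentMemory}, the window size, and the word-overlap scoring (a crude stand-in for vector similarity) are all assumptions made for illustration.

```python
from collections import deque


class AgentMemory:
    """Toy two-tier memory: a bounded working window plus an unbounded external store."""

    def __init__(self, window_size: int = 3) -> None:
        # deque(maxlen=...) evicts the oldest item, much like a full context window.
        self.working = deque(maxlen=window_size)
        self.external: list[str] = []

    def remember(self, note: str) -> None:
        self.working.append(note)
        self.external.append(note)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Rank external notes by word overlap with the query (assumed scoring rule)."""
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(n.lower().split())), n) for n in self.external]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in scored[:k]]


memory = AgentMemory(window_size=2)
for note in ["parser fails on empty input", "tests live in tests/unit", "use tabs not spaces"]:
    memory.remember(note)

print(list(memory.working))                     # only the two most recent notes survive
print(memory.retrieve("where are the tests"))   # the evicted-or-not note is found externally
```

The working window forgets the oldest note as soon as it is full, while the external store keeps everything but returns only what the retrieval rule happens to match, which is where recall errors enter.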

Arunkumar et al.\ \cite{arunkumar2026architectures} push the taxonomy toward evaluation, arguing that task completion rate is too coarse a metric if the goal is to understand which architectural layer is actually failing. Their historical account of agent loop evolution is useful context: early designs like ReAct used flat sequential structures that are easy to implement but poor at backtracking, while more recent systems use hierarchical search and recursive decomposition to handle non-linear problem solving. Derouiche et al.\ \cite{derouiche2025frameworks} then connect these design choices to framework selection: LangGraph's \cite{langgraphdocs} graph-based execution model handles stateful, cyclical workflows well, while CrewAI \cite{crewaidocs} is easier to configure when the main requirement is a straightforward role-based pipeline.

\subsection{Multi-Agent Frameworks}

The central tension He, Treude, and Lo \cite{ishibashi2024multiagent} identify is not surprising in retrospect: the more specialised agents become, the more inter-agent communication is needed to stop them from producing conflicting outputs. Their proposed research agenda, which is to improve individual capability and coordination simultaneously, is reasonable, though it somewhat sidesteps the question of how to prioritise when resources are constrained.

Rajendran et al.\ \cite{ieee2025multiagent} try to operationalise coordination. Their framework decomposes a change request into subtasks and auctions them to specialised agents; when outputs conflict, a consensus protocol arbitrates. Whether auction-based allocation actually beats simpler assignment strategies in practice is not empirically established in the paper, which remains conceptual. SALLMA \cite{sallma2025} is more concrete. By separating agent logic from infrastructure and prescribing relational databases for structured metadata alongside NoSQL stores for unstructured agent memory, it treats multi-agent systems as something that has to be operated, not just designed. This framing---applying standard quality attributes like availability and maintainability to agentic systems---is one of the more practically grounded contributions in the surveyed literature.
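
To make the auction-and-consensus idea concrete, the following Python sketch shows one possible reading of it. It is not taken from Rajendran et al.; the agent names, the skill scores used as bids, the highest-bid-wins rule, and the strict-majority vote are all hypothetical choices.

```python
from collections import Counter

# Hypothetical agents with self-reported skill levels (0..1) per subtask type.
AGENT_SKILLS = {
    "refactorer": {"rename": 0.9, "extract_method": 0.8, "write_test": 0.3},
    "tester": {"rename": 0.2, "extract_method": 0.4, "write_test": 0.9},
    "reviewer": {"rename": 0.6, "extract_method": 0.6, "write_test": 0.6},
}


def run_auction(subtasks):
    """Assign each subtask to the agent that submits the highest bid."""
    allocation = {}
    for task in subtasks:
        bids = {agent: skills.get(task, 0.0) for agent, skills in AGENT_SKILLS.items()}
        allocation[task] = max(bids, key=bids.get)
    return allocation


def consensus(votes):
    """Return the option backed by a strict majority, or None if there is none."""
    option, count = Counter(votes).most_common(1)[0]
    return option if count > len(votes) / 2 else None


allocation = run_auction(["rename", "extract_method", "write_test"])
print(allocation)
print(consensus(["patch_a", "patch_b", "patch_a"]))
```

A simple round-robin assignment would need none of this machinery, which is exactly the comparison the paper leaves open.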

\subsection{Software Engineering Applications}

Jin et al.\ \cite{jin2024llmagents} cover six SE lifecycle domains, and the picture that emerges is uneven. Requirements engineering and documentation generation look relatively tractable; the gap between agent-based and standalone LLM performance in code generation is real but narrows as tasks become more self-contained. The survey's honest conclusion---that the field lacks unified evaluation standards---means most of these comparisons rest on heterogeneous benchmarks and cannot be taken at face value.

Liu et al.\ \cite{liu2024llmse} analyse 124 papers and find that \emph{tool augmentation} and \emph{memory mechanisms} account for more of the performance variation than model size does. Agents that can call a compiler and keep context across sessions do meaningfully better. Adding more agents to the loop helps further on tasks requiring parallel exploration, but the returns diminish quickly once coordination costs are taken into account.

SWE-bench \cite{jimenez2024swebench} is worth dwelling on. The 2,294 tasks come from real GitHub issue trackers, require navigating codebases of meaningful size, and only count as solved if the patch actually passes the existing test suite. Claude~2 resolved 1.96\% of them at the time of publication. Leading systems crossed 50\% by 2025, which is genuine progress, yet a substantial share of these real-world issues still goes unresolved.
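
The all-or-nothing scoring rule can be sketched in a few lines of Python. This is not the SWE-bench harness; the function names, task identifiers, and test names are hypothetical, and only the two figures in the final line come from the text above.

```python
def is_resolved(test_results: dict[str, bool]) -> bool:
    """A task is resolved only if it has tests and every one of them passes."""
    return bool(test_results) and all(test_results.values())


def resolved_rate(runs: dict[str, dict[str, bool]]) -> float:
    """Fraction of tasks whose patched repository passes all of its tests."""
    return sum(is_resolved(results) for results in runs.values()) / len(runs)


# Hypothetical outcomes for three tasks; the test names are made up.
runs = {
    "task-1": {"test_parse": True, "test_roundtrip": True},
    "task-2": {"test_parse": True, "test_roundtrip": False},  # one failure: unresolved
    "task-3": {"test_cli": False},
}
print(f"{resolved_rate(runs):.2f}")

# Scale check using the figures quoted in the text: 1.96% of 2,294 tasks.
print(round(0.0196 * 2294))
```

Under this rule a patch that fixes most of an issue still scores zero, and the quoted 1.96\% corresponds to roughly 45 of the 2,294 tasks.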

\subsection{Planning, Reasoning, and Tool Use}

Masterman et al.\ \cite{masterman2024landscape} make the case that the planning-execution-reflection loop is the single most consequential architectural choice in agentic system design. Dropping reflection makes systems brittle in a characteristic way: they commit to a plan that is slightly wrong and cannot course-correct. Adding structured self-critique (chain-of-thought self-evaluation being the most common form) recovers robustness, but the token cost and latency overhead are real considerations at production scale. The authors also observe that multi-agent systems tend to waste compute when agents work in parallel without a designated orchestrator---designating one reduces both redundant computation and conflicting execution states.
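
The brittleness argument can be illustrated with a toy planning-execution-reflection loop in Python. It is a sketch under stated assumptions, not Masterman et al.'s design: the planner, the step names, and the feedback label \texttt{malformed\_input} are hypothetical, and a hard-coded rule stands in for LLM-based self-critique.

```python
def plan(goal, feedback=None):
    """Return a list of step names; revise the plan when reflection supplied feedback."""
    if feedback == "malformed_input":
        return ["drop_malformed", "parse", "average"]
    return ["parse", "average"]


def execute(steps, data):
    """Run the steps in order; any failing step raises an exception."""
    for step in steps:
        if step == "drop_malformed":
            data = [x for x in data if x.replace(".", "", 1).isdigit()]
        elif step == "parse":
            data = [float(x) for x in data]
        elif step == "average":
            return sum(data) / len(data)
    raise RuntimeError("plan did not produce a result")


def reflect(error):
    """Map an observed failure to feedback the planner understands."""
    return "malformed_input" if isinstance(error, ValueError) else None


def run_agent(goal, data, use_reflection=True, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        steps = plan(goal, feedback)
        try:
            return execute(steps, data)
        except Exception as error:
            if not use_reflection:
                return None  # without reflection the agent cannot course-correct
            feedback = reflect(error)
    return None


readings = ["2.0", "n/a", "4.0"]
print(run_agent("average the readings", readings))                        # 3.0
print(run_agent("average the readings", readings, use_reflection=False))  # None
```

With reflection enabled the first plan fails, the failure is turned into feedback, and the revised plan succeeds; with reflection disabled the same slightly wrong plan is simply abandoned. Each extra round costs another planning call, which mirrors the token and latency overhead noted above.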

Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} describe \emph{agentic programming} as a methodology rather than just a capability: the agent writes code, executes it, reads the output, and revises in a loop that resembles test-driven development. The more interesting claim is that standard software engineering practices---CI, version control, code review---are not obstacles to agentic execution but potential constraints that make it safer and more auditable. That argument has not been tested at scale, but it points toward an integration story that is more credible than treating agents as replacements for existing tooling.
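
A minimal Python sketch of the write-run-read-revise cycle follows. It makes one large simplifying assumption: a fixed list of drafts (\texttt{DRAFTS}, hypothetical) stands in for the model, whereas a real agent would generate each new draft from the previous error output.

```python
import subprocess
import sys

# Hypothetical stand-in for an LLM: successive drafts of the same script.
DRAFTS = [
    "print(sum(range(1, 11)) / 0)",  # draft 1 crashes with ZeroDivisionError
    "print(sum(range(1, 11)))",      # draft 2 removes the bug
]


def run_script(source: str) -> subprocess.CompletedProcess:
    """Execute a draft in a fresh interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=10,
    )


def write_run_revise(drafts):
    """Try drafts in order; stop at the first one that exits cleanly."""
    history = []
    for attempt, source in enumerate(drafts, start=1):
        result = run_script(source)
        history.append((attempt, result.returncode))
        if result.returncode == 0:
            return result.stdout.strip(), history
        # A real agent would feed result.stderr back into the model here.
    return None, history


output, history = write_run_revise(DRAFTS)
print(output, history)
```

The recorded \texttt{history} is the kind of artefact that CI, version control, and code review can consume, which is one way to read the claim that existing practices make agentic execution more auditable.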

Park et al.\ \cite{park2023generative} provide the empirical baseline for long-horizon behaviour. A 25-agent simulation combining a \emph{memory stream}, \emph{reflection}, and \emph{planning} produced coherent behaviour over time in a way that any one of those components alone did not. The simulation context is far simpler than a real software project, but the finding that all three components are jointly necessary---not interchangeable---has influenced most subsequent architectural work.

\section{Critical Analysis}

\subsection{Advancements}

Compared to where AI-assisted software engineering stood five years ago, the progress is real. Terms like \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} had inconsistent or no definitions in earlier literature; they now carry reasonably stable meanings across papers \cite{abuali2025agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been worked out in enough detail to be implemented in open-source frameworks and measured against reproducible benchmarks \cite{derouiche2025frameworks, jimenez2024swebench}. The jump from sub-2\% to over 50\% on SWE-bench between 2023 and 2025 is the kind of trajectory that justifies the field's current attention, even if it also raises questions about what happens as the easy gains run out.

\subsection{Challenges and Limitations}

\textbf{Reliability and hallucination.} Neural agents carry the hallucination problem of their underlying LLMs into a context that amplifies it \cite{abuali2025agentic, jin2024llmagents}. When an agent executes a hallucinated plan across thirty tool calls before the error surfaces, the resulting state may be difficult or impossible to recover. This is qualitatively different from a standalone LLM producing a wrong answer that a human can discard.
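
A back-of-the-envelope model illustrates the amplification. Assume, purely for illustration, that each of $n$ tool calls is correct independently with probability $p$; real agent errors are correlated, and both the independence assumption and the values of $p$ are hypothetical. The probability that the whole chain is error-free is then
\[
  P(\text{chain correct}) = p^{n}, \qquad 0.99^{30} \approx 0.74, \qquad 0.95^{30} \approx 0.21 .
\]
Under these assumptions, even a 1\% per-step error rate leaves roughly a one-in-four chance that a thirty-call chain contains at least one error.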

\textbf{Evaluation fragmentation.} Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} both flag the absence of unified evaluation standards, and it is not a minor complaint---it means most cross-paper comparisons in this survey are only approximate. SWE-bench \cite{jimenez2024swebench} closes the gap for patch generation. For requirements engineering, architecture design, and system-level testing, the field is still measuring each team's work against its own ruler.

\textbf{Coordination scalability.} The auction and consensus mechanisms in \cite{ieee2025multiagent} and the SALLMA architecture \cite{sallma2025} were designed for systems with a handful of agents. Whether they hold up with dozens or hundreds of concurrent agents is largely untested, and there is no strong theoretical reason to expect linear scaling.
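
A simple count shows why linear scaling should not be expected. If each of $n$ agents may need to exchange messages with every other agent (an all-to-all topology, assumed here as a worst case and not a property of either cited design), the number of distinct communication channels is
\[
  C(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad C(5) = 10, \quad C(50) = 1225, \quad C(500) = 124\,750 .
\]
Routing all traffic through a dedicated orchestrator would reduce this to $n$ channels, at the cost of a central bottleneck.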

\textbf{Context window limits.} This is an architectural constraint rather than a research gap---every LLM has one, and no amount of clever prompting makes it disappear \cite{liu2024llmse, wang2025aiagenticprogrammingsurvey}. External memory pushes the problem out but does not eliminate it; retrieval accuracy degrades as the knowledge base grows, and the degradation is not always predictable.

\textbf{Security and governance.} An agent with read/write access to a file system, a compiler, and a network interface is a significant attack surface \cite{abuali2025agentic}. Prompt injection attacks against agentic systems have been demonstrated outside the lab. None of the architectural designs surveyed here treat this as a first-class concern; it appears, if at all, as a footnote on future work.

\subsection{Comparing Approaches}

The debate over single-agent versus multi-agent architectures remains unresolved, and the divergence stems as much from methodological asymmetry as from genuine differences in architectural capability. Masterman et al.\ \cite{masterman2024landscape} advance the case for single-agent sufficiency. On their account, an agent equipped with a complete planning-execution-reflection loop can perform competitively with multi-agent ensembles while incurring substantially lower coordination overhead. Their key observation, that omitting the reflection phase produces characteristic brittleness in which agents commit to subtly wrong plans without course-correcting, suggests that architectural completeness within a single agent may substitute for specialisation distributed across an agent team. Park et al.\ \cite{park2023generative} lend support to this interpretation through their 25-agent simulation: coherent long-horizon behaviour emerges only when memory retrieval, reflection, and planning are instantiated jointly, and removing any one of the three noticeably degrades outcomes. This non-additive interaction runs counter to the assumption that each mechanism contributes independently.

Against this, He, Treude, and Lo \cite{ishibashi2024multiagent} argue that for tasks requiring concurrent exploration of disjoint state spaces, the sequential planning bottleneck inherent to single-agent designs becomes the binding constraint regardless of how refined each architectural component is. Rajendran et al.\ \cite{ieee2025multiagent} propose to exploit this advantage through an auction-based task allocation protocol in which competing agent bids surface decomposition conflicts before they propagate through the execution graph. This coordination mechanism has no natural single-agent analogue, and its expected benefit lies precisely in the compositionally complex tasks that single-agent evaluations tend to exclude, although the proposal remains conceptual and has not been validated empirically. The experimental record is therefore difficult to reconcile on a common footing: multi-agent papers tend to evaluate on tasks of greater compositional depth, confounding architectural comparison with task difficulty. What the surveyed work does consistently indicate, and what SALLMA's infrastructure analysis makes explicit \cite{sallma2025}, is that coordination overhead grows with agent count and is unlikely to scale linearly. Because none of the surveyed mechanisms has been tested beyond a handful of agents, this cost places a practical ceiling on the multi-agent advantage that optimistic accounts of distributed architectures rarely acknowledge and that no current framework has credibly resolved.

\section{Future Directions}

\textbf{Hybrid neuro-symbolic architectures.} Abou Ali and Dornaika \cite{abuali2025agentic} call for hybrid designs that pair neural flexibility with symbolic verifiability. The specific proposal---a symbolic planner that checks a neural agent's plan before execution---is one reasonable instantiation, though the hard part is specifying what ``safe'' means formally for a plan that modifies a codebase. That engineering problem is not solved by proposing the architecture.
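
One way to picture such a pre-execution check is the Python sketch below. It is not the design proposed by Abou Ali and Dornaika: the \texttt{Step} type, the allowed actions, the protected path prefixes, and the rule that a plan must end with a test run are hypothetical examples of what a ``safe'' plan could mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # e.g. "edit", "delete", "run_tests"
    path: str    # file the step touches ("" if none)


# Hypothetical safety rules; choosing the right rules is the hard, unsolved part.
ALLOWED_ACTIONS = {"edit", "run_tests"}
PROTECTED_PREFIXES = (".git/", "secrets/")


def check_plan(plan: list[Step]) -> list[str]:
    """Return a list of rule violations; an empty list means the plan may run."""
    violations = []
    for i, step in enumerate(plan):
        if step.action not in ALLOWED_ACTIONS:
            violations.append(f"step {i}: action '{step.action}' is not allowed")
        if step.path.startswith(PROTECTED_PREFIXES):
            violations.append(f"step {i}: path '{step.path}' is protected")
    if plan and plan[-1].action != "run_tests":
        violations.append("plan must end by running the tests")
    return violations


proposed = [Step("edit", "src/parser.py"), Step("delete", ".git/config")]
for violation in check_plan(proposed):
    print(violation)
```

The checker is deterministic and easy to audit, but it only rejects what its rules anticipate; a plan that is syntactically permitted can still be semantically harmful.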

\textbf{Evaluation beyond patch generation.} The benchmarking gap flagged by Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} is probably the most practically limiting gap in the field right now. SWE-bench \cite{jimenez2024swebench} does one thing well; there is no equivalent for requirements elicitation, high-level design decisions, or system integration testing. Building such benchmarks is unglamorous work, but without them the field will keep talking past itself.

\textbf{Long-horizon memory at project scale.} Park et al.\ \cite{park2023generative} demonstrate that persistent memory and reflection work in a contained simulation. The open question is how memory mechanisms behave when an agent must track thousands of source files and requirements that evolve across a months-long project. Continual learning research is the closest relevant body of work, but it has not been seriously applied in this context.

\textbf{Security engineering for agentic systems.} The governance gap \cite{abuali2025agentic} is not just a research priority---it is a deployment risk. Formal threat models, sandboxing designs, and audit logs that let operators reconstruct what an agent did and why are all currently missing from proposed architectures. It is somewhat surprising that none of the multi-agent frameworks surveyed here treat access control as a first-class concern.
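
As an illustration of what access control and an audit trail might look like at the tool-call boundary, consider the Python sketch below. It is an assumption-laden example rather than a feature of any surveyed framework: the \texttt{AuditedTools} wrapper, the tool names, and the log fields are all hypothetical.

```python
import json
import time
from typing import Any, Callable


class AuditedTools:
    """Wrap tool functions so every call is checked and recorded."""

    def __init__(self, tools: dict[str, Callable[..., Any]], allowed: set[str]) -> None:
        self._tools = tools
        self._allowed = allowed
        self.log: list[dict[str, Any]] = []

    def call(self, agent: str, tool: str, reason: str, **kwargs: Any) -> Any:
        # Record who asked for what, and why, before anything is executed.
        entry = {"time": time.time(), "agent": agent, "tool": tool,
                 "reason": reason, "args": kwargs, "status": "pending"}
        self.log.append(entry)
        if tool not in self._allowed:
            entry["status"] = "denied"
            raise PermissionError(f"{agent} may not call {tool}")
        try:
            result = self._tools[tool](**kwargs)
        except Exception as error:
            entry["status"] = f"error: {error}"
            raise
        entry["status"] = "ok"
        return result


tools = AuditedTools(
    tools={"read_file": lambda path: f"<contents of {path}>",
           "delete_file": lambda path: None},
    allowed={"read_file"},
)
tools.call("agent-1", "read_file", reason="inspect failing test", path="tests/test_io.py")
try:
    tools.call("agent-1", "delete_file", reason="clean up", path="src/io.py")
except PermissionError as error:
    print(error)
print(json.dumps([{k: e[k] for k in ("tool", "status")} for e in tools.log]))
```

Because denied and failed calls are logged alongside successful ones, an operator can reconstruct what the agent attempted as well as what it did. A wrapper of this kind does not replace sandboxing or a formal threat model.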

\textbf{Human-agent interaction protocols.} He et al.\ \cite{ishibashi2024multiagent} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} both converge on a collaborative model where humans and agents share responsibility rather than the agent operating autonomously. What that means in practice---when should an agent stop and ask for clarification, how do human corrections propagate through an existing plan, how is agent uncertainty communicated to someone without an ML background---is almost entirely unspecified in the current literature.

\section{Conclusion}

The 13 papers reviewed here span roughly three years of a field advancing rapidly, and the picture that emerges is genuinely mixed. Open-source frameworks \cite{derouiche2025frameworks} that implement the architectural patterns described in this survey are actively deployed in production settings, and SWE-bench \cite{jimenez2024swebench} supplies a shared empirical reference point against which the community has begun to converge. Only since the instruction-tuning era has it become feasible to treat \emph{hallucination propagation}, \emph{context window saturation}, and \emph{prompt injection} as first-class engineering parameters to be managed rather than properties that categorically preclude deployment. The architectural vocabulary for memory, planning, and multi-agent coordination---collectively constituting what practitioners now call the agent loop---is stable enough for principled comparison, a state of affairs that Natural Language Processing (NLP) research alone could not have delivered without the complementary advances in agent scaffolding documented across this literature.

On the other hand, the limitations identified in this survey are not peripheral details. Hallucination in agentic execution loops is qualitatively more dangerous than in single-turn generation: a tool-call chain that fails at step twenty-eight may leave a codebase in a partially modified state that requires forensic inspection to diagnose, with no straightforward rollback if the agent did not maintain a structured execution log. Nowhere in the surveyed architectural proposals do security and governance appear as first-class design concerns, despite documented demonstrations of adversarial prompt injection against deployed agentic systems \cite{abuali2025agentic}. Evaluation practices outside patch generation remain fragmented, meaning most cross-paper performance comparisons rest on incommensurable baselines, and the long-horizon project-scale autonomy that would represent a genuine shift in software development practice has not been convincingly demonstrated at scale.

For practitioners adopting agentic AI today, the implication is not to wait, but to proceed with deliberate architecture: designing for human oversight at each agent decision boundary, investing in evaluation infrastructure that spans the full software engineering lifecycle, and treating the agent as a system component subject to the same quality attributes---reliability, security, maintainability---that govern every other element of the stack \cite{sallma2025}. The trajectory from sub-2\% to over 50\% on SWE-bench across two years is a credible signal that further progress is achievable. The architectural vocabulary surveyed here supplies a sufficient foundation for principled system design, provided one resists conflating benchmark performance with operational dependability. What this field still requires---formal threat models, lifecycle-spanning evaluation standards, deployable governance frameworks---are not peripheral research interests but engineering preconditions for the dependable deployment that the field's own architectural ambitions implicitly demand.

\bibliographystyle{IEEEtran}
\bibliography{references}

\end{document}