\documentclass[manuscript, anonymous=false]{acmart}

\geometry{twoside=false, left=2.5cm, right=2.5cm, top=2.5cm, bottom=2.5cm}

\setcopyright{none}
\acmDOI{}
\acmISBN{}
\acmConference[JC3506]{Software Design and Implementation}{2025--2026}{University of Aberdeen}

\usepackage{booktabs}
\usepackage{float}

\begin{document}

\title{Software System Design with Agentic AI: A Systematic Literature Survey}

\author{SiFan Chen}
\affiliation{%
  \institution{University of Aberdeen}
  \country{United Kingdom}
}
\email{u28sc22@abdn.ac.uk}

\begin{abstract}
Agentic AI---where large language models are embedded in autonomous loops capable of perceiving inputs, forming plans, calling external tools, and revising their own behaviour---has moved from a research curiosity to something practitioners are actively deploying. This paper surveys 13 peer-reviewed and widely cited papers from 2023--2026 on how these systems are being designed and where they fall short. Four themes structure the review: foundational architectures and taxonomies; multi-agent frameworks and coordination; applications across the software engineering lifecycle; and the planning, reasoning, and tool-use mechanisms that make agents tick. The analysis surfaces five persistent open problems: hallucination and reliability, fragmented evaluation practices, coordination overhead at scale, context window constraints, and an almost complete absence of governance frameworks. Several directions look promising---hybrid neuro-symbolic designs, lifecycle-spanning benchmarks, long-horizon persistent memory---but the path from current demonstrations to dependable practice remains considerably longer than recent benchmark numbers suggest.
\end{abstract}

\keywords{agentic AI, software system design, large language models, multi-agent systems, autonomous software engineering}

\maketitle

\section{Introduction}

AI tools have been part of the software engineer's toolkit for years---code completion, static analysis, defect prediction---but they have always operated in a supporting role. The developer decides; the tool assists. What has changed recently is the emergence of systems where that division no longer holds so cleanly. Under the label of \emph{agentic AI}, large language models (LLMs) are now embedded in execution loops that let them perceive their environment, make plans, call external tools, and update their behaviour in response to feedback, all without a human directing each step \cite{abuali2025agentic, wang2024survey}.

For software system design, this is not just an incremental improvement. Classical architectures assume a clear boundary between human intent and machine execution. Agentic systems complicate that picture: a single agent, or a group of them, can in principle draft requirements, generate and test code, run a code review, and refactor a module---cycling through these phases without waiting for a human to issue each command \cite{jin2024llmagents}. How to structure such systems, how to get them to plan reliably, how to coordinate multiple agents, and how to measure any of this is now an open engineering and research problem.

This survey reviews recent literature on agentic AI system design across four areas: foundational architectures and taxonomies; multi-agent frameworks and coordination; applications to concrete software engineering tasks; and the internal planning, reasoning, and tool-use mechanisms that agents depend on. A critical analysis of limitations and an outline of future directions follow.

The 13 primary papers span 2023--2026, sourced from IEEE Xplore, the ACM Digital Library, and arXiv. All are peer-reviewed conference or journal papers, or preprints with documented subsequent journal acceptance.

\section{Research Methodology}

A systematic search was conducted across IEEE Xplore, the ACM Digital Library, arXiv, and Google Scholar using the following keyword combinations:
\emph{agentic AI software system design};
\emph{LLM-based autonomous agents software engineering};
\emph{multi-agent systems LLM software architecture};
\emph{AI agent planning reasoning tool use};
\emph{autonomous software development benchmark}.

\textbf{Inclusion criteria:} (i) published or submitted after January 2023; (ii) directly addresses the architecture, capabilities, or evaluation of agentic AI systems in a software design or software engineering context; (iii) available as a full paper.

\textbf{Exclusion criteria:} (i) work focused exclusively on narrow NLP tasks without a software engineering application; (ii) papers whose primary contribution is a new LLM pre-training method rather than an agentic system design.

The initial search returned over 200 candidates. After de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.

\textbf{Use of AI-assisted tools.} DeepSeek was used as a supplementary aid for literature organisation and error checking in accordance with the course guidelines. All paper selection, critical analysis, and editorial judgement are the author's own.

\section{Thematic Overview}
\label{sec:themes}

The 13 selected papers are grouped into four thematic clusters in Table~\ref{tab:themes}. Several papers contribute to more than one theme; each is assigned to its dominant focus.

\begin{table*}[t]
\caption{Thematic classification of surveyed papers.}
\label{tab:themes}
\begin{tabular}{@{}lp{3.5cm}p{7cm}@{}}
\toprule
\textbf{Theme} & \textbf{Papers} & \textbf{Core focus} \\
\midrule
Foundations \& Architectures
  & \cite{abuali2025agentic, wang2024survey, arunkumar2026architectures, derouiche2025frameworks}
  & Taxonomies, paradigms, and framework comparisons \\
\addlinespace
Multi-Agent Systems
  & \cite{ishibashi2024multiagent, ieee2025multiagent, sallma2025}
  & Coordination, communication, and architectural patterns for agent teams \\
\addlinespace
SE Applications
  & \cite{jin2024llmagents, liu2024llmse, jimenez2024swebench}
  & Applying agents to requirements, code generation, testing, and maintenance \\
\addlinespace
Planning, Reasoning \& Tool Use
  & \cite{masterman2024landscape, park2023generative, wang2025aiagenticprogrammingsurvey}
  & Internal cognitive mechanisms and execution loops \\
\bottomrule
\end{tabular}
\end{table*}

\subsection{Foundations and Architectures of Agentic AI Systems}

The conceptual vocabulary of the field largely comes from four papers. Abou Ali and Dornaika \cite{abuali2025agentic} draw a line between \emph{symbolic/classical} agents---those relying on deterministic planners and explicit state machines---and \emph{neural/generative} agents driven by stochastic generation and prompt-based orchestration. This distinction, which the authors call a dual-paradigm framework, turns out to be practically useful: the two families have different failure modes and suit different deployment contexts. Wang et al.\ \cite{wang2024survey} take a more component-oriented approach, proposing a unified model with three sub-systems: a \emph{brain} (the LLM itself), a \emph{perception} module, and an \emph{action} module. Arunkumar et al.\ \cite{arunkumar2026architectures} refine this by splitting the brain into Planning, Reasoning, and Memory sub-components. Derouiche et al.\ \cite{derouiche2025frameworks} then ground these abstractions in practice, mapping them to AutoGen \cite{autogendocs}, LangGraph \cite{langgraphdocs}, CrewAI \cite{crewaidocs}, and MetaGPT and comparing their engineering trade-offs.

\subsection{Multi-Agent Frameworks and Coordination}

Composing multiple agents introduces challenges that single-agent designs sidestep. He, Treude, and Lo \cite{ishibashi2024multiagent} survey LLM-based multi-agent (LMA) systems across the software development lifecycle and find that coordination and trust become the dominant concerns once agents take on specialised roles---more so than raw capability. Rajendran et al.\ \cite{ieee2025multiagent} propose a conceptual framework for software design and refactoring that handles this through auction-based task allocation and consensus protocols. The idea is that competing bids surface disagreement early rather than letting conflicting outputs propagate. SALLMA \cite{sallma2025}, from Becattini, Verdecchia, and Vicario, sits at a lower level of abstraction: it is a reference software architecture that specifies concrete interfaces for shared state and real-time agent communication, treating the multi-agent system as something an architect would actually need to deploy and maintain.

\subsection{Applications in Software Engineering}

Jin et al.\ \cite{jin2024llmagents} survey six SE lifecycle domains and find meaningful differences between bare LLM prompting and agent-based approaches, particularly in tasks requiring iterative refinement. Liu et al.\ \cite{liu2024llmse} take a wider lens, categorising 124 papers and noting that tool augmentation and persistent memory are the two capability additions that most consistently improve results---more so than switching to a larger model. The most direct empirical reference point is SWE-bench \cite{jimenez2024swebench}: 2,294 real GitHub issues across 12 Python repositories, each requiring a multi-file patch that passes the existing test suite. It is not a gentle benchmark.

\subsection{Planning, Reasoning, and Tool Use}

Masterman et al.\ \cite{masterman2024landscape} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} examine how agents actually decompose goals and make use of external tools. Masterman et al.\ identify a three-phase loop---\emph{planning}, \emph{execution}, and \emph{reflection}---and observe that omitting the reflection phase makes systems noticeably more brittle. Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} frame this as \emph{agentic programming}: the LLM writes code, runs it, reads the output, and revises, much like a developer iterating in a REPL. Park et al.\ \cite{park2023generative} supply the empirical underpinning: a 25-agent simulation showing that combining memory retrieval, reflection, and planning produces coherent behaviour over extended time horizons in a way that any one mechanism alone does not.

\section{Detailed Discussion}

\subsection{Foundations and Architectures}

One of the more useful contributions of Abou Ali and Dornaika \cite{abuali2025agentic} is simply drawing a cleaner boundary. Earlier survey work tended to lump 1980s rule-based planners together with modern LLM-driven agents, which made it hard to reason about failure modes or architectural selection. Their dual-paradigm split---symbolic versus neural---gives practitioners a basis for that choice. Reviewing 90 studies from 2018--2025, the authors find that symbolic agents still dominate settings where determinism and formal verification matter, while neural agents have taken over wherever adaptability to messy, data-rich inputs is more important than guarantees.

Wang et al.\ \cite{wang2024survey} are less interested in lineage and more in components. Their architecture places the LLM at the centre as a reasoning engine, flanked by a perception module and an action module. The memory treatment is worth noting: they separate \emph{in-context} (working) memory, which is fast but bounded by the model's context window, from \emph{external} memory stored in vector databases or knowledge graphs. The second type scales to arbitrary size, but every retrieval is a potential source of latency and recall error---a trade-off that does not disappear as hardware improves.
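
To make the working-memory bound concrete, consider a rough token budget; the symbols and numbers below are illustrative assumptions, not figures from Wang et al. Let $W$ be the context window in tokens, $p$ the tokens used by the system prompt and tool descriptions, $h$ the tokens of retained interaction history, and $c$ the average size of one retrieved chunk. The number of chunks $k$ that external memory can feed back into a single call is then bounded by
\[
  p + h + k\,c \le W
  \quad\Longrightarrow\quad
  k \le \left\lfloor \frac{W - p - h}{c} \right\rfloor .
\]
With hypothetical values $W = 8192$, $p = 1000$, $h = 3000$, and $c = 500$, this gives $k \le \lfloor 4192/500 \rfloor = 8$. Under this sketch, however large the external store grows, only a handful of chunks reach the model per call, so retrieval quality rather than storage capacity becomes the limiting factor.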

Arunkumar et al.\ \cite{arunkumar2026architectures} push the taxonomy toward evaluation, arguing that task completion rate is too coarse a metric if the goal is to understand which architectural layer is actually failing. Their historical account of agent loop evolution is useful context: early designs like ReAct used flat sequential structures that are easy to implement but poor at backtracking, while more recent systems use hierarchical search and recursive decomposition to handle non-linear problem solving. Derouiche et al.\ \cite{derouiche2025frameworks} then connect these design choices to framework selection: LangGraph's \cite{langgraphdocs} graph-based execution model handles stateful, cyclical workflows well, while CrewAI \cite{crewaidocs} is easier to configure when the main requirement is a straightforward role-based pipeline.

\subsection{Multi-Agent Frameworks}

The central tension He, Treude, and Lo \cite{ishibashi2024multiagent} identify is not surprising in retrospect: the more specialised your agents become, the more inter-agent communication you need to stop them from producing conflicting outputs. Their proposed research agenda---improve individual capability and coordination simultaneously---is reasonable, though it somewhat sidesteps the question of how to prioritise when resources are constrained.

Rajendran et al.\ \cite{ieee2025multiagent} try to operationalise coordination. Their framework decomposes a change request into subtasks and auctions them to specialised agents; when outputs conflict, a consensus protocol arbitrates. Whether auction-based allocation actually beats simpler assignment strategies in practice is not empirically established in the paper, which remains conceptual. SALLMA \cite{sallma2025} is more concrete. By separating agent logic from infrastructure and prescribing relational databases for structured metadata alongside NoSQL stores for unstructured agent memory, it treats multi-agent systems as something that has to be operated, not just designed. This framing---applying standard quality attributes like availability and maintainability to agentic systems---is one of the more practically grounded contributions in the surveyed literature.

\subsection{Software Engineering Applications}

Jin et al.\ \cite{jin2024llmagents} cover six SE lifecycle domains, and the picture that emerges is uneven. Requirements engineering and documentation generation look relatively tractable; the gap between agent-based and standalone LLM performance in code generation is real but narrows as tasks become more self-contained. The survey's honest conclusion---that the field lacks unified evaluation standards---means most of these comparisons rest on heterogeneous benchmarks and cannot be taken at face value.

Liu et al.\ \cite{liu2024llmse} analyse 124 papers and find that \emph{tool augmentation} and \emph{memory mechanisms} account for more of the performance variation than model size does. Agents that can call a compiler and keep context across sessions do meaningfully better; adding more agents to the loop helps further on tasks requiring parallel exploration, but the returns diminish quickly as coordination costs grow.

SWE-bench \cite{jimenez2024swebench} is worth dwelling on. The 2,294 tasks come from real GitHub issue trackers, require navigating codebases of meaningful size, and only count as solved if the patch actually passes the existing test suite. Claude~2 resolved 1.96\% of them at the time of publication. Leading systems crossed 50\% by 2025, which is genuine progress, but it also means that a large share of real-world issues is still out of reach.
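
For a sense of scale, the reported rate can be converted into an approximate task count. Taking the full 2,294-task set as the denominator,
\[
  0.0196 \times 2294 \approx 45 ,
\]
so the Claude~2 baseline corresponds to roughly 45 resolved issues. As a purely illustrative assumption, a system that resolved exactly 50\% of the same 2,294 tasks would solve $0.50 \times 2294 = 1147$ and leave the other 1,147 unresolved; later results are not always reported on the full set, so this second figure should be read as a scale indicator rather than a reported result.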

\subsection{Planning, Reasoning, and Tool Use}

Masterman et al.\ \cite{masterman2024landscape} make the case that the planning-execution-reflection loop is the single most consequential architectural choice in agentic system design. Dropping reflection makes systems brittle in a characteristic way: they commit to a plan that is slightly wrong and cannot course-correct. Adding structured self-critique (chain-of-thought self-evaluation being the most common form) recovers robustness, but the token cost and latency overhead are real considerations at production scale. The authors also observe that multi-agent systems tend to waste compute when agents work in parallel without a designated orchestrator---designating one reduces both redundant computation and conflicting execution states.

Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} describe \emph{agentic programming} as a methodology rather than just a capability: the agent writes code, executes it, reads the output, and revises in a loop that resembles test-driven development. The more interesting claim is that standard software engineering practices---CI, version control, code review---are not obstacles to agentic execution but potential constraints that make it safer and more auditable. That argument has not been tested at scale, but it points toward an integration story that is more credible than treating agents as replacements for existing tooling.

Park et al.\ \cite{park2023generative} provide the empirical baseline for long-horizon behaviour. A 25-agent simulation combining a \emph{memory stream}, \emph{reflection}, and \emph{planning} produced coherent behaviour over time in a way that any one of those components alone did not. The simulation context is far simpler than a real software project, but the finding that all three components are jointly necessary---not interchangeable---has influenced most subsequent architectural work.

\section{Critical Analysis}

\subsection{Advancements}

Compared to where AI-assisted software engineering stood five years ago, the progress is real. Terms like \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} had inconsistent or no definitions in earlier literature; they now carry reasonably stable meanings across papers \cite{abuali2025agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been worked out in enough detail to be implemented in open-source frameworks and measured against reproducible benchmarks \cite{derouiche2025frameworks, jimenez2024swebench}. The jump from sub-2\% to over 50\% on SWE-bench between 2023 and 2025 is the kind of trajectory that justifies the field's current attention, even if it also raises questions about what happens as the easy gains run out.

\subsection{Challenges and Limitations}

\textbf{Reliability and hallucination.} Neural agents carry the hallucination problem of their underlying LLMs into a context that amplifies it \cite{abuali2025agentic, jin2024llmagents}. When an agent executes a hallucinated plan across thirty tool calls before the error surfaces, the resulting state may be difficult or impossible to recover. This is qualitatively different from a standalone LLM producing a wrong answer that a human can discard.

\textbf{Evaluation fragmentation.} Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} both flag the absence of unified evaluation standards, and it is not a minor complaint---it means most cross-paper comparisons in this survey are only approximate. SWE-bench \cite{jimenez2024swebench} closes the gap for patch generation. For requirements engineering, architecture design, and system-level testing, the field is still measuring each team's work against its own ruler.

\textbf{Coordination scalability.} The auction and consensus mechanisms in \cite{ieee2025multiagent} and the SALLMA architecture \cite{sallma2025} were designed for systems with a handful of agents. Whether they hold up with dozens or hundreds of concurrent agents is largely untested, and there is no strong theoretical reason to expect coordination cost to grow only linearly with the number of agents.
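
A back-of-the-envelope count shows why. The topologies below are illustrative assumptions, not configurations evaluated in the surveyed papers. If each of $n$ agents may need to exchange messages with every other agent, the number of pairwise channels is
\[
  C_{\mathrm{mesh}}(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\]
whereas routing all communication through one designated orchestrator, itself counted among the $n$ agents, needs only
\[
  C_{\mathrm{star}}(n) = n - 1 .
\]
For $n = 5$ this is 10 channels against 4; for $n = 100$ it is 4950 against 99. The orchestrator keeps the channel count linear but concentrates load and failure risk in a single component, so neither topology scales for free.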

\textbf{Context window limits.} This is an architectural constraint rather than a research gap---every LLM has one, and no amount of clever prompting makes it disappear \cite{liu2024llmse, wang2025aiagenticprogrammingsurvey}. External memory pushes the problem out but does not eliminate it; retrieval accuracy degrades as the knowledge base grows, and the degradation is not always predictable.

\textbf{Security and governance.} An agent with read/write access to a file system, a compiler, and a network interface is a significant attack surface \cite{abuali2025agentic}. Prompt injection attacks against agentic systems have been demonstrated outside the lab. None of the architectural designs surveyed here treat this as a first-class concern; it appears, if at all, as a footnote on future work.

\subsection{Comparing Approaches}

The single-agent versus multi-agent debate is not settled, and the disagreement is partly a measurement artefact. Masterman et al.\ \cite{masterman2024landscape} show that a single agent with strong reflection is competitive with multi-agent systems on a range of benchmarks and considerably easier to debug. He et al.\ \cite{ishibashi2024multiagent} and Rajendran et al.\ \cite{ieee2025multiagent} push back, arguing that specialisation in multi-agent systems produces better results on complex long-horizon tasks. Both positions are defensible given the benchmarks each paper uses; papers advocating multi-agent systems consistently evaluate on more complex tasks. Until there is a benchmark that varies task complexity as a controlled dimension, the debate will continue to generate more heat than light.

\section{Future Directions}

\textbf{Hybrid neuro-symbolic architectures.} Abou Ali and Dornaika \cite{abuali2025agentic} call for hybrid designs that pair neural flexibility with symbolic verifiability. The specific proposal---a symbolic planner that checks a neural agent's plan before execution---is one reasonable instantiation, though the hard part is specifying what ``safe'' means formally for a plan that modifies a codebase. That engineering problem is not solved by proposing the architecture.

\textbf{Evaluation beyond patch generation.} The benchmarking gap flagged by Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} is probably the most practically limiting gap in the field right now. SWE-bench \cite{jimenez2024swebench} does one thing well; there is no equivalent for requirements elicitation, high-level design decisions, or system integration testing. Building such benchmarks is unglamorous work, but without them the field will keep talking past itself.

\textbf{Long-horizon memory at project scale.} Park et al.\ \cite{park2023generative} demonstrate that persistent memory and reflection work in a contained simulation. The open question is how memory mechanisms behave when an agent must track thousands of source files and requirements that evolve across a months-long project. Continual learning research is the closest relevant body of work, but it has not been seriously applied in this context.

\textbf{Security engineering for agentic systems.} The governance gap \cite{abuali2025agentic} is not just a research priority---it is a deployment risk. Formal threat models, sandboxing designs, and audit logs that let operators reconstruct what an agent did and why are all currently missing from proposed architectures. It is somewhat surprising that none of the multi-agent frameworks surveyed here treat access control as a first-class concern.

\textbf{Human-agent interaction protocols.} He et al.\ \cite{ishibashi2024multiagent} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} both converge on a collaborative model where humans and agents share responsibility rather than the agent operating autonomously. What that means in practice---when should an agent stop and ask for clarification, how do human corrections propagate through an existing plan, how is agent uncertainty communicated to someone without an ML background---is almost entirely unspecified in the current literature.

\section{Conclusion}

The 13 papers reviewed here cover roughly three years of a field that has been moving quickly. The picture that emerges is genuinely mixed. On the positive side: open-source frameworks \cite{derouiche2025frameworks} are actively deployed, SWE-bench \cite{jimenez2024swebench} provides a shared empirical reference point, and the architectural vocabulary for memory, planning, and multi-agent coordination is now stable enough for careful comparison. That is more than could be said in 2022.

On the other hand, the gaps are not minor. Hallucination in agentic loops is harder to catch and recover from than in standalone LLM usage. Evaluation practices outside patch generation remain fragmented, which means many performance claims in the literature rest on shaky ground. Security and governance are essentially absent from the architectural proposals, which will matter increasingly as these systems acquire more capabilities and broader access. And the kind of long-horizon, project-scale autonomy that would constitute a genuine shift in how software is built has not been demonstrated convincingly.

For practitioners adopting agentic AI today, the implication is not to wait, but to be deliberate: design for human oversight, invest in evaluation infrastructure, and treat the agent as an architectural component with the same quality requirements---reliability, security, maintainability---as anything else in the system \cite{sallma2025}. The technology is real; the engineering discipline around it is still catching up.

\bibliographystyle{IEEEtran}
\bibliography{references}

\end{document}