remove unnecessary comments from main.tex and references.bib

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix bib
2026-05-10 20:00:31 +08:00 · 2026-05-10 17:14:18 +08:00
2 changed files with 143 additions and 140 deletions
@@ -1,24 +1,15 @@
 % Submission filename: JC3506_A1_<Surname>_<FirstName>_<StudentID>.pdf
 % Course: JC3506 Software Design and Implementation
 % Assessment: Individual Study — Systematic Literature Survey
 % Topic: Software System Design with Agentic AI
 \documentclass[manuscript, anonymous=false]{acmart}
 %% Force symmetric margins (override acmart's twoside default)
 \geometry{twoside=false, left=2.5cm, right=2.5cm, top=2.5cm, bottom=2.5cm}
 %% ACM rights / metadata — left blank for student submission
 \setcopyright{none}
 \acmDOI{}
 \acmISBN{}
 \acmConference[JC3506]{Software Design and Implementation}{2025--2026}{University of Aberdeen}
 %% Additional packages (acmart already loads hyperref, natbib, geometry)
 \usepackage{booktabs}
 \usepackage{float}
 % -------------------------------------------------------
 \begin{document}
 \title{Software System Design with Agentic AI: A Systematic Literature Survey}
@@ -38,18 +29,16 @@ Agentic AI systems---large language models embedded within autonomous execution
 \maketitle
 % -------------------------------------------------------
 \section{Introduction}
-Artificial intelligence has long been applied to software engineering in forms that assist but do not act: code completion tools, static analysers, and defect predictors all augment a human developer without replacing their judgment. A qualitatively different model has now emerged under the label of \emph{agentic AI}, in which a large language model (LLM) is embedded within an autonomous execution loop that can perceive its environment, form plans, invoke external tools, and revise its behaviour based on feedback---all without step-by-step human direction \cite{schmidgall2024agentic, wang2024survey}.
+Artificial intelligence has long been applied to software engineering in forms that assist but do not act: code completion tools, static analysers, and defect predictors all augment a human developer without replacing their judgment. A qualitatively different model has now emerged under the label of \emph{agentic AI}, in which a large language model (LLM) is embedded within an autonomous execution loop that can perceive its environment, form plans, invoke external tools, and revise its behaviour based on feedback---all without step-by-step human direction \cite{abuali2025agentic, wang2024survey}.
-This shift carries profound implications for software system design. Classical software architecture treats the system boundary as a fixed interface between human intent and machine execution. Agentic systems dissolve that boundary: a single agent or a team of collaborating agents can now draft requirements, generate and test code, perform code review, and refactor modules in a continuous loop \cite{liu2024llmagents}. The design of such systems---how agents are structured, how they plan, how they share state, and how they are evaluated---has itself become an active research area.
+This shift carries profound implications for software system design. Classical software architecture treats the system boundary as a fixed interface between human intent and machine execution. Agentic systems dissolve that boundary: a single agent or a team of collaborating agents can now draft requirements, generate and test code, perform code review, and refactor modules in a continuous loop \cite{jin2024llmagents}. The design of such systems---how agents are structured, how they plan, how they share state, and how they are evaluated---has itself become an active research area.
 This survey provides a structured review of recent literature on the design of software systems that incorporate or consist of agentic AI. The review covers four interlocking themes: (1) foundational architectures and taxonomies of agentic systems; (2) multi-agent frameworks and coordination mechanisms; (3) the application of agentic AI to concrete software engineering tasks; and (4) the reasoning, planning, and tool-use capabilities that underpin agent behaviour. The survey closes with a critical analysis of current limitations and a discussion of open research directions.
 The selected literature spans 2023--2026, drawn primarily from IEEE Xplore, the ACM Digital Library, and arXiv. All 13 primary sources are peer-reviewed conference or journal papers, or widely cited preprints with subsequent journal acceptance.
 % -------------------------------------------------------
 \section{Research Methodology}
 A systematic search was conducted across IEEE Xplore, the ACM Digital Library, arXiv, and Google Scholar using the following keyword combinations:
@@ -65,7 +54,6 @@ A systematic search was conducted across IEEE Xplore, the ACM Digital Library, a
 The initial search returned over 200 candidates. After de-duplication and title-and-abstract screening, 13 primary papers were retained and grouped into four thematic clusters as described in Section~\ref{sec:themes}.
 % -------------------------------------------------------
 \section{Thematic Overview}
 \label{sec:themes}
@@ -79,7 +67,7 @@ The 13 selected papers are grouped into four thematic clusters in Table~\ref{tab
 \textbf{Theme} & \textbf{Papers} & \textbf{Core focus} \\
 \midrule
 Foundations \& Architectures
-  & \cite{schmidgall2024agentic, wang2024survey, sun2026architectures, sun2025frameworks}
+  & \cite{abuali2025agentic, wang2024survey, arunkumar2026architectures, derouiche2025frameworks}
  & Taxonomies, paradigms, and framework comparisons \\
 \addlinespace
 Multi-Agent Systems
@@ -87,11 +75,11 @@ Multi-Agent Systems
  & Coordination, communication, and architectural patterns for agent teams \\
 \addlinespace
 SE Applications
-  & \cite{liu2024llmagents, yang2024llmse, jimenez2024swebench}
+  & \cite{jin2024llmagents, liu2024llmse, jimenez2024swebench}
  & Applying agents to requirements, code generation, testing, and maintenance \\
 \addlinespace
 Planning, Reasoning \& Tool Use
-  & \cite{masterman2024landscape, park2023generative, chen2025agentic}
+  & \cite{masterman2024landscape, park2023generative, wang2025aiagenticprogrammingsurvey}
  & Internal cognitive mechanisms and execution loops \\
 \bottomrule
 \end{tabular}
@@ -99,7 +87,7 @@ Planning, Reasoning \& Tool Use
 \subsection{Foundations and Architectures of Agentic AI Systems}
-The foundational literature establishes the conceptual vocabulary and architectural patterns that the rest of the field builds upon. Schmidgall and Dornaika \cite{schmidgall2024agentic} introduce a \emph{dual-paradigm} framework that separates \emph{symbolic/classical} agents (relying on deterministic planning and persistent state machines) from \emph{neural/generative} agents (driven by stochastic generation and prompt-based orchestration). Wang et al.\ \cite{wang2024survey} propose a unified architectural model centred on three sub-systems: a \emph{brain} (the LLM), a \emph{perception} module, and an \emph{action} module. Sun et al.\ \cite{sun2026architectures} extend this by decomposing the brain into Planning, Reasoning, and Memory components. The framework survey by Sun et al.\ \cite{sun2025frameworks} maps these abstractions onto concrete open-source frameworks---AutoGen, LangGraph, CrewAI, and MetaGPT---analysing their design trade-offs.
+The foundational literature establishes the conceptual vocabulary and architectural patterns that the rest of the field builds upon. Abou Ali and Dornaika \cite{abuali2025agentic} introduce a \emph{dual-paradigm} framework that separates \emph{symbolic/classical} agents (relying on deterministic planning and persistent state machines) from \emph{neural/generative} agents (driven by stochastic generation and prompt-based orchestration). Wang et al.\ \cite{wang2024survey} propose a unified architectural model centred on three sub-systems: a \emph{brain} (the LLM), a \emph{perception} module, and an \emph{action} module. Arunkumar et al.\ \cite{arunkumar2026architectures} extend this by decomposing the brain into Planning, Reasoning, and Memory components. The framework survey by Derouiche et al.\ \cite{derouiche2025frameworks} maps these abstractions onto concrete open-source frameworks---AutoGen, LangGraph, CrewAI, and MetaGPT---analysing their design trade-offs.
 \subsection{Multi-Agent Frameworks and Coordination}
@@ -107,22 +95,21 @@ Once individual agent architectures are established, a natural extension is comp
 \subsection{Tool Use, Planning, and Reasoning}
-The internal mechanisms that allow agents to decompose goals and invoke external resources are surveyed by Masterman et al.\ \cite{masterman2024landscape} and Chen et al.\ \cite{chen2025agentic}. Masterman et al.\ examine single-agent and multi-agent implementations and identify three critical phases---\emph{planning}, \emph{execution}, and \emph{reflection}---present in robust systems. Chen et al.\ focus on \emph{agentic programming} as an emerging paradigm in which agents autonomously iterate on a task. Park et al.\ \cite{park2023generative} provide a foundational empirical study demonstrating that architectures combining memory retrieval, reflection, and planning can produce coherent long-horizon behaviour.
+The internal mechanisms that allow agents to decompose goals and invoke external resources are surveyed by Masterman et al.\ \cite{masterman2024landscape} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey}. Masterman et al.\ examine single-agent and multi-agent implementations and identify three critical phases---\emph{planning}, \emph{execution}, and \emph{reflection}---present in robust systems. Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} focus on \emph{agentic programming} as an emerging paradigm in which agents autonomously iterate on a task. Park et al.\ \cite{park2023generative} provide a foundational empirical study demonstrating that architectures combining memory retrieval, reflection, and planning can produce coherent long-horizon behaviour.
 \subsection{Applications in Software Engineering}
-Three papers evaluate agentic AI directly against software engineering tasks. Jin et al.\ \cite{liu2024llmagents} conduct a broad survey covering six SE domains, establishing clear distinctions between standalone LLMs and agent-based systems in terms of autonomy and self-improvement. Liu et al.\ \cite{yang2024llmse} categorise 124 papers from both the SE and agent-capability perspectives, showing that tool-augmented agents consistently outperform standalone models. Jimenez et al.\ \cite{jimenez2024swebench} introduce SWE-bench, a benchmark of 2,294 real-world GitHub issues drawn from 12 Python repositories, providing the field's most widely used empirical measuring stick.
+Three papers evaluate agentic AI directly against software engineering tasks. Jin et al.\ \cite{jin2024llmagents} conduct a broad survey covering six SE domains, establishing clear distinctions between standalone LLMs and agent-based systems in terms of autonomy and self-improvement. Liu et al.\ \cite{liu2024llmse} categorise 124 papers from both the SE and agent-capability perspectives, showing that tool-augmented agents consistently outperform standalone models. Jimenez et al.\ \cite{jimenez2024swebench} introduce SWE-bench, a benchmark of 2,294 real-world GitHub issues drawn from 12 Python repositories, providing the field's most widely used empirical measuring stick.
 % -------------------------------------------------------
 \section{Detailed Discussion}
 \subsection{Foundations and Architectures}
-The dual-paradigm framework of Schmidgall and Dornaika \cite{schmidgall2024agentic} resolves a persistent ambiguity: earlier surveys grouped rule-based planners from the 1980s with modern LLM-driven agents, obscuring fundamental differences in uncertainty handling and knowledge representation. By separating symbolic and neural lineages, the authors provide a principled basis for architectural selection. Their analysis of 90 studies (2018--2025) shows that symbolic agents dominate safety-critical settings where determinism and formal verification are required, while neural agents prevail in adaptive, data-rich environments.
+The dual-paradigm framework of Abou Ali and Dornaika \cite{abuali2025agentic} resolves a persistent ambiguity: earlier surveys grouped rule-based planners from the 1980s with modern LLM-driven agents, obscuring fundamental differences in uncertainty handling and knowledge representation. By separating symbolic and neural lineages, the authors provide a principled basis for architectural selection. Their analysis of 90 studies (2018--2025) shows that symbolic agents dominate safety-critical settings where determinism and formal verification are required, while neural agents prevail in adaptive, data-rich environments.
 Wang et al.\ \cite{wang2024survey} complement this with component-level analysis. Their architecture positions the LLM as a central reasoning engine. Memory is divided into \emph{in-context} (working) memory and \emph{external} memory (vector databases, knowledge graphs)---a distinction with direct engineering implications: in-context memory is bounded by the model's context window, while external memory scales arbitrarily but introduces retrieval latency and recall errors.
-Sun et al.\ \cite{sun2026architectures} extend the taxonomy to evaluation, arguing that agents should be assessed across all five architectural layers rather than solely by task completion rate. The authors document how early agent loops such as ReAct adopted flat sequential structures, while more recent designs use hierarchical search and recursive decomposition for non-linear problem solving. The framework comparison in Sun et al.\ \cite{sun2025frameworks} translates these abstractions into engineering decisions: LangGraph's graph-based execution model supports stateful, cyclical workflows, whereas CrewAI prioritises ease of configuration for role-based pipelines.
+Arunkumar et al.\ \cite{arunkumar2026architectures} extend the taxonomy to evaluation, arguing that agents should be assessed across all five architectural layers rather than solely by task completion rate. The authors document how early agent loops such as ReAct adopted flat sequential structures, while more recent designs use hierarchical search and recursive decomposition for non-linear problem solving. The framework comparison in Derouiche et al.\ \cite{derouiche2025frameworks} translates these abstractions into engineering decisions: LangGraph's graph-based execution model supports stateful, cyclical workflows, whereas CrewAI prioritises ease of configuration for role-based pipelines.
 \subsection{Multi-Agent Frameworks}
@@ -132,9 +119,9 @@ Rajendran et al.\ \cite{ieee2025multiagent} operationalise this in a conceptual
 \subsection{Software Engineering Applications}
-Jin et al.\ \cite{liu2024llmagents} survey six SE lifecycle domains. In requirements engineering, agents can elicit requirements through dialogue and generate formal specifications. In code generation, agent-based approaches outperform standalone LLM prompting by iterating on failing test cases. In software design, agents are used to generate class diagrams, API contracts, and architecture documentation. The survey concludes that the field lacks unified evaluation standards, making cross-paper comparison difficult.
+Jin et al.\ \cite{jin2024llmagents} survey six SE lifecycle domains. In requirements engineering, agents can elicit requirements through dialogue and generate formal specifications. In code generation, agent-based approaches outperform standalone LLM prompting by iterating on failing test cases. In software design, agents are used to generate class diagrams, API contracts, and architecture documentation. The survey concludes that the field lacks unified evaluation standards, making cross-paper comparison difficult.
-Liu et al.\ \cite{yang2024llmse} address this gap by analysing 124 papers from both SE-task and agent-capability perspectives. They find that the most impactful capability additions are \emph{tool augmentation} (granting the agent access to compilers, test runners, and search engines) and \emph{memory mechanisms} (allowing agents to maintain project-level context across sessions). Multi-agent coordination provides further gains on tasks requiring parallel exploration of solution spaces.
+Liu et al.\ \cite{liu2024llmse} address this gap by analysing 124 papers from both SE-task and agent-capability perspectives. They find that the most impactful capability additions are \emph{tool augmentation} (granting the agent access to compilers, test runners, and search engines) and \emph{memory mechanisms} (allowing agents to maintain project-level context across sessions). Multi-agent coordination provides further gains on tasks requiring parallel exploration of solution spaces.
 SWE-bench \cite{jimenez2024swebench} provides the most direct empirical evidence of the state of the art. Its 2,294 tasks require agents to understand issue descriptions, navigate large codebases, and produce multi-file patches that pass existing test suites. The 1.96\% success rate achieved by Claude~2 at publication time, rising to over 50\% for leading systems by 2025, demonstrates rapid progress but also the continued gap between agents and skilled developers on complex, open-ended tasks.
@@ -142,56 +129,52 @@ SWE-bench \cite{jimenez2024swebench} provides the most direct empirical evidence
 Masterman et al.\ \cite{masterman2024landscape} identify the planning-execution-reflection loop as the most consequential architectural decision in agentic system design. Systems that omit reflection are brittle; those incorporating structured self-critique (e.g., chain-of-thought self-evaluation) are more robust but incur higher token costs and latency. The survey also finds that multi-agent systems benefit from explicit leadership structures: designating an orchestrator agent reduces redundant computation and prevents conflicting execution states.
-Chen et al.\ \cite{chen2025agentic} take a programming-paradigm view, defining \emph{agentic programming} as a methodology in which the LLM agent acts as both programmer and executor: writing code, running it, observing output, and revising iteratively. This loop resembles test-driven development, and the authors argue that existing software engineering practices---continuous integration, version control, code review---can be adapted to constrain and validate agentic execution.
+Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} take a programming-paradigm view, defining \emph{agentic programming} as a methodology in which the LLM agent acts as both programmer and executor: writing code, running it, observing output, and revising iteratively. This loop resembles test-driven development, and the authors argue that existing software engineering practices---continuous integration, version control, code review---can be adapted to constrain and validate agentic execution.
 Park et al.\ \cite{park2023generative} provide a foundational empirical study of long-horizon agent behaviour. Their 25-agent simulation demonstrates that combining three mechanisms---\emph{memory stream}, \emph{reflection}, and \emph{planning}---produces coherent, believable autonomous behaviour. The work is significant because it validates the three-component architecture at a fidelity not previously demonstrated.
 % -------------------------------------------------------
 \section{Critical Analysis}
 \subsection{Advancements}
-The literature represents a substantial advance over the state of AI-assisted software engineering five years ago. The conceptual vocabulary has matured: terms such as \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} now carry reasonably consistent definitions \cite{schmidgall2024agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been formalised to the point where they can be instantiated in open-source frameworks and evaluated against reproducible benchmarks \cite{sun2025frameworks, jimenez2024swebench}. Performance on software engineering tasks improved rapidly: SWE-bench resolution rates climbed from under 2\% in 2023 to over 50\% by 2025.
+The literature represents a substantial advance over the state of AI-assisted software engineering five years ago. The conceptual vocabulary has matured: terms such as \emph{tool augmentation}, \emph{reflection}, and \emph{multi-agent coordination} now carry reasonably consistent definitions \cite{abuali2025agentic, wang2024survey, masterman2024landscape}. Architectural patterns have been formalised to the point where they can be instantiated in open-source frameworks and evaluated against reproducible benchmarks \cite{derouiche2025frameworks, jimenez2024swebench}. Performance on software engineering tasks has improved rapidly: SWE-bench resolution rates climbed from under 2\% in 2023 to over 50\% by 2025.
 \subsection{Challenges and Limitations}
-\textbf{Reliability and hallucination.} Neural agents inherit the hallucination problem of their underlying LLMs \cite{schmidgall2024agentic, liu2024llmagents}. Unlike a standalone LLM response, an agentic system may execute a hallucinated plan across dozens of tool calls before the error becomes apparent, causing compounding damage that is difficult to reverse.
+\textbf{Reliability and hallucination.} Neural agents inherit the hallucination problem of their underlying LLMs \cite{abuali2025agentic, jin2024llmagents}. Unlike a standalone LLM response, an agentic system may execute a hallucinated plan across dozens of tool calls before the error becomes apparent, causing compounding damage that is difficult to reverse.
-\textbf{Evaluation fragmentation.} Both Jin et al.\ \cite{liu2024llmagents} and Liu et al.\ \cite{yang2024llmse} note the lack of unified evaluation standards. SWE-bench \cite{jimenez2024swebench} addresses this for patch generation, but no comparable benchmark exists for requirements engineering, architecture design, or system-level testing.
+\textbf{Evaluation fragmentation.} Both Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} note the lack of unified evaluation standards. SWE-bench \cite{jimenez2024swebench} addresses this for patch generation, but no comparable benchmark exists for requirements engineering, architecture design, or system-level testing.
 \textbf{Coordination scalability.} The auction and consensus mechanisms in \cite{ieee2025multiagent} and the architectural guidelines in SALLMA \cite{sallma2025} address multi-agent coordination at small-to-medium scales. How these approaches perform with dozens or hundreds of concurrent agents remains largely unexplored.
-\textbf{Context window limits.} The finite context window of current LLMs constrains project-level state \cite{yang2024llmse, chen2025agentic}. External memory mitigates this but introduces retrieval accuracy degradation as the knowledge base grows.
+\textbf{Context window limits.} The finite context window of current LLMs constrains project-level state \cite{liu2024llmse, wang2025aiagenticprogrammingsurvey}. External memory mitigates this but introduces retrieval accuracy degradation as the knowledge base grows.
-\textbf{Security and governance.} Schmidgall and Dornaika \cite{schmidgall2024agentic} identify governance deficits as one of the most critical research gaps. An agent with access to a file system, compiler, and network interface represents a significant attack surface; prompt injection attacks have been demonstrated in practice but are not addressed by any of the surveyed architectural designs.
+\textbf{Security and governance.} Abou Ali and Dornaika \cite{abuali2025agentic} identify governance deficits as one of the most critical research gaps. An agent with access to a file system, compiler, and network interface represents a significant attack surface; prompt injection attacks have been demonstrated in practice but are not addressed by any of the surveyed architectural designs.
 \subsection{Comparing Approaches}
 A notable disagreement concerns the relative merits of single-agent versus multi-agent designs. Masterman et al.\ \cite{masterman2024landscape} find that single-agent systems with strong reflection are competitive with multi-agent systems on many benchmarks while being simpler to debug. He et al.\ \cite{ishibashi2024multiagent} and Rajendran et al.\ \cite{ieee2025multiagent} argue that specialisation in multi-agent systems produces qualitatively better results for complex, long-horizon tasks. The discrepancy is partly methodological: papers advocating multi-agent systems tend to evaluate on more complex tasks. A unified benchmark spanning a range of task complexities would help resolve this debate.
 % -------------------------------------------------------
 \section{Future Directions}
-\textbf{Hybrid neuro-symbolic architectures.} Schmidgall and Dornaika \cite{schmidgall2024agentic} explicitly call for hybrid designs that combine the flexibility of neural agents with the determinism and verifiability of symbolic planners. A symbolic planner could verify the safety of a neural agent's proposed plan before execution, providing formal guarantees currently absent from purely neural systems.
+\textbf{Hybrid neuro-symbolic architectures.} Abou Ali and Dornaika \cite{abuali2025agentic} explicitly call for hybrid designs that combine the flexibility of neural agents with the determinism and verifiability of symbolic planners. A symbolic planner could verify the safety of a neural agent's proposed plan before execution, providing formal guarantees currently absent from purely neural systems.
-\textbf{Standardised evaluation frameworks.} The evaluation gap identified by Jin et al.\ \cite{liu2024llmagents} and Liu et al.\ \cite{yang2024llmse} needs benchmarks spanning the full development lifecycle---not just code generation. Future work should develop equivalents to SWE-bench \cite{jimenez2024swebench} for requirements elicitation, high-level design, and system integration testing.
+\textbf{Standardised evaluation frameworks.} The evaluation gap identified by Jin et al.\ \cite{jin2024llmagents} and Liu et al.\ \cite{liu2024llmse} calls for benchmarks spanning the full development lifecycle, not just code generation. Future work should develop equivalents to SWE-bench \cite{jimenez2024swebench} for requirements elicitation, high-level design, and system integration testing.
 \textbf{Long-horizon autonomy and persistent memory.} Park et al.\ \cite{park2023generative} demonstrate the potential of persistent memory and reflection, but their simulation is far simpler than a real software project. Future research should investigate how memory mechanisms scale when agents must track thousands of source files and evolving requirements over months-long cycles. Techniques from continual learning appear particularly relevant.
-\textbf{Security and trust.} The governance gaps flagged by Schmidgall and Dornaika \cite{schmidgall2024agentic} indicate that security engineering for agentic systems is largely open. Formal threat models, sandboxing mechanisms, and audit-log designs that allow operators to verify agent behaviour after the fact are all needed.
+\textbf{Security and trust.} The governance gaps flagged by Abou Ali and Dornaika \cite{abuali2025agentic} indicate that security engineering for agentic systems is largely open. Formal threat models, sandboxing mechanisms, and audit-log designs that allow operators to verify agent behaviour after the fact are all needed.
-\textbf{Human-agent collaboration models.} He et al.\ \cite{ishibashi2024multiagent} and Chen et al.\ \cite{chen2025agentic} suggest that the most productive near-term model is collaborative: humans and agents share responsibility across the lifecycle. Designing effective interaction protocols---when an agent should ask for clarification, how human corrections propagate through a plan, and how to represent agent uncertainty to non-expert stakeholders---remains an open problem.
+\textbf{Human-agent collaboration models.} He et al.\ \cite{ishibashi2024multiagent} and Wang et al.\ \cite{wang2025aiagenticprogrammingsurvey} suggest that the most productive near-term model is collaborative: humans and agents share responsibility across the lifecycle. Designing effective interaction protocols---when an agent should ask for clarification, how human corrections propagate through a plan, and how to represent agent uncertainty to non-expert stakeholders---remains an open problem.
 % -------------------------------------------------------
 \section{Conclusion}
-This survey has reviewed 13 papers published between 2023 and 2026 on the design of software systems incorporating agentic AI. The reviewed literature demonstrates that agentic AI has moved from a theoretical concept to a practical engineering challenge: open-source frameworks \cite{sun2025frameworks} are in active deployment, benchmarks \cite{jimenez2024swebench} provide reproducible measures of progress, and architectural patterns for memory, planning, and multi-agent coordination have been formalised sufficiently for critical comparison.
+This survey has reviewed 13 papers published between 2023 and 2026 on the design of software systems incorporating agentic AI. The reviewed literature demonstrates that agentic AI has moved from a theoretical concept to a practical engineering challenge: open-source frameworks \cite{derouiche2025frameworks} are in active deployment, benchmarks \cite{jimenez2024swebench} provide reproducible measures of progress, and architectural patterns for memory, planning, and multi-agent coordination have been formalised sufficiently for critical comparison.
 At the same time, the survey reveals that the field is far from maturity. Hallucination and unreliable planning constrain the autonomy that can be safely delegated. Evaluation standards remain fragmented. Governance and security frameworks are essentially absent from proposed architectural designs. Finally, the long-horizon, project-scale autonomy that would represent a genuine transformation of software practice has not yet been convincingly demonstrated.
 The implications for software system design are clear: practitioners adopting agentic AI today must design for human oversight, invest in robust evaluation infrastructure, and treat the agent as an architectural component subject to the same quality attributes---reliability, security, maintainability---as any other system component \cite{sallma2025}. Researchers, meanwhile, have a rich agenda whose resolution will determine how quickly the field moves from promising demonstrations to dependable practice.
 % -------------------------------------------------------
 \bibliographystyle{ACM-Reference-Format}
 \bibliography{references}
@@ -1,70 +1,78 @@
 % references.bib — JC3506 Individual Study
 % Topic: Software System Design with Agentic AI
 % Cite in text with \cite{key}
 %
 % 13 primary papers organised by theme:
 %   Theme 1 — Foundations & Architectures        (4 papers)
 %   Theme 2 — Multi-Agent Systems & Frameworks   (3 papers)
 %   Theme 3 — Software Engineering Applications  (3 papers)
 %   Theme 4 — Planning, Reasoning & Tool Use     (3 papers)
 % -------------------------------------------------------
 % THEME 1: Foundations & Architectures of Agentic AI
 % -------------------------------------------------------
-% Comprehensive 2024 survey — good opening citation for the introduction
+% Comprehensive 2025 survey — dual-paradigm framework (symbolic vs neural)
-@misc{schmidgall2024agentic,
+@article{abuali2025agentic,
-  author        = {Schmidgall, Samuel and others},
+	title={Agentic AI: a comprehensive survey of architectures, applications, and future directions},
-  title         = {Agentic AI: A Comprehensive Survey of Architectures, Applications, and Future Directions},
+	volume={59},
-  year          = {2024},
+	ISSN={1573-7462},
-  eprint        = {2510.25445},
+	url={http://dx.doi.org/10.1007/s10462-025-11422-4},
-  archivePrefix = {arXiv},
+	DOI={10.1007/s10462-025-11422-4},
-  primaryClass  = {cs.AI}
+	number={1},
 	journal={Artificial Intelligence Review},
 	publisher={Springer Science and Business Media LLC},
 	author={Abou Ali, Mohamad and Dornaika, Fadi and Charafeddine, Jinan},
 	year={2025},
 	month=Nov 
 }
 % Widely cited foundational survey on LLM-based autonomous agents
@article{wang2024survey,
-  author  = {Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Ji-Rong},
+	title={A survey on large language model based autonomous agents},
-  title   = {A Survey on Large Language Model based Autonomous Agents},
+	volume={18},
-  journal = {Frontiers of Computer Science},
+	ISSN={2095-2236},
-  volume  = {18},
+	url={http://dx.doi.org/10.1007/s11704-024-40231-1},
-  number  = {6},
+	DOI={10.1007/s11704-024-40231-1},
-  pages   = {186345},
+	number={6},
-  year    = {2024},
+	journal={Frontiers of Computer Science},
-  doi     = {10.1007/s11704-024-40231-1}
+	publisher={Springer Science and Business Media LLC},
 	author={Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Ji-Rong},
 	year={2024},
 	month=Mar 
 }
-% Taxonomy of agent architectures: Perception, Brain, Planning, Action, Tools
+% Taxonomy: Perception, Brain, Planning, Action, Tools; evaluation framework
-@misc{sun2026architectures,
+@misc{arunkumar2026architectures,
-  author        = {Sun, Yifan and others},
+	title={Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents}, 
-  title         = {Agentic Artificial Intelligence: Architectures, Taxonomies, and Evaluation of Large Language Model Agents},
+	author={Arunkumar V and Gangadharan G. R. and Rajkumar Buyya},
-  year          = {2026},
+	year={2026},
-  eprint        = {2601.12560},
+	eprint={2601.12560},
-  archivePrefix = {arXiv},
+	archivePrefix={arXiv},
-  primaryClass  = {cs.AI}
+	primaryClass={cs.AI},
 	url={https://arxiv.org/abs/2601.12560}, 
 }
-% Covers CrewAI, LangGraph, AutoGen, MetaGPT framework comparison
+% Systematic review of CrewAI, LangGraph, AutoGen, MetaGPT
-@misc{sun2025frameworks,
+@misc{derouiche2025frameworks,
-  author        = {Sun, Yifan and others},
+	title={Agentic AI Frameworks: Architectures, Protocols, and Design Challenges}, 
-  title         = {Agentic AI Frameworks: Architectures, Protocols, and Design Challenges},
+	author={Hana Derouiche and Zaki Brahmi and Haithem Mazeni},
-  year          = {2025},
+	year={2025},
-  eprint        = {2508.10146},
+	eprint={2508.10146},
-  archivePrefix = {arXiv},
+	archivePrefix={arXiv},
-  primaryClass  = {cs.MA}
+	primaryClass={cs.AI},
 	url={https://arxiv.org/abs/2508.10146}, 
 }
 % -------------------------------------------------------
 % THEME 2: Multi-Agent Systems & Coordination
 % -------------------------------------------------------
 % ACM TOSEM — literature review on LLM multi-agent SE systems (peer-reviewed journal)
@article{ishibashi2024multiagent,
-  author  = {Ishibashi, Yoichi and Nishimura, Yoshimasa},
+author = {He, Junda and Treude, Christoph and Lo, David},
-  title   = {{LLM}-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead},
+title = {LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead},
-  journal = {ACM Transactions on Software Engineering and Methodology},
+year = {2025},
-  year    = {2024},
+issue_date = {June 2025},
-  doi     = {10.1145/3712003}
+publisher = {Association for Computing Machinery},
 address = {New York, NY, USA},
 volume = {34},
 number = {5},
 issn = {1049-331X},
 url = {https://doi.org/10.1145/3712003},
 doi = {10.1145/3712003},
 abstract = {Integrating Large Language Models (LLMs) into autonomous agents marks a significant shift in the research landscape by offering cognitive abilities that are competitive with human planning and reasoning. This article explores the transformative potential of integrating Large Language Models into Multi-Agent (LMA) systems for addressing complex challenges in software engineering (SE). By leveraging the collaborative and specialized abilities of multiple agents, LMA systems enable autonomous problem-solving, improve robustness, and provide scalable solutions for managing the complexity of real-world software projects. In this article, we conduct a systematic review of recent primary studies to map the current landscape of LMA applications across various stages of the software development lifecycle (SDLC). To illustrate current capabilities and limitations, we perform two case studies to demonstrate the effectiveness of state-of-the-art LMA frameworks. Additionally, we identify critical research gaps and propose a comprehensive research agenda focused on enhancing individual agent capabilities and optimizing agent synergy. Our work outlines a forward-looking vision for developing fully autonomous, scalable, and trustworthy LMA systems, laying the foundation for the evolution of Software Engineering 2.0.},
 journal = {ACM Trans. Softw. Eng. Methodol.},
 month = may,
 articleno = {124},
 numpages = {30},
 keywords = {Large Language Models, Autonomous Agents, Multi-Agent Systems, Software Engineering}
 }
 % IEEE conference — multi-agent LLM environment for software design and refactoring
@@ -77,10 +85,12 @@
  number={},
  pages={488-493},
  keywords={Software design;Codes;Large language models;Scalability;Software quality;Software systems;Security;Optimization;Software engineering;Software development management;Multi-agent systems;Large Language Models;Software refactoring;Agent specialization;Consensus protocols;Auction mechanisms;Code quality},
-  doi={10.1109/SoutheastCon56624.2025.10971563}
+  doi={10.1109/SoutheastCon56624.2025.10971563}  
 }
-% IEEE conference — software architecture for LLM-based multi-agent systems (SALLMA)
+
 % IEEE/ACM workshop — reference software architecture for LLM-based multi-agent systems
@INPROCEEDINGS{sallma2025,
  author={Becattini, Marco and Verdecchia, Roberto and Vicario, Enrico},
  booktitle={2025 IEEE/ACM International Workshop New Trends in Software Architecture (SATrends)}, 
@@ -90,73 +100,83 @@
  number={},
  pages={5-8},
  keywords={Structured Query Language;Software architecture;NoSQL databases;Pressing;Market research;Software;Real-time systems;Faces;Multi-agent systems;Python;software architecture;se4ai;llm},
-  doi={10.1109/SATrends66715.2025.00006}
+  doi={10.1109/SATrends66715.2025.00006}
 }
 % -------------------------------------------------------
 % THEME 3: Software Engineering Applications
 % -------------------------------------------------------
 % Survey of LLM agents across SE tasks: requirements, code gen, design, testing, maintenance
-@misc{liu2024llmagents,
+@misc{jin2024llmagents,
-  author        = {Liu, Junwei and others},
+      title={From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future}, 
-  title         = {From {LLMs} to {LLM}-based Agents for Software Engineering: A Survey of Current, Challenges and Future},
+      author={Haolin Jin and Linghan Huang and Haipeng Cai and Jun Yan and Bo Li and Huaming Chen},
-  year          = {2024},
+      year={2025},
-  eprint        = {2408.02479},
+      eprint={2408.02479},
-  archivePrefix = {arXiv},
+      archivePrefix={arXiv},
-  primaryClass  = {cs.SE}
+      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2408.02479}, 
 }
-% 124-paper survey from both SE and agent perspectives
+% 124-paper survey from both SE and agent perspectives (accepted at ACM TOSEM)
-@misc{yang2024llmse,
+@misc{liu2024llmse,
-  author        = {Yang, Junwei and others},
+      title={Large Language Model-Based Agents for Software Engineering: A Survey}, 
-  title         = {Large Language Model-Based Agents for Software Engineering: A Survey},
+      author={Junwei Liu and Kaixin Wang and Yixuan Chen and Xin Peng and Zhenpeng Chen and Lingming Zhang and Yiling Lou},
-  year          = {2024},
+      year={2025},
-  eprint        = {2409.02977},
+      eprint={2409.02977},
-  archivePrefix = {arXiv},
+      archivePrefix={arXiv},
-  primaryClass  = {cs.SE}
+      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2409.02977}, 
 }
-% SWE-bench — seminal benchmark for evaluating agents on real GitHub issues
+% SWE-bench — benchmark for evaluating agents on real GitHub issues (ICLR 2024)
@misc{jimenez2024swebench,
-  author        = {Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik},
+      title={SWE-bench: Can Language Models Resolve Real-World GitHub Issues?}, 
-  title         = {{SWE}-bench: Can Language Models Resolve Real-World {GitHub} Issues?},
+      author={Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik Narasimhan},
-  year          = {2024},
+      year={2024},
-  eprint        = {2310.06770},
+      eprint={2310.06770},
-  archivePrefix = {arXiv},
+      archivePrefix={arXiv},
-  primaryClass  = {cs.SE}
+      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.06770}, 
 }
 % -------------------------------------------------------
 % THEME 4: Planning, Reasoning & Tool Use
 % -------------------------------------------------------
 % Surveys reasoning, planning, tool-calling patterns across agent architectures
@misc{masterman2024landscape,
-  author        = {Masterman, Tula and Besen, Sandi and Sawtell, Mason and Chao, Alex},
+      title={The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey}, 
-  title         = {The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey},
+      author={Tula Masterman and Sandi Besen and Mason Sawtell and Alex Chao},
-  year          = {2024},
+      year={2024},
-  eprint        = {2404.11584},
+      eprint={2404.11584},
-  archivePrefix = {arXiv},
+      archivePrefix={arXiv},
-  primaryClass  = {cs.AI}
+      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2404.11584}, 
 }
 % Generative agents — foundational simulation of autonomous agent behaviour (UIST 2023)
@inproceedings{park2023generative,
-  author    = {Park, Joon Sung and O'Brien, Joseph C. and Cai, Carrie J. and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S.},
+author = {Park, Joon Sung and O'Brien, Joseph and Cai, Carrie Jun and Morris, Meredith Ringel and Liang, Percy and Bernstein, Michael S.},
-  title     = {Generative Agents: Interactive Simulacra of Human Behavior},
+title = {Generative Agents: Interactive Simulacra of Human Behavior},
-  booktitle = {Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23)},
+year = {2023},
-  year      = {2023},
+isbn = {9798400701320},
-  doi       = {10.1145/3586183.3606763}
+publisher = {Association for Computing Machinery},
 address = {New York, NY, USA},
 url = {https://doi.org/10.1145/3586183.3606763},
 doi = {10.1145/3586183.3606763},
 abstract = {Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents: computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent’s experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty-five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors. For example, starting with only a single user-specified notion that one agent wants to throw a Valentine’s Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture—observation, planning, and reflection—each contribute critically to the believability of agent behavior. By fusing large language models with computational interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.},
 booktitle = {Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology},
 articleno = {2},
 numpages = {22},
 keywords = {Human-AI interaction, agents, generative AI, large language models},
 location = {San Francisco, CA, USA},
 series = {UIST '23}
 }
 % AI agentic programming: planning, memory, tool integration, execution monitoring
-@misc{chen2025agentic,
+@misc{wang2025aiagenticprogrammingsurvey,
-  author        = {Chen, Jiannan and others},
+	title={AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities}, 
-  title         = {AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities},
+	author={Huanting Wang and Jingzhi Gong and Huawei Zhang and Jie Xu and Zheng Wang},
-  year          = {2025},
+	year={2025},
-  eprint        = {2508.11126},
+	eprint={2508.11126},
-  archivePrefix = {arXiv},
+	archivePrefix={arXiv},
-  primaryClass  = {cs.SE}
+	primaryClass={cs.SE},
-}
+	url={https://arxiv.org/abs/2508.11126}, 
 }
Author	SHA1	Message	Date
csf123321	12d8fd5de2	remove unnecessary comments from main.tex and references.bib Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 20:00:31 +08:00
csf123321	6252756893	fix bib	2026-05-10 17:14:18 +08:00