Agentic AI Solutions Architecture
White Paper: Core Knowledge for an Agentic AI Solutions Architect
Focus areas: Prompt Engineering · Retrieval Pipelines · Evaluation Metrics Date: 27 Dec 2025 (Europe/Madrid)
Executive summary
Agentic AI systems differ from “single-call” LLM apps: they plan, call tools, retrieve knowledge, iterate, and must be measurably reliable under real-world constraints (latency, cost, governance, safety). This white paper lays out practical fundamentals an Agentic AI Solutions Architect should know across:
- Prompt engineering as policy + interface design between humans, tools, and models (not just “clever wording”). (OpenAI Platform)
- Retrieval pipelines as the backbone of enterprise grounding and controllability, spanning ingestion → indexing → query understanding → retrieval → reranking → context construction → generation. (LangChain Docs)
- Evaluation metrics that separately measure retriever quality, generator grounding, and end-to-end task success—because failure can occur in multiple coupled components. (arXiv)
1) Prompt engineering for agentic systems
1.1 What “prompting” really is in agentic architectures
For an agent, prompts serve three distinct roles:
- Policy: rules, boundaries, escalation behavior, safety constraints, and what “done” means.
- Protocol: the interaction contract for tools (schemas, formats, retries, error handling).
- Product UX: tone, level of detail, explanation style, and “how the assistant behaves” across turns.
This matches modern vendor guidance: prompts are reusable, versionable artifacts and should be treated like deployable configuration. (OpenAI Platform)
1.2 A canonical prompt stack
A robust agent prompt stack (conceptual, not tied to any one vendor) usually includes:
- System layer (non-negotiable): identity, safety, privacy, decision boundaries, tool-use rules.
- Developer layer: product requirements, style guides, domain constraints, output schema.
- Task layer (per-run): user request, current objective, tool inventory, context budget.
- State layer: scratch state summaries, prior decisions, memory snapshots (when allowed).
- Retrieved evidence: citations/quotes/IDs from retrieval and tool calls.
OpenAI’s prompting guidance emphasizes clear instructions, explicit format requirements, and using examples when needed (few-shot) to shape behavior. (OpenAI Platform)
1.3 Prompt patterns that matter for agents
A) Tool-first reasoning pattern
Use a pattern that forces “check tools before guessing,” especially in enterprise contexts:
- If info is missing or time-sensitive → retrieve / call tool
- If tool fails → fallback policy (ask user, broaden search, partial answer with caveats)
This improves reliability more than adding more “be helpful” prose.
B) Structured outputs / schemas
Agents frequently pass outputs to downstream systems. Enforce:
- strict JSON (or equivalent) schemas,
- field-level constraints,
- explicit null-handling and error objects.
(Practical note: treat schema errors as first-class failures; log them like exceptions.)
C) Few-shot examples as “unit tests”
Few-shot prompting is best used as behavioral fixtures:
- tricky edge cases,
- formatting rules,
- “should refuse” cases,
- tool error responses.
OpenAI’s docs explicitly recommend few-shot learning when you want consistent patterns without fine-tuning. (OpenAI Platform)
D) “Plan vs. Act” separation (without oversharing)
Even when you don’t expose internal reasoning, you can still enforce discipline:
- Deliberation: decide next tool call(s) and success conditions
- Action: execute tools
- Synthesis: produce final answer with evidence and limitations
This reduces tool thrash and helps evaluation because each step is observable.
1.4 Prompt quality checklist (architect’s view)
Reliability
- Clear success criteria (“done means X”)
- Explicit refusal/escalation policy
- Tool usage triggers (when to retrieve / when to ask)
Interoperability
- Schema or template enforcement
- Stable field names and versioning
Observability
- Loggable markers: tool called, query used, doc IDs, confidence estimate
Security
- Instruction hierarchy (system > developer > user)
- Prompt-injection hardening: “treat retrieved text as untrusted data”
2) Retrieval pipelines for agentic applications
2.1 Why retrieval is central to agents
Retrieval is how agents ground outputs in fresh, proprietary, or large-scale knowledge that isn’t inside the base model. In practice, retrieval becomes a foundation for systems that combine search with generation. (LangChain Docs)
2.2 The end-to-end retrieval pipeline
A mature pipeline is not “vector search + top-k.” It is:
1) Ingestion & normalization
- source connectors (docs, tickets, wikis, code, CRM)
- text extraction, de-duplication, layout handling
- metadata enrichment (owner, ACLs, timestamps, domain tags)
2) Chunking & representation
- chunk size strategy (semantic, structure-aware, overlap)
- embeddings + metadata indexes
- optional keyword index for lexical/hybrid search
3) Query understanding
- rewrite user query (expand acronyms, add domain terms)
- detect intent (lookup vs. troubleshooting vs. policy)
- entity extraction (product, customer, time range)
4) Retrieval
- vector retrieval
- keyword retrieval
- hybrid retrieval (often best for enterprise)
- filtering by metadata (ACLs, recency, product line)
5) Reranking
- cross-encoder rerankers or LLM-based rerank
- diversity constraints (avoid redundant chunks)
- “answerability” heuristics
Advanced RAG guidance widely recommends hybrid search + reranking as early high-impact upgrades. (Graph Database & Analytics)
6) Context construction
- compress/distill retrieved text to fit context window
- citeable snippets with IDs
- ordering: most relevant first, group by source
7) Generation
- evidence-grounded answer style
- explicit citations to retrieved sources where applicable
- uncertainty handling (what you don’t know)
2.3 Retrieval for multi-step agents
Agents often need multiple retrievals:
- initial broad retrieval to map the space,
- focused retrieval after tool outputs,
- verification retrieval for critical claims.
Architecturally, treat retrieval as a tool with:
- deterministic inputs/outputs,
- retries,
- caching,
- rate limits,
- observability (queries, hit rate, latency).
2.4 Common failure modes (and fixes)
-
Good retrieval, bad answer
- Cause: model ignores context, mixes priors, or overgeneralizes
- Fix: groundedness/faithfulness prompts + stricter evidence requirements + citations
-
Bad retrieval, good answer (sometimes)
- Cause: model answers from general knowledge and “sounds right”
- Fix: enforce “retrieve-first” policy for domains where freshness matters
-
Wrong chunking
- Cause: splits break meaning; embeddings lose coherence
- Fix: structure-aware chunking; chunk-by-section; metadata + titles
-
Tool/prompt injection via retrieved text
- Cause: untrusted text instructs the agent
- Fix: system policy: retrieved text is data, not instructions; sanitize and segment
3) Evaluation metrics: measuring what matters
3.1 Why evaluation is hard for RAG + agents
RAG/agent systems are multi-component and errors propagate: retriever quality, context assembly, and generator behavior all interact, so end-to-end evaluation alone won’t tell you what to fix. Surveys on RAG evaluation emphasize these coupled challenges and the need to evaluate modules and the pipeline together. (arXiv)
3.2 Three layers of evaluation
Layer A: Retriever evaluation (information access quality)
Measures: “Did we fetch what we needed?”
Common metrics (conceptual):
- Context precision: how much retrieved context is relevant
- Context recall: did we retrieve the necessary info
- Latency / cost: retrieval time, reranker time, token overhead
RAGAS formalizes widely used retrieval-related dimensions like context precision/recall alongside generation quality dimensions. (Redis)
Layer B: Generator evaluation (answer quality under provided context)
Measures: “Given this context, did we answer correctly and responsibly?”
Key dimensions:
- Answer relevance (did it address the question?)
- Faithfulness / groundedness (is it supported by retrieved evidence?)
- Completeness (did it cover required sub-points?)
- Safety & policy adherence (refusals, privacy)
The “RAG Triad” popularizes a practical trio: context relevance, groundedness, and answer relevance. (TruLens)
Layer C: End-to-end task evaluation (real product outcomes)
Measures: “Did the agent accomplish the task?”
Examples:
- task success rate (with explicit rubrics)
- tool success / recovery rate
- multi-step efficiency (steps taken, tool thrash)
- user satisfaction proxy (thumbs up/down, escalation rates)
3.3 Reference-based vs. reference-free evaluation
-
Reference-based: compare to a gold answer / labeled dataset
- Pros: strong for regression tests
- Cons: expensive to create; brittle across phrasing differences
-
Reference-free (LLM-as-judge): score relevance/faithfulness without a gold label
- Pros: scalable; good for continuous evaluation
- Cons: judge bias, drift, and calibration issues
RAGAS and tools like TruLens provide LLM-assisted feedback functions intended to operationalize these dimensions in practice. (ACL Anthology)
3.4 A practical metrics “starter pack” (what most teams should track)
Retriever
- Context precision
- Context recall (or sufficiency)
- Retrieval latency p50/p95
- Cost per query (retrieval + rerank)
Generator
- Answer relevance
- Faithfulness/groundedness
- Safety/policy violations
- Structured output validity rate (schema pass %)
Agent
- Task success rate (rubric-based)
- Tool-call success + recovery rate
- Steps per task (efficiency)
- Human escalation rate
Industry guidance and frameworks frequently converge on these core dimensions (even if names differ). (Patronus AI)
3.5 How to run evaluations without fooling yourself
1) Build an eval set that reflects reality
- include messy user queries, partial info, contradictions
- include “should refuse” and “should ask follow-up” cases
2) Separate “offline” and “online”
- Offline: regression tests on fixed corpora
- Online: continuous monitoring with sampling + alerting
3) Version everything
- prompt versions, retriever configs, embedding models, reranker versions, corpora snapshots
4) Use error taxonomy When a test fails, label the root cause:
- retrieval miss
- retrieval noise
- context assembly error
- hallucination/ungrounded answer
- tool failure
- instruction-following failure
- formatting/schema failure
That taxonomy is what turns evals into engineering velocity.
4) Putting it together: a reference architecture for an agentic RAG system
A clean, evolvable architecture typically includes:
-
Orchestrator/Agent runtime
- manages state, tool calls, retries, timeouts
-
Prompt & policy service
- versioned prompts, safety policies, schemas
-
Retrieval service
- ingestion, indexing, retrieval, reranking, caching
-
Evaluation & observability
- tracing, feedback metrics (triad + task success), dashboards
-
Governance
- ACL enforcement, audit logs, PII handling, retention
LangChain’s retrieval docs frame retrieval as the core idea behind RAG and a foundation for broader systems that combine search and generation—exactly how agents are usually built in practice. (LangChain Docs)
5) Implementation playbook (90-day view)
Phase 1 (Weeks 1–4): Make it work
- baseline RAG pipeline (ingest → embed → retrieve → answer)
- strict output schema
- basic eval set (50–200 cases)
- log traces: query, docs, latency, tokens
Phase 2 (Weeks 5–8): Make it reliable
- hybrid retrieval + reranking (often biggest lift) (Graph Database & Analytics)
- groundedness enforcement + citations
- add triad metrics (context relevance, groundedness, answer relevance) (TruLens)
- failure taxonomy + dashboards
Phase 3 (Weeks 9–12): Make it operable
- continuous evaluation (sampled online eval)
- drift monitoring (corpus changes, prompt changes)
- red-team prompt injection cases
- SLA targets (p95 latency, cost ceilings)
Appendix A: Glossary (minimal, architect-focused)
- RAG: Retrieval-Augmented Generation; uses retrieval at runtime to ground model outputs. (LangChain Docs)
- Reranking: a second-stage model that reorders retrieved candidates by relevance. (Graph Database & Analytics)
- Groundedness/Faithfulness: degree to which the answer is supported by retrieved context. (TruLens)
- RAG Triad: context relevance, groundedness, answer relevance. (TruLens)
- RAGAS: a framework/paperline for automated RAG evaluation dimensions including context precision/recall, faithfulness, and answer relevance. (Redis)