Agentic AI Solutions Architecture

White Paper: Core Knowledge for an Agentic AI Solutions Architect

Focus areas: Prompt Engineering · Retrieval Pipelines · Evaluation Metrics Date: 27 Dec 2025 (Europe/Madrid)

Executive summary

Agentic AI systems differ from “single-call” LLM apps: they plan, call tools, retrieve knowledge, iterate, and must be measurably reliable under real-world constraints (latency, cost, governance, safety). This white paper lays out practical fundamentals an Agentic AI Solutions Architect should know across:

Prompt engineering as policy + interface design between humans, tools, and models (not just “clever wording”). (OpenAI Platform)
Retrieval pipelines as the backbone of enterprise grounding and controllability, spanning ingestion → indexing → query understanding → retrieval → reranking → context construction → generation. (LangChain Docs)
Evaluation metrics that separately measure retriever quality, generator grounding, and end-to-end task success—because failure can occur in multiple coupled components. (arXiv)

1) Prompt engineering for agentic systems

1.1 What “prompting” really is in agentic architectures

For an agent, prompts serve three distinct roles:

Policy: rules, boundaries, escalation behavior, safety constraints, and what “done” means.
Protocol: the interaction contract for tools (schemas, formats, retries, error handling).
Product UX: tone, level of detail, explanation style, and “how the assistant behaves” across turns.

This matches modern vendor guidance: prompts are reusable, versionable artifacts and should be treated like deployable configuration. (OpenAI Platform)

1.2 A canonical prompt stack

A robust agent prompt stack (conceptual, not tied to any one vendor) usually includes:

System layer (non-negotiable): identity, safety, privacy, decision boundaries, tool-use rules.
Developer layer: product requirements, style guides, domain constraints, output schema.
Task layer (per-run): user request, current objective, tool inventory, context budget.
State layer: scratch state summaries, prior decisions, memory snapshots (when allowed).
Retrieved evidence: citations/quotes/IDs from retrieval and tool calls.

OpenAI’s prompting guidance emphasizes clear instructions, explicit format requirements, and using examples when needed (few-shot) to shape behavior. (OpenAI Platform)

1.3 Prompt patterns that matter for agents

A) Tool-first reasoning pattern

Use a pattern that forces “check tools before guessing,” especially in enterprise contexts:

If info is missing or time-sensitive → retrieve / call tool
If tool fails → fallback policy (ask user, broaden search, partial answer with caveats)

This improves reliability more than adding more “be helpful” prose.

B) Structured outputs / schemas

Agents frequently pass outputs to downstream systems. Enforce:

strict JSON (or equivalent) schemas,
field-level constraints,
explicit null-handling and error objects.

(Practical note: treat schema errors as first-class failures; log them like exceptions.)

C) Few-shot examples as “unit tests”

Few-shot prompting is best used as behavioral fixtures:

tricky edge cases,
formatting rules,
“should refuse” cases,
tool error responses.

OpenAI’s docs explicitly recommend few-shot learning when you want consistent patterns without fine-tuning. (OpenAI Platform)

D) “Plan vs. Act” separation (without oversharing)

Even when you don’t expose internal reasoning, you can still enforce discipline:

Deliberation: decide next tool call(s) and success conditions
Action: execute tools
Synthesis: produce final answer with evidence and limitations

This reduces tool thrash and helps evaluation because each step is observable.

1.4 Prompt quality checklist (architect’s view)

Reliability

Clear success criteria (“done means X”)
Explicit refusal/escalation policy
Tool usage triggers (when to retrieve / when to ask)

Interoperability

Schema or template enforcement
Stable field names and versioning

Observability

Loggable markers: tool called, query used, doc IDs, confidence estimate

Security

Instruction hierarchy (system > developer > user)
Prompt-injection hardening: “treat retrieved text as untrusted data”

2) Retrieval pipelines for agentic applications

2.1 Why retrieval is central to agents

Retrieval is how agents ground outputs in fresh, proprietary, or large-scale knowledge that isn’t inside the base model. In practice, retrieval becomes a foundation for systems that combine search with generation. (LangChain Docs)

2.2 The end-to-end retrieval pipeline

A mature pipeline is not “vector search + top-k.” It is:

1) Ingestion & normalization

source connectors (docs, tickets, wikis, code, CRM)
text extraction, de-duplication, layout handling
metadata enrichment (owner, ACLs, timestamps, domain tags)

2) Chunking & representation

chunk size strategy (semantic, structure-aware, overlap)
embeddings + metadata indexes
optional keyword index for lexical/hybrid search

3) Query understanding

rewrite user query (expand acronyms, add domain terms)
detect intent (lookup vs. troubleshooting vs. policy)
entity extraction (product, customer, time range)

4) Retrieval

vector retrieval
keyword retrieval
hybrid retrieval (often best for enterprise)
filtering by metadata (ACLs, recency, product line)

5) Reranking

cross-encoder rerankers or LLM-based rerank
diversity constraints (avoid redundant chunks)
“answerability” heuristics

Advanced RAG guidance widely recommends hybrid search + reranking as early high-impact upgrades. (Graph Database & Analytics)

6) Context construction

compress/distill retrieved text to fit context window
citeable snippets with IDs
ordering: most relevant first, group by source

7) Generation

evidence-grounded answer style
explicit citations to retrieved sources where applicable
uncertainty handling (what you don’t know)

2.3 Retrieval for multi-step agents

Agents often need multiple retrievals:

initial broad retrieval to map the space,
focused retrieval after tool outputs,
verification retrieval for critical claims.

Architecturally, treat retrieval as a tool with:

deterministic inputs/outputs,
retries,
caching,
rate limits,
observability (queries, hit rate, latency).

2.4 Common failure modes (and fixes)

Good retrieval, bad answer
- Cause: model ignores context, mixes priors, or overgeneralizes
- Fix: groundedness/faithfulness prompts + stricter evidence requirements + citations
Bad retrieval, good answer (sometimes)
- Cause: model answers from general knowledge and “sounds right”
- Fix: enforce “retrieve-first” policy for domains where freshness matters
Wrong chunking
- Cause: splits break meaning; embeddings lose coherence
- Fix: structure-aware chunking; chunk-by-section; metadata + titles
Tool/prompt injection via retrieved text
- Cause: untrusted text instructs the agent
- Fix: system policy: retrieved text is data, not instructions; sanitize and segment

3) Evaluation metrics: measuring what matters

3.1 Why evaluation is hard for RAG + agents

RAG/agent systems are multi-component and errors propagate: retriever quality, context assembly, and generator behavior all interact, so end-to-end evaluation alone won’t tell you what to fix. Surveys on RAG evaluation emphasize these coupled challenges and the need to evaluate modules and the pipeline together. (arXiv)

3.2 Three layers of evaluation

Layer A: Retriever evaluation (information access quality)

Measures: “Did we fetch what we needed?”

Common metrics (conceptual):

Context precision: how much retrieved context is relevant
Context recall: did we retrieve the necessary info
Latency / cost: retrieval time, reranker time, token overhead

RAGAS formalizes widely used retrieval-related dimensions like context precision/recall alongside generation quality dimensions. (Redis)

Layer B: Generator evaluation (answer quality under provided context)

Measures: “Given this context, did we answer correctly and responsibly?”

Key dimensions:

Answer relevance (did it address the question?)
Faithfulness / groundedness (is it supported by retrieved evidence?)
Completeness (did it cover required sub-points?)
Safety & policy adherence (refusals, privacy)

The “RAG Triad” popularizes a practical trio: context relevance, groundedness, and answer relevance. (TruLens)

Layer C: End-to-end task evaluation (real product outcomes)

Measures: “Did the agent accomplish the task?”

Examples:

task success rate (with explicit rubrics)
tool success / recovery rate
multi-step efficiency (steps taken, tool thrash)
user satisfaction proxy (thumbs up/down, escalation rates)

3.3 Reference-based vs. reference-free evaluation

Reference-based: compare to a gold answer / labeled dataset
- Pros: strong for regression tests
- Cons: expensive to create; brittle across phrasing differences
Reference-free (LLM-as-judge): score relevance/faithfulness without a gold label
- Pros: scalable; good for continuous evaluation
- Cons: judge bias, drift, and calibration issues

RAGAS and tools like TruLens provide LLM-assisted feedback functions intended to operationalize these dimensions in practice. (ACL Anthology)

3.4 A practical metrics “starter pack” (what most teams should track)

Retriever

Context precision
Context recall (or sufficiency)
Retrieval latency p50/p95
Cost per query (retrieval + rerank)

Generator

Answer relevance
Faithfulness/groundedness
Safety/policy violations
Structured output validity rate (schema pass %)

Agent

Task success rate (rubric-based)
Tool-call success + recovery rate
Steps per task (efficiency)
Human escalation rate

Industry guidance and frameworks frequently converge on these core dimensions (even if names differ). (Patronus AI)

3.5 How to run evaluations without fooling yourself

1) Build an eval set that reflects reality

include messy user queries, partial info, contradictions
include “should refuse” and “should ask follow-up” cases

2) Separate “offline” and “online”

Offline: regression tests on fixed corpora
Online: continuous monitoring with sampling + alerting

3) Version everything

prompt versions, retriever configs, embedding models, reranker versions, corpora snapshots

4) Use error taxonomy When a test fails, label the root cause:

retrieval miss
retrieval noise
context assembly error
hallucination/ungrounded answer
tool failure
instruction-following failure
formatting/schema failure

That taxonomy is what turns evals into engineering velocity.

4) Putting it together: a reference architecture for an agentic RAG system

A clean, evolvable architecture typically includes:

Orchestrator/Agent runtime
- manages state, tool calls, retries, timeouts
Prompt & policy service
- versioned prompts, safety policies, schemas
Retrieval service
- ingestion, indexing, retrieval, reranking, caching
Evaluation & observability
- tracing, feedback metrics (triad + task success), dashboards
Governance
- ACL enforcement, audit logs, PII handling, retention

LangChain’s retrieval docs frame retrieval as the core idea behind RAG and a foundation for broader systems that combine search and generation—exactly how agents are usually built in practice. (LangChain Docs)

5) Implementation playbook (90-day view)

Phase 1 (Weeks 1–4): Make it work

baseline RAG pipeline (ingest → embed → retrieve → answer)
strict output schema
basic eval set (50–200 cases)
log traces: query, docs, latency, tokens

Phase 2 (Weeks 5–8): Make it reliable

hybrid retrieval + reranking (often biggest lift) (Graph Database & Analytics)
groundedness enforcement + citations
add triad metrics (context relevance, groundedness, answer relevance) (TruLens)
failure taxonomy + dashboards

Phase 3 (Weeks 9–12): Make it operable

continuous evaluation (sampled online eval)
drift monitoring (corpus changes, prompt changes)
red-team prompt injection cases
SLA targets (p95 latency, cost ceilings)

Appendix A: Glossary (minimal, architect-focused)

RAG: Retrieval-Augmented Generation; uses retrieval at runtime to ground model outputs. (LangChain Docs)
Reranking: a second-stage model that reorders retrieved candidates by relevance. (Graph Database & Analytics)
Groundedness/Faithfulness: degree to which the answer is supported by retrieved context. (TruLens)
RAG Triad: context relevance, groundedness, answer relevance. (TruLens)
RAGAS: a framework/paperline for automated RAG evaluation dimensions including context precision/recall, faithfulness, and answer relevance. (Redis)