In Part 1, we asked three diagnostic questions. If you worked through them with your team, you now know which context layers are missing from your most important agent, whether context rot is already degrading your production outputs, and whether your inference costs are scaling out of control. Most organisations skip that diagnosis entirely — they go from pilot to optimisation without knowing what they are optimising. If you have the answers, even rough ones, you are ahead of most enterprise AI programmes in Europe.
Now comes the architecture. This post covers the four design decisions that separate Level 2 "structured" from Level 3 "engineered" context management: memory architecture, compression strategy, human oversight design, and multi-agent orchestration. Together, they determine whether your AI agents work reliably in production — or just occasionally.
The First Decision: Memory Architecture
The most consequential architecture decision for an enterprise AI agent is not which model to run. It is how to design the memory system. In most enterprise AI projects, memory is not designed at all — it evolves. Someone configures a document search pipeline. Someone else adds conversation history. A third team adds a user profile lookup. The result is a context window packed with information from three different systems in three different formats, with no logic for what actually belongs there. The predictable outcome: an agent that is accurate for three exchanges and incoherent by exchange fifteen.
Agents need access to three distinct categories of information — each with a different update rate, a different urgency, and a different home:
| Layer | What It Contains | When It Updates |
|---|---|---|
| Live Session | Active conversation, current task state, outputs from tools called in this session | Every exchange |
| User Profile | Preferences, language, contract tier, recent session summaries, open tasks | Per session or daily |
| Knowledge Base | Product data, policy documents, regulatory guidelines, compliance handbooks, FAQs | On change — weekly or less |
The common failure: everything ends up in the Knowledge Base. One document index, loaded en masse for every query. The agent retrieves the five most similar chunks regardless of whether they are relevant to this user, this moment, or this task. The correct architecture is selective by design — for a billing dispute, the Live Session carries the current conversation and the specific invoice data; the User Profile loads the customer's language preference and open ticket history; the Knowledge Base retrieves only the two or three policy sections closest to this dispute type. Each layer's footprint is managed explicitly, not left to grow.
This is not a retrieval problem. It is a design problem. The question is not "how do we find the right information?" It is: "what are the right categories of information, and where does each one live?"
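As a sketch, the three categories can be modelled as layers with explicit token budgets. The budget numbers, the `rough_token_count` helper, and the layer keys below are illustrative, not prescribed by the framework:

```python
from dataclasses import dataclass, field

# Hypothetical per-layer budgets; real values depend on model and workload.
LAYER_BUDGETS = {"live_session": 4000, "user_profile": 800, "knowledge_base": 2400}

def rough_token_count(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)

@dataclass
class ContextLayer:
    name: str
    budget: int
    items: list = field(default_factory=list)

    def add(self, text: str) -> bool:
        """Admit text only while the layer stays under its explicit budget."""
        used = sum(rough_token_count(i) for i in self.items)
        if used + rough_token_count(text) > self.budget:
            return False  # caller must summarise or drop instead of overflowing
        self.items.append(text)
        return True

layers = {name: ContextLayer(name, budget) for name, budget in LAYER_BUDGETS.items()}
layers["user_profile"].add("language=de; tier=enterprise; open_tickets=[4312]")
```

The point of the sketch is the `add` method's refusal: each layer's footprint is enforced at write time, never left to grow by default.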
The Second Decision: Context Compression
Once you have structured memory layers, the next problem surfaces quickly: contexts grow. Long sessions accumulate, overnight autonomous processes compound, and without active management every context reaches the same destination — bloat, slower responses, higher cost, degraded quality. The WSCI framework — Window, Summarise, Compress, Isolate — keeps contexts lean without losing continuity.
Window
Keep only the most recent exchanges in full detail; earlier ones are summarised or dropped. For most enterprise transactional workflows — support tickets, approval chains, standard queries — eight to twelve exchanges is sufficient. The discipline: the window size is explicit and enforced, not left to grow because no one set a limit.
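A minimal sketch of an enforced window, using Python's `deque` with a fixed `maxlen`. The size of ten is one point in the eight-to-twelve range above, not a recommendation:

```python
from collections import deque

WINDOW_SIZE = 10  # explicit and enforced, not left to grow

# Oldest exchanges fall off automatically once the window is full.
window = deque(maxlen=WINDOW_SIZE)

for i in range(15):
    role = "user" if i % 2 == 0 else "agent"
    window.append({"role": role, "text": f"exchange {i}"})

# Exchanges 0-4 have been dropped; in production they would be
# summarised (see the next step) before being evicted.
```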
Summarise
Rather than discarding older exchanges entirely, compress them into structured summaries. A fifteen-exchange billing escalation becomes: "Customer: Q3 invoice dispute, €12,400. Issue: incorrect VAT rate. Resolution: credit note issued. Status: awaiting finance sign-off." Roughly forty words replacing 1,800 tokens. The agent keeps what it needs. The window keeps space for what comes next.
Compress
Strip system data down before it enters the context. A raw SAP S/4HANA response might return 4,000 tokens of XML. Structured extraction — pulling only the fields relevant to the current task — reduces this to 200 tokens. Same information, 95% lower cost. No model change required.
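A sketch of structured extraction, with an illustrative payload standing in for a raw SAP response. The field names and the task-to-fields mapping are hypothetical:

```python
# A verbose system response reduced to only the fields the current task needs.
raw_response = {
    "invoice_id": "INV-2024-8841",
    "amount": 12400.0,
    "currency": "EUR",
    "vat_rate": 0.19,
    "line_items": ["...hundreds of tokens of line-item detail..."],
    "audit_metadata": {"changed_by": "BATCH01", "change_log": "...more tokens..."},
}

# Per-task field whitelist: the compression policy lives in config, not in prompts.
TASK_FIELDS = {"billing_dispute": ["invoice_id", "amount", "currency", "vat_rate"]}

def compress_for_task(payload: dict, task: str) -> dict:
    """Keep only the fields the current task actually needs."""
    return {k: payload[k] for k in TASK_FIELDS[task] if k in payload}

compact = compress_for_task(raw_response, "billing_dispute")
```

Only `compact` enters the context; the full payload stays in the system of record, retrievable if a later turn needs it.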
Isolate
Keep information categories in clearly labelled sections: instructions at the top, retrieved documents next, tool outputs labelled with source and timestamp, conversation history last. This is not cosmetic — it determines where the model places its attention. An instruction buried mid-conversation is followed less reliably than the same instruction at the top, before anything else.
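One way to enforce this ordering is a small assembly function that always emits sections in the same sequence. The section labels are illustrative:

```python
def assemble_prompt(instructions, documents, tool_outputs, history):
    """Assemble the context in a fixed order: instructions first, history last."""
    sections = [
        ("INSTRUCTIONS", instructions),
        ("RETRIEVED DOCUMENTS", "\n".join(documents)),
        # Every tool output carries its source and timestamp.
        ("TOOL OUTPUTS", "\n".join(f"[{src} @ {ts}] {out}"
                                   for src, ts, out in tool_outputs)),
        ("CONVERSATION HISTORY", "\n".join(history)),
    ]
    # Empty sections are skipped; the ordering of non-empty ones never changes.
    return "\n\n".join(f"### {name}\n{body}" for name, body in sections if body)

prompt = assemble_prompt(
    instructions="Answer in German. Cite the policy section you used.",
    documents=["Policy 4.2: refunds above EUR 10,000 require finance sign-off."],
    tool_outputs=[("SAP", "2024-09-01T10:00Z", "invoice INV-2024-8841 found")],
    history=["user: where is my credit note?"],
)
```

Because the assembly is code, an instruction can never drift into the middle of the conversation by accident.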
A production system implementing all four WSCI elements typically sees a 40–70% reduction in per-interaction token cost with no measurable quality loss — often with quality improvements, because the model attends to the right information rather than everything it has ever seen.
The Third Decision: Human Oversight Design
No enterprise AI agent should take a consequential action — send an email, update a CRM record, initiate a payment, approve a request — without a human checkpoint. This is not a regulatory concession. It is a quality decision. The human review step is not overhead. It is the highest-signal training data in your system.
When a support lead edits an AI-drafted reply before sending it, that edit encodes something no model training ever could: how your organisation actually communicates, what your brand voice sounds like, which regulatory nuance was missed for this specific customer. That information is worth more than any amount of generic fine-tuning on external datasets.
The implication: instrument your review step to capture every edit. Store the AI draft and the sent version side by side. Edits where the human changes between 5% and 50% of the content are your highest-quality signal — close enough to be informative, different enough to contain genuine correction. Use them as examples for similar future interactions. Quality compounds over time in a way that is specific to your organisation, not available to anyone else.
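A sketch of the edit measurement, using `difflib` sequence similarity as a stand-in for whatever edit-distance metric your team prefers:

```python
import difflib

def edit_percentage(draft: str, sent: str) -> float:
    """Share of the AI draft changed by the human reviewer, 0-100."""
    ratio = difflib.SequenceMatcher(None, draft, sent).ratio()
    return round((1 - ratio) * 100, 1)

def is_high_signal(draft: str, sent: str) -> bool:
    """Edits in the 5-50% band: close enough to be informative,
    different enough to contain genuine correction."""
    return 5.0 <= edit_percentage(draft, sent) <= 50.0
```

Store the draft, the sent version, and the percentage together; the 5–50% band then becomes a simple query over the log rather than a manual triage exercise.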
Captured this way, human edits are the mechanism by which your AI agent becomes specifically accurate for your organisation, in a way no generic model, however capable, can replicate by default.
For European enterprises, there is a second dimension: regulatory accountability. The EU AI Act and sector frameworks across financial services, insurance, pharma, and medical devices all require demonstrable human oversight for high-risk AI decisions. A well-designed review architecture is both a quality mechanism and a compliance record — from a single engineering decision.
The Fourth Decision: Multi-Agent Orchestration
Most enterprise workflows are not single-step. A contract review requires legal analysis, financial risk, compliance verification, and an executive summary — in parallel. A supply chain disruption requires logistics alternatives, supplier communication, customer notification, and inventory reallocation — simultaneously. A single agent cannot handle this well: the context collapses under the breadth, and the specialised knowledge each sub-task requires cannot cleanly coexist in one context window.
Multi-agent architectures distribute work across specialised agents coordinated by an orchestrator. Three patterns matter for enterprise use:
Parallel Specialisation
The orchestrator assigns the same input — a document, a case, a query — to multiple specialised agents simultaneously. A contract is reviewed by a legal agent, a financial risk agent, and a compliance agent — all at the same time. Each agent operates with a clean, task-specific context containing only what it needs. Results return to the orchestrator and are synthesised into a unified output. Total elapsed time is the duration of the slowest sub-agent — not the sum of all three.
Best for: multi-dimensional analysis where the same input needs independent evaluation from several expert perspectives.
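A minimal sketch of the fan-out with `concurrent.futures`. The specialist agents here are stubs standing in for real model calls with task-specific contexts:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub specialists; in production each would call a model with its own
# clean, task-scoped context.
def legal_agent(doc):      return {"agent": "legal", "finding": f"clauses ok in {doc}"}
def financial_agent(doc):  return {"agent": "financial", "finding": f"risk low in {doc}"}
def compliance_agent(doc): return {"agent": "compliance", "finding": f"GDPR ok in {doc}"}

def orchestrate(doc, agents):
    """Fan the same input out to every specialist, then synthesise the results."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = list(pool.map(lambda agent: agent(doc), agents))
    # Synthesis step: here just a merge; in production, a dedicated summariser.
    return {r["agent"]: r["finding"] for r in results}

review = orchestrate("contract-7741.pdf",
                     [legal_agent, financial_agent, compliance_agent])
```

Elapsed time tracks the slowest specialist, which is the economic argument for this pattern.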
Sequential Pipeline
The output of one agent becomes the structured input to the next. Document Parser → Entity Extractor → Compliance Checker → Approval Drafter. Each agent in the chain receives exactly the information it needs — no more — as structured input from the prior stage. The critical engineering discipline is the handoff format: each agent's output must be explicitly structured for the next agent's context, not left as free-form text that the next agent has to re-interpret.
Best for: document-intensive industries — insurance claims, regulatory filings, pharmaceutical batch records, contract generation from templates.
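A sketch of such a pipeline with structured handoffs. The stage logic is deliberately toy; the point is that every stage emits a dict shaped for the next stage, never free-form text:

```python
# Each stage's return value is the next stage's explicit input contract.
def parse_document(raw: str) -> dict:
    return {"text": raw.strip(), "pages": 1}

def extract_entities(parsed: dict) -> dict:
    # Toy rule: treat all-caps words as entities.
    return {"entities": [w for w in parsed["text"].split() if w.isupper()]}

def check_compliance(extracted: dict) -> dict:
    return {"entities": extracted["entities"],
            "compliant": "GDPR" in extracted["entities"]}

def draft_approval(checked: dict) -> str:
    verdict = "approved" if checked["compliant"] else "needs review"
    return f"Draft: {verdict} ({len(checked['entities'])} entities)"

def run_pipeline(raw, stages):
    result = raw
    for stage in stages:  # output of one stage becomes input to the next
        result = stage(result)
    return result

out = run_pipeline("  Claim under GDPR for ACME  ",
                   [parse_document, extract_entities, check_compliance, draft_approval])
```

If a stage's output format changes, the next stage's contract breaks loudly at the handoff, rather than silently as misread free text.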
Shared Context Layer
All agents in a workflow share a common read-only context layer — the customer account, the applicable regulatory framework, the project brief — while maintaining entirely separate contexts for their specialised tasks. This prevents duplication, ensures consistency across the workflow, and keeps each agent's operational context lean.
Best for: long-running enterprise workflows where multiple agents need the same foundational information but different working contexts over time.
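One way to make the shared layer genuinely read-only in Python is `types.MappingProxyType`. The field names are illustrative:

```python
from types import MappingProxyType

# Shared foundation every agent can read but none can mutate.
shared = MappingProxyType({
    "customer": "ACME GmbH",
    "regulatory_framework": "EU AI Act",
    "project_brief": "Q3 billing migration",
})

def make_agent_context(shared_layer, task_items):
    """Combine the read-only shared layer with a private working context."""
    return {"shared": shared_layer, "working": list(task_items)}

legal_ctx = make_agent_context(shared, ["clause 4 review"])
ops_ctx = make_agent_context(shared, ["rollout checklist"])
# Both agents see the same foundation; their working contexts never mix.
```

Any attempt to write through the proxy raises `TypeError`, which turns accidental context bleed into an immediate, visible failure.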
The most common multi-agent failure is context bleed: sub-agents receive information from other agents' tasks that has no bearing on their own work. The fix is strict context scoping at the orchestration layer — each sub-agent gets only what its specific task requires, nothing carried over from its siblings.
The Underrated Requirement: Context Visibility
None of the above can be optimised if you cannot see what is inside the context at the moment the model responds. Context visibility is a production monitoring mechanism — not a debugging tool. It shows you exactly what was assembled for any given interaction: which documents were retrieved, which layer they came from, how many tokens each layer consumed, what the full prompt looked like before it reached the model.
In one deployment, the user profile layer was silently returning empty results due to a database index failure. The agent operated without any customer context for eleven days before anyone noticed. The quality degradation was gradual enough to be misattributed to model drift — a vague problem with no clear owner. Once a context visualiser surfaced the empty slot, the fix took twenty minutes. Eleven days of degraded output that did not need to happen.
For European enterprises, context visibility also serves as an audit trail. GDPR data subject access requests require reconstructing which customer data entered which AI interaction. EU AI Act Article 13 requires documentation of AI inputs, not just outputs. A context log satisfies both — automatically — if the architecture captures it from the start.
You cannot comply with what you cannot log. You cannot optimise what you cannot see.
What Level 3 Actually Looks Like
A properly context-engineered enterprise agent has these properties — all achievable:
- Every layer is explicitly designed, not emergent. Memory categories defined. Compression thresholds set. Isolation enforced.
- Cost per interaction is known and stable — it does not scale steeply with session length.
- Response quality at exchange twenty is measurably close to exchange three.
- Every human edit is captured and recycled as a signal for future similar interactions.
- Every AI interaction has a complete, queryable context log satisfying audit and compliance requirements.
- The system knows what it does not know — and routes to human review rather than guessing.
Most teams are not there yet. That is expected. Moving from Level 1 to Level 3 is a programme of work, not a sprint. The four decisions above are the sequence.
Three Steps to Take This Week
Three concrete steps, each actionable within a single sprint:
Build a context log for your most critical agent interaction
It does not need to be sophisticated — just capture the full assembled prompt, token count per layer, and model response, searchable by session ID and timestamp. Build this first, before optimising anything else. Every subsequent decision will be better informed.
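A deliberately unsophisticated starting point, assuming a JSONL file and a rough four-characters-per-token estimate (both stand-ins for whatever your stack provides):

```python
import json
import time

def log_context(session_id, layers, full_prompt, response,
                path="context_log.jsonl"):
    """Append one interaction's assembled context to a JSONL log.

    `layers` maps layer name -> the text that layer contributed.
    """
    record = {
        "session_id": session_id,
        "timestamp": time.time(),
        # Crude estimate; replace with a real tokenizer count.
        "tokens_per_layer": {name: len(text) // 4 for name, text in layers.items()},
        "full_prompt": full_prompt,
        "response": response,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Because each record is one JSON line keyed by session ID and timestamp, it is already searchable with `grep` or `jq` before any tooling exists around it.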
Run a compression audit on your highest-cost system calls
Identify the three data calls that return the most tokens in a typical session. Extract only the fields the agent actually needs before those outputs enter the context. Measure token count before and after. In most deployments, this single step cuts per-interaction cost by 20–40% — no model change required.
Log your human review edits
Wherever a human reviews AI output before action — emails, summaries, reports, recommendations — log both versions and compute the edit percentage. Identify edits in the 5–50% range. Use a selection of them as examples in future interactions. Measure quality against your current baseline over two weeks.
None of these require a new model, vendor, or framework. They require engineering attention directed at the right layer — the context layer.
Coming in Part 3
The final post covers what happens when context engineering alone is not enough. Topics include: why a smaller, carefully curated knowledge base consistently outperforms a comprehensive one; retrieval approaches beyond standard document search; reinforcement learning from human feedback using open-source models; and voice agents under real-world latency constraints — including the counterintuitive finding that for well-structured knowledge bases, loading content directly into the system prompt often outperforms retrieval.
We will also introduce the Care Framework: the human practices — not engineering ones — that separate AI products organisations genuinely use from those quietly deprecated three months after launch.
Part 3 publishes next month. Subscribe at prodata.ai/insights to receive it directly.