By Dickey Singh, CEO and Founder, Cast.app
The AI agent that demoed flawlessly at last quarter's executive review just sent your largest renewing customer a proposal addressed to a champion who left in March, proposed a Friday meeting at an account that doesn't take Friday meetings, and cited a renewal number that doesn't match the contract.
You watch the replay. The model still works. The retrieval still works. The prompt is identical to the one your team approved last quarter. Nothing has changed in any of the layers anyone is monitoring.
The agent has just quietly degraded.
That is the gap between a demo and production.
It is not a model problem.
It is not a prompt problem.
The model is not getting dumber.
The substrate around the model is rotting — and the substrate has four layers, each rotting on its own timeline.
Most teams say "context engineering" or "prompt engineering" when they actually mean four distinct jobs. The conflation is the reason agents look great in demos and degrade in production.
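The four, by time horizon: prompt (this single turn), context (what is assembled into the window for that turn), memory (what persists across sessions, customers, and years), and harness (the loop every run executes inside).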
Each has its own failure modes and its own skill set.
Most teams do prompt confidently, context accidentally, memory rarely, and harness never.
That distribution is exactly why agents demo well and break in production.
Prompt: the instruction itself. Wording, role framing, constraints, examples. The briefing note slid across the desk for this single turn.
How it fails in production: brittle phrasing, format drift, scope creep, and silent breakage after a model upgrade. The agent did not change. Its substrate did.
The verdict: table stakes. Most teams are competent here, which is exactly why most teams over-credit it. Prompts are the cheapest of the four crafts to master and the easiest to mistake for the whole job.
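Format drift and silent breakage after a model upgrade are the prompt failures that are easiest to catch mechanically. A minimal sketch, assuming the agent is asked to return a JSON object: the key names in `REQUIRED_KEYS` and the function `check_output_format` are hypothetical, invented here for illustration, and not taken from any product.

```python
import json

# Hypothetical output contract for a renewal-proposal prompt (assumed, not from the article).
REQUIRED_KEYS = {"recipient", "meeting_day", "renewal_amount"}


def check_output_format(raw: str) -> list:
    """Return a list of contract violations; an empty list means the output conforms."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    missing = sorted(REQUIRED_KEYS - set(data))
    return [f"missing key: {key}" for key in missing]
```

Running a check like this over a fixed set of saved inputs before and after a model upgrade is one way to turn silent breakage into a visible failure.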
Context: what information is in the window right now. Retrieval, ranking, ordering, compression. What you lay on the desk before the agent starts the task.
How it fails in production: window-stuffing, retrieval misses, the right facts buried behind the wrong ones. Bad retrieval looks like a dumb model. The model is fine. The wrong evidence was loaded.
The verdict: bifurcating. The infrastructure — vector databases, rerankers, million-token windows — is converging on commodity. The assembly policy — what to load, in what order, compressed how, for which turn — is not. But assembly policy is downstream of memory: you cannot rank what to load without knowing what the agent already knows. Vendors who lead with context as their moat are usually revealing they have no memory layer underneath.
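An assembly policy can be as simple as "rank, then pack to a budget." A minimal sketch under stated assumptions: `Snippet`, `estimate_tokens`, and `assemble_context` are hypothetical names, the relevance score is assumed to come from some retriever or reranker, and the word count is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Snippet:
    text: str
    relevance: float  # assumed score from a retriever or reranker; higher is better


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def assemble_context(snippets: List[Snippet], budget: int) -> List[Snippet]:
    """Greedily pack the most relevant snippets into a fixed token budget."""
    chosen: List[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost > budget:
            continue  # skip what does not fit rather than truncating mid-fact
        chosen.append(snippet)
        used += cost
    return chosen
```

The design choice worth noticing is that the result is ordered by relevance, so the right facts are not buried behind the wrong ones. A real policy would also draw on memory to decide what the agent already knows.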
Memory: what persists across sessions, customers, and years. The filing cabinet across the hall. What to store, what to update, what to forget, and who is allowed to see what.
The hardest question in memory is not what to store. It is what not to store. Bad memory poisons future reasoning.
How it fails in production: stale facts treated as current, cross-tenant leakage, the agent eating its own outputs and treating yesterday's hallucination as today's source of truth. This is the layer where the renewal proposal goes to the champion who left in March. The model is fine. The filing cabinet is full of forged documents. The bill arrives in churned renewals, expansion lost to misfires, and CSM hours spent unwinding the agent's mistakes.
The verdict: the least mature layer in the industry. Most products that claim to have "memory" have a key-value store and a vibes-based retrieval policy. Real memory engineering involves typed schemas, conflict resolution, decay rules, and tenant scoping enforced at the storage layer — the kind of work that takes years to get right at enterprise scale. It compounds over time and creates switching costs no clever prompt can match.
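To make typed schemas, conflict resolution, decay rules, and tenant scoping concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the `Fact` fields, the `MemoryStore` class, the "newest observation wins" rule, and the choice to reject writes whose source is the agent itself. A production system would enforce these rules in the storage layer, not in a Python dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    tenant_id: str         # tenant scoping: every fact belongs to exactly one tenant
    key: str               # e.g. "champion"
    value: str
    observed_at: datetime
    ttl: timedelta         # decay rule: how long the fact may be trusted
    source: str            # e.g. "crm", "email", "agent"


class MemoryStore:
    def __init__(self) -> None:
        self._facts: Dict[Tuple[str, str], Fact] = {}

    def write(self, fact: Fact) -> bool:
        # Do not let the agent's own outputs become tomorrow's source of truth.
        if fact.source == "agent":
            return False
        current = self._facts.get((fact.tenant_id, fact.key))
        # Conflict resolution (assumed policy): the newest observation wins.
        if current is not None and current.observed_at >= fact.observed_at:
            return False
        self._facts[(fact.tenant_id, fact.key)] = fact
        return True

    def read(self, tenant_id: str, key: str, now: datetime) -> Optional[str]:
        fact = self._facts.get((tenant_id, key))
        if fact is None or now - fact.observed_at > fact.ttl:
            return None  # unknown or expired facts are reported as unknown
        return fact.value
```

With a store like this, a champion record last observed in January with a 90-day lifetime reads as unknown by May, instead of being served as current.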
Harness: the loop the agent runs inside. Tool design, retries, validators, sub-agent routing, termination criteria. The plumbing that decides when "done" means done.
The raw model is a probabilistic text engine. The harness turns it into an operational system.
How it fails in production: silent success — the agent reports "done" on a ticket it never actually resolved. Runaway loops. Tool failures that do not propagate. Refund actions that should have been blocked. Frameworks like LangGraph, CrewAI, and AutoGen are useful scaffolding and still evolving. Almost no one builds production-grade harnesses with typed tool contracts, explicit failure-mode handling, and idempotent rollback for high-stakes actions.
The verdict: becoming the operating system of agentic AI and the single largest source of differentiation in production systems.
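A harness earns its keep by refusing to take the agent's word for it. The sketch below is a minimal illustration, not a production design: `ToolResult` and `Harness` are hypothetical names, and it covers only bounded retries, an independent validator, explicit failure propagation, and idempotency keys. Rollback and sub-agent routing are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ToolResult:
    ok: bool      # what the tool claims
    detail: str   # human-readable outcome or error


class Harness:
    """Hypothetical harness: bounded retries, independent validation, idempotency."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts  # termination criterion
        self._completed: Dict[str, ToolResult] = {}

    def run(
        self,
        idempotency_key: str,
        tool: Callable[[], ToolResult],
        validate: Callable[[ToolResult], bool],
    ) -> ToolResult:
        # A verified action is never executed twice under the same key.
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]
        last = ToolResult(False, "not attempted")
        for _ in range(self.max_attempts):
            try:
                result = tool()
            except Exception as exc:
                # Tool failures surface as explicit results instead of vanishing.
                last = ToolResult(False, f"tool raised: {exc}")
                continue
            if result.ok and validate(result):
                self._completed[idempotency_key] = result
                return result
            last = ToolResult(False, f"rejected: {result.detail}")
        return last  # retries exhausted: report failure, never a silent "done"
```

The validator is the piece that blocks silent success: "done" counts only when a check that is independent of the tool's own claim agrees.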
Prompt is table stakes. Context infrastructure is converging. The vendors loudest about "context engineering" are usually the ones structurally unable to do memory and harness well, because their architecture was built around a CRM or a CSP — where memory is just data and harness is just workflows, neither designed for agentic use. CRM data lacks typed decay and conflict resolution. Workflows lack probabilistic termination logic and failure-mode handling.
Memory and harness are where durable moats live. They do not yield to a clever retrieval trick. They require years of schema work, conflict resolution, and operational hardening. They cannot be retrofitted onto a system that was not designed for them.
Three questions to ask any AI vendor pitching your post-sales org:
Most vendors fail all three. The ones who answer cleanly are the ones to bet on.
These same questions also serve as a self-test if you are weighing whether to build. The strategic side of that decision (which agents are core to your business and which are coverage) has its own scorecard: "Build vs. buy AI agents: 10-point scorecard."
Prompt is per-turn. Context is per-turn assembly. Memory persists across years. Harness governs every loop the agent will ever run.
Bet on the disciplines that compound. Memory and harness over prompt and context.
Dickey Singh is the founder of Cast (cast.app), where the team is building agentic CS infrastructure for enterprise customers including HPE, Pure Storage, Cloudera, and CDK Global. Cast itself runs on agentic engineering — the team uses Cursor, Claude Code, Codex, and GitHub bots daily. Cast breaks the headcount cycle: more accounts no longer means more CSMs.