"The real work starts in orchestration, governance, and architecture, not in tweaking prompts or switching models."
Bharat Saxena
Lead AI Architect
NTT DATA

Enterprises spent the last two years proving that AI agents could work in controlled settings. The problem is that most of those proofs were built on POC-scale architecture, lax acceptance criteria, and the assumption that production could be figured out later. Now, as organizations try to move those systems into real workflows, the failures are not coming from where most teams expect.

Bharat Saxena is Lead AI Architect at NTT DATA, a global IT services company where he designs enterprise-scale AI solutions across financial services, hospitality, and logistics. His recent work includes architecting agentic AI systems for a global hotel chain, building a multi-agent insurance platform across multiple countries, and establishing an AI Center of Excellence for a major logistics enterprise. He holds an MTech in Data Science and Engineering from BITS Pilani.

"The real work starts in orchestration, governance, and architecture. It's not a quick fix about fixing a prompt, switching a model, or swapping from AWS to Azure," said Saxena. Saxena was direct about where production failures actually originate. Teams blame the LLM first. The real culprit is almost always somewhere else.

  • Retrieval, not generation: "Most of the time, the majority of the time, it's the retrieval which is failing, not the LLM," Saxena said. "If you give the right context to the LLM, it is going to give you the right answer. But you are not even giving it the right context, and then we start saying it's hallucinating." In RAG systems, the data foundation feeding the model determines output quality far more than the model itself; a sketch of testing that claim directly follows this list.

  • POC standards in production clothing: Traditional production systems are held to 99.999% availability targets. "None of those benchmarks or standards exist for AI systems," Saxena said. "They're happy to wait five minutes as long as it gives them an answer. They're also happy to accept hallucination if it only happens a couple of times out of 10 or 15." That tolerance works in a demo. It does not work when the system is making operational decisions at scale.
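
Saxena's retrieval point is checkable before any model change. Below is a minimal sketch in Python of evaluating retrieval on its own, using a hypothetical `retriever.search(query, k)` interface and a hand-labeled evaluation set; none of these names come from the interview.

```python
# Illustrative sketch: measure retrieval quality before blaming the model.
# Assumes a hypothetical retriever exposing search(query, k) that returns
# objects with a .doc_id attribute; each evaluation case records which
# documents actually contain the answer to its question.

from dataclasses import dataclass


@dataclass
class RetrievalCase:
    question: str
    relevant_doc_ids: set[str]  # ground-truth source documents


def retrieval_hit_rate(retriever, cases: list[RetrievalCase], k: int = 5) -> float:
    """Fraction of questions whose top-k results include a ground-truth source."""
    hits = 0
    for case in cases:
        results = retriever.search(case.question, k=k)
        returned_ids = {r.doc_id for r in results}
        if returned_ids & case.relevant_doc_ids:
            hits += 1
    return hits / len(cases)

# If this number is low, the model never saw the right context in the first
# place, and "hallucination" fixes aimed at the LLM will not help.
```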

The challenge intensifies in multi-agent architectures. Saxena described a pattern he encountered repeatedly: an error introduced by the first agent in a chain propagates through each subsequent agent, and by the final output, there is no way to trace back to the source.

  • The logistics example: Saxena's team worked with a U.S. logistics company optimizing trailer utilization. An agent would reroute a trailer from point A to point C based on capacity calculations. "The decision was made, but they most often fail to log why that decision was made," Saxena said. "At the end of the day, all they know is that something wrong happened. But they don't know why." Without that audit trail, the system could not learn from its own mistakes.

  • The real system of record: Saxena argued that building a proper data layer with robust logging and audit trails answers the system-of-record question automatically. "Once you have your data layer built properly, the accountability question gets answered," he said. Logs become the data points agents need to improve; one possible shape for such a record is sketched after this list.
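
The audit-trail gap from the logistics example is concrete enough to sketch. Here is one possible shape for a per-decision record, in Python, with invented field names and a generic append-only `store`; it illustrates the idea rather than describing the actual system.

```python
# Illustrative sketch: a structured record an agent writes each time it acts.
# Capturing the "why" and the link to upstream decisions is what lets a bad
# final output be traced back to the agent that introduced the error.

import json
import uuid
from datetime import datetime, timezone


def log_agent_decision(store, agent: str, inputs: dict, decision: str,
                       rationale: str, parent_decision_ids: list[str]) -> str:
    """Append one auditable decision record and return its id."""
    record = {
        "decision_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "inputs": inputs,                            # what the agent saw
        "decision": decision,                        # what it chose to do
        "rationale": rationale,                      # why it chose that
        "parent_decision_ids": parent_decision_ids,  # upstream links for tracing
    }
    store.append(json.dumps(record))                 # any append-only sink works
    return record["decision_id"]

# Hypothetical usage, loosely modeled on the rerouting decision above:
# log_agent_decision(
#     store=audit_log,
#     agent="trailer_router",
#     inputs={"trailer": "T-114", "capacity_pct": 38, "route": "A->B"},
#     decision="reroute A->C",
#     rationale="projected utilization below threshold on the A->B leg",
#     parent_decision_ids=[capacity_estimate_id],
# )
```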

The governance gap Saxena flagged most often was not technical. It was organizational. When an agent makes a wrong decision, who is responsible? The enterprise? The agent developer? The platform provider?

  • Ten minutes of silence: "I have heard silence for 10 minutes every time we bring up this question," Saxena said. "If this happens, who is responsible? Who is going to take the risk?" Without that answer, teams default to treating production errors like traditional bugs, opening a ticket in JIRA and running a fix cycle. But AI systems are not deterministic. "You may fix one bug and end up creating two more," he said. "The same bug can appear in 100 different variations until you find the root cause."

  • Governance embedded, not bolted on: Saxena pushed for governance decisions to happen early in the lifecycle, not deferred until scale forces the issue. "Right now, everything gets deferred. 'We'll cross that bridge when we come to it.' That mindset has to change," he said. "It will drive those technology and architectural decisions earlier in the lifecycle."

Saxena closed with a warning about a slower-moving risk most teams are not yet accounting for. As LLM inference costs drop, enterprises are building prompts, workflows, and codebases tuned to specific providers. "These companies are still in market-making phase," he said. "The costs we see are not the costs in two or three years. And after years of fine-tuning your code and prompts for a particular LLM, you cannot just switch overnight. You have to be model-agnostic, and that has to be a conscious effort from the start."
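
Taken seriously, that advice is an architectural decision rather than a procurement one. Below is a minimal sketch of what model-agnostic plumbing can look like in Python; the interface, adapter, and function names are invented for illustration and are not drawn from the interview.

```python
# Illustrative sketch: application code depends on a small internal interface,
# and provider-specific details live behind adapters. Swapping vendors then
# means writing a new adapter, not rewriting every prompt call site.

from typing import Protocol


class LLMClient(Protocol):
    """The only surface the rest of the codebase is allowed to see."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str: ...


class StubClient:
    """Stand-in adapter; a real one would wrap a vendor SDK behind the same method."""
    def complete(self, prompt: str, max_tokens: int = 512) -> str:
        return f"[stub response to a {len(prompt)}-character prompt]"


def summarize_claim(llm: LLMClient, claim_text: str) -> str:
    # Business logic and prompts are written against the interface only.
    return llm.complete("Summarize this insurance claim:\n" + claim_text)


print(summarize_claim(StubClient(), "Water damage reported at the insured property..."))
```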