

Frontier reasoning models are getting measurably better at complex tasks. GPT-5.5, Opus 4.7, Mythos, and their successors handle longer context, reason across more steps, and produce outputs that read with striking confidence. But for enterprises under strict regulatory oversight, confidence is the problem. A model that sounds authoritative while inverting a biological relationship or misattributing a causal chain is not a productivity gain. It is a patient safety risk, an audit failure, or both.
Anuraag Saini is a Senior Scientist and QSP Expert at Boehringer Ingelheim, a global pharmaceutical company where he supports quantitative systems pharmacology across major therapeutic areas and builds AI solutions for drug development workflows. He also founded FireQSP, an AI-augmented QSP modeling platform, and serves as Secretary of the Inflammation and Immunology working group at the International Society of Pharmacometrics. For Saini, the concern is not that frontier models lack capability. It's that capability without traceability creates a new class of risk in sectors where errors carry consequences far beyond a failed deployment.
"Plenty of companies will never be AI-first; take production, or pharma, or oil. LLMs can solve real inefficiencies there, but Mythos is not going to sit at the top of the organization and direct every move. That’s not going to happen," said Saini.
When confidence hides inversion: Saini described a failure pattern specific to biological systems. Suppose Cell A interacts with Cell B through a negative feedback loop, a foundational relationship in disease modeling. If an LLM reads a research paper and reports that relationship as a positive feedback loop instead, the inversion looks perfectly coherent on the surface. "You cannot have your glucose affecting your insulin in the wrong direction," Saini said. "That linearity really matters. The way the events are getting triggered really matters." In drug development, where models inform dosing and efficacy predictions, an inverted relationship can propagate through an entire workflow before anyone catches it.
Ground truth as infrastructure: Saini's team addressed this by treating curated databases as the verification layer. LLMs extracted relationships from research papers, and outputs were back-verified against sources like Reactome and KEGG. "Anywhere you see a discrepancy, flag it," he said. "You are creating your own private database which is tested on multiple accounts, one from the databases, another from the biological research papers." That approach turned validation into a systematic process rather than ad hoc review.
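To make the back-verification idea concrete, here is a minimal sketch in Python. It is not Saini's implementation. It assumes relationships can be reduced to a source, a target and a sign, and it uses a small hand-written dictionary as a stand-in for a curated reference. The names `Relation`, `CURATED` and `back_verify` are hypothetical, and the entries are toy examples, not records taken from Reactome or KEGG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    sign: int  # +1 = increases/activates, -1 = decreases/inhibits


# Hypothetical stand-in for a curated reference. In practice this would be
# built from a curated database export; these two entries are toy examples.
CURATED = {
    ("glucose", "insulin"): +1,  # rising glucose stimulates insulin secretion
    ("insulin", "glucose"): -1,  # insulin lowers blood glucose
}


def back_verify(extracted, curated):
    """Sort LLM-extracted relations into confirmed, flagged and unverified."""
    confirmed, flagged, unverified = [], [], []
    for rel in extracted:
        key = (rel.source, rel.target)
        if key not in curated:
            unverified.append(rel)  # no ground truth: route to human review
        elif curated[key] == rel.sign:
            confirmed.append(rel)   # direction agrees with the reference
        else:
            flagged.append(rel)     # direction inverted: flag the discrepancy
    return confirmed, flagged, unverified


# Hypothetical LLM output, including one inverted relationship.
llm_output = [
    Relation("glucose", "insulin", +1),
    Relation("insulin", "glucose", +1),  # inverted relative to the reference
    Relation("cell_a", "cell_b", -1),    # not covered by the reference
]

confirmed, flagged, unverified = back_verify(llm_output, CURATED)
for rel in flagged:
    print(f"DISCREPANCY: {rel.source} -> {rel.target} extracted as {rel.sign:+d}")
```

The design choice worth noting is the third bucket: a relationship the reference does not cover is neither accepted nor rejected, but held for expert review, which keeps a human in the loop for anything the curated sources cannot confirm.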
The architecture Saini described reflects a principle that extends beyond pharma: keep LLMs out of the critical path unless their contribution is scoped, auditable and reversible.
Scoping to reduce stochasticity: "We are not letting LLMs touch the entire work," Saini said. "The margins of error get diminished quite a lot." His reasoning is practical. When a model operates across an entire complex problem, traceability collapses. "If something goes wrong, how do you really debug it? You won't have a workflow to be able to debug anything, because you are then completely working in a stochastic environment," he said. Narrowing the scope to a specific sub-problem, such as defining the insulin-glucose relationship within a diabetes model, kept the stochastic contribution manageable and the audit trail intact.
Regulation as the constraint: Saini pointed to regulatory frameworks around AI in drug development as the binding constraint that most AI deployment conversations underestimate. Even as deregulation shifts compliance burdens toward enterprise IT, the core expectation remains: prove the system works, prove it is safe, and prove you can trace the decision. "The areas which are still dominated by rules and regulation, those are the areas where unless the regulatory authorities have 100% confidence, things will be very difficult," Saini said. That confidence does not yet exist, and the gap between benchmark performance and regulated-environment requirements remains wide.
The broader signal Saini identified is structural. The companies building frontier models have recognized that life sciences cannot be solved by general-purpose architecture alone. OpenAI launched GPT-Rosalind as a dedicated life sciences reasoning model. Anthropic brought the CEO of Novartis onto its board. Novo Nordisk partnered with OpenAI to integrate AI across drug discovery and manufacturing. These moves confirm what Saini saw from the practitioner side: the sector requires domain-specific investment, not just bigger models, and human expertise remains the anchor that keeps AI outputs tied to reality.
For CIOs and technology leaders in regulated enterprises, the takeaway is not to avoid frontier models but to deploy them with the same discipline applied to any other component in a validated system: scoped inputs, verified outputs, auditable decisions and a human who owns the result. "You have to have rigorous guardrails and benchmarks to make sure that things are as accurate as possible," Saini said.




