Agentic AI in Clinical Trials: The Real Challenge Isn't Automation, It's Trust

Key takeaways from the Swiss Biotech Day panel discussion on Agentic AI in Clinical Trials, featuring leaders from PSI CRO, Basilea Pharmaceutica, Luzsana Biotechnology Europe AG, DataArt, and Arango.

TL;DR

The question facing clinical AI has fundamentally shifted. It is no longer whether AI can automate isolated tasks. It is whether AI systems can operate reliably across the operational complexity, regulatory scrutiny, and contextual nuance of live clinical trials — and whether the humans responsible for those trials can actually trust what the system tells them.

The bottleneck is no longer model experimentation. It is the operational reality of fragmented protocols, disconnected trial systems, version-control risk, and AI outputs that cannot be fully explained or audited.

One theme became clear across the Swiss Biotech Day panel: trust is becoming the defining requirement for enterprise AI in regulated environments. And trust, it turns out, is not a product feature. It is a data infrastructure problem.

From Automation to Agency — and Why the Distinction Matters

At Swiss Biotech Day in Basel, leaders from PSI CRO, Basilea Pharmaceutica, Luzsana Biotechnology Europe AG, DataArt, and Arango discussed what it will actually take for AI agents to support real clinical trial execution — not in demos, but in complex, regulated, live study environments.

The question behind the conversation was direct:

“What would your organisation need from an AI-enabled CRO partner before trusting it on a pivotal study?”

That question cuts to the heart of where enterprise AI adoption in regulated industries actually stalls. Because the challenge is not capability. The challenge is operational trust — and that changes the entire architectural conversation.

Traditional automation systems execute predefined steps in isolation. Agentic systems attempt to reason across context: study startup workflows, site selection, investigator performance, historical trial outcomes, operational bottlenecks, regulatory requirements, data reconciliation. The opportunity is significant. The risk, if the underlying infrastructure is not sound, is equally significant.

Clinical trials expose the limitations of shallow AI architectures faster than almost any other industry. An incorrect AI response in a standard enterprise context is inconvenient. In clinical trials, an uncited reference, retrieval from the wrong protocol version, or an incomplete contextual understanding can have downstream consequences for data integrity, regulatory compliance, and patient safety.

Ground truth matters in clinical trials in a way it simply does not in most enterprise AI contexts.

Where Current AI Architectures Break Down

Most early enterprise AI architectures were built around vector retrieval — finding semantically similar content to answer a query. In many contexts, that is sufficient. In clinical operations, it is not.

Semantic similarity is not the same thing as operational understanding. Without connected context, AI systems cannot reason across the relationships between protocols, investigators, institutions, timelines, outcomes, and regulatory constraints. The result is fragmented retrieval, inconsistent answers, and limited explainability.
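To make the gap between similarity and operational understanding concrete, here is a minimal sketch of one failure mode: the most similar chunk belongs to a superseded protocol version. The `Chunk` fields, the protocol ID, and the similarity scores are hypothetical and chosen purely for illustration; no specific retrieval product is assumed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    protocol_id: str
    version: int
    similarity: float  # hypothetical score from a vector search

def current_version_only(chunks, current_versions):
    """Keep chunks from the current version of each protocol, best match first."""
    kept = [c for c in chunks if current_versions.get(c.protocol_id) == c.version]
    return sorted(kept, key=lambda c: c.similarity, reverse=True)

# Illustrative data: the superseded v2 text is the closest semantic match.
chunks = [
    Chunk("Exclude patients with eGFR below 45.", "PROT-001", 2, 0.93),
    Chunk("Exclude patients with eGFR below 30.", "PROT-001", 3, 0.88),
]
current_versions = {"PROT-001": 3}

by_similarity = max(chunks, key=lambda c: c.similarity)
version_aware = current_version_only(chunks, current_versions)

print("similarity only:", by_similarity.version, by_similarity.text)
print("version aware:  ", version_aware[0].version, version_aware[0].text)
```

Ranking on similarity alone would return the version 2 criterion. Knowing which version is current is relational context that the embedding itself does not carry.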

In regulated industries, explainability is not optional.

Clinical organisations do not operate on isolated documents. They operate on interconnected operational knowledge spread across structured systems, unstructured content, historical trial data, investigator relationships, regulatory workflows, and institutional expertise. An AI system that cannot reason across all of it together is not operationally useful — it is operationally risky.

The Eval Gap: Where Trust Actually Lives

One of the most practically important threads in the panel discussion was the gap between AI systems that appear to work and AI systems that can demonstrate they work. This distinction matters enormously in regulated environments, and it is where most implementations — including many marketed as enterprise-ready — fall short.

The leading clinical and CRO organisations are now evaluating AI systems through a specific lens. Capability is assumed. What they are actually asking is: can the system prove why it produced an answer?

That requires rigorous evaluation infrastructure, not just model performance. The metrics that matter fall into a few distinct categories.

Retrieval quality can be measured without an LLM in the loop at all. Precision@K tells you what proportion of the top retrieved chunks were actually relevant. Recall@K tells you how much relevant information the system surfaced. Mean Reciprocal Rank and NDCG measure whether correct documents were ranked highly. Hit rate tells you simply whether the right document appeared. These metrics are cheap to run and immediately revealing — and most production deployments are not tracking them.
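As a rough sketch of how cheap these checks are, the retrieval metrics above can be computed in plain Python from a ranked list of retrieved chunk IDs and a labelled set of relevant IDs. The function names and the example IDs are assumptions made for illustration, and relevance is treated as binary.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Share of all relevant IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def hit_rate(retrieved, relevant, k):
    """1.0 if any relevant ID appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID; averaging over queries gives MRR."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain relative to an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical query result and hand-labelled ground truth.
retrieved = ["protocol_v3_s4", "protocol_v2_s4", "site_report_12", "protocol_v3_s7"]
relevant = {"protocol_v3_s4", "protocol_v3_s7"}

print("precision@3:", precision_at_k(retrieved, relevant, 3))
print("recall@3:   ", recall_at_k(retrieved, relevant, 3))
print("hit rate@1: ", hit_rate(retrieved, relevant, 1))
print("RR:         ", reciprocal_rank(retrieved, relevant))
print("NDCG@4:     ", round(ndcg_at_k(retrieved, relevant, 4), 4))
```

None of these functions calls a model. The expensive part is the labelled relevance set, not the computation.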

Generation quality adds another layer. Groundedness scoring, the proportion of sentences in a response that can be traced back to a retrieved source, can be approximated with rule-based string matching. Source diversity, citation coverage, and empty retrieval rate all surface failure modes that aggregate metrics such as ROUGE or answer relevancy tend to obscure.
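A minimal sketch of rule-based groundedness scoring might look like the following. It assumes a sentence counts as grounded when at least 70% of its tokens appear in a single retrieved source; that threshold, the naive sentence splitter, and the example texts are illustrative assumptions rather than an established standard.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def groundedness(response, sources, threshold=0.7):
    """Fraction of response sentences mostly covered by one retrieved source."""
    sentences = split_sentences(response)
    if not sentences:
        return 0.0
    source_tokens = [tokens(s) for s in sources]
    grounded = 0
    for sentence in sentences:
        sent_tokens = tokens(sentence)
        if not sent_tokens:
            continue
        best = max(
            (len(sent_tokens & st) / len(sent_tokens) for st in source_tokens),
            default=0.0,
        )
        if best >= threshold:
            grounded += 1
    return grounded / len(sentences)

def empty_retrieval_rate(retrievals):
    """Share of queries for which the retriever returned nothing."""
    if not retrievals:
        return 0.0
    return sum(1 for r in retrievals if not r) / len(retrievals)

# Hypothetical source chunk and model response.
sources = ["Amendment 3 raises the enrolment target to 120 patients per site."]
response = (
    "Amendment 3 raises the enrolment target to 120 patients. "
    "The amendment was approved unanimously by all investigators."
)

print("groundedness:", groundedness(response, sources))
print("empty retrieval rate:", empty_retrieval_rate([["c1"], [], ["c2", "c3"], []]))
```

Token overlap is a blunt instrument: it rewards copied wording and misses paraphrase, so it complements human review and does not replace it.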

Human annotation on dedicated annotation platforms adds the judgment that automated metrics cannot: faithfulness, completeness, coherence, and hallucination flagging. These are the metrics that carry weight in regulatory conversations.

The absence of this evaluation infrastructure is not a minor gap. In clinical environments, an AI system without rigorous eval metrics is not trusted AI. It is just AI — with all the liability that entails and none of the verifiability that regulated environments demand.

Beyond Retrieval: The Case for Learned Inference

Evaluation infrastructure reveals the gaps. Closing them requires rethinking the underlying architecture.

The most sophisticated agentic AI systems being built for complex operational environments are moving beyond retrieval-augmented generation toward what might better be described as learned inference. The difference is meaningful.

RAG-based systems retrieve relevant content and condition a response on it. Learned inference systems develop a structural understanding of a domain — the relationships between entities, the temporal dynamics of a system, the causal pathways that connect events — and reason from that understanding rather than from retrieved text alone.

In clinical trial contexts, this distinction surfaces in questions like: why did this site underperform on this protocol? Which investigator has the strongest combination of therapeutic area expertise, enrolment history, and protocol compliance? If this protocol amendment had been issued two weeks earlier, what would the downstream impact on site readiness have been?

These are not retrieval questions. They are inference questions. And answering them reliably — at the speed and scale that agentic AI promises — requires architectural approaches that go beyond semantic search.

Graph-native reasoning, combining structured relationships with unstructured content and temporal dynamics, is where the most credible work in this space is happening. It produces the provenance, traceability, and explainability that clinical teams actually need.
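As a toy illustration of how structured relationships can carry provenance, the sketch below answers the investigator question over a handful of typed edges and returns the records each score was derived from. The edge schema, the investigator names, the source file names, and the unweighted average are all hypothetical; a production system would use a graph database and a far richer model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    value: float
    source: str  # record the fact came from, kept for provenance

# Hypothetical facts about three investigators.
EDGES = [
    Edge("dr_a", "expertise_in", "oncology", 1.0, "cv_2023.pdf"),
    Edge("dr_a", "enrolment_rate", "trial_101", 0.9, "ctms_export_1"),
    Edge("dr_a", "compliance_score", "trial_101", 0.8, "audit_17"),
    Edge("dr_b", "expertise_in", "oncology", 1.0, "cv_2022.pdf"),
    Edge("dr_b", "enrolment_rate", "trial_102", 0.5, "ctms_export_1"),
    Edge("dr_b", "compliance_score", "trial_102", 0.9, "audit_21"),
    Edge("dr_c", "expertise_in", "cardiology", 1.0, "cv_2021.pdf"),
    Edge("dr_c", "enrolment_rate", "trial_103", 0.95, "ctms_export_2"),
]

def rank_investigators(edges, area):
    """Rank investigators with expertise in `area`, keeping evidence sources."""
    by_subject = {}
    for edge in edges:
        by_subject.setdefault(edge.subject, []).append(edge)

    results = []
    for investigator, facts in by_subject.items():
        has_expertise = any(
            f.relation == "expertise_in" and f.obj == area for f in facts
        )
        if not has_expertise:
            continue
        evidence = [
            f for f in facts if f.relation in ("enrolment_rate", "compliance_score")
        ]
        if not evidence:
            continue
        # Illustrative scoring only: an unweighted mean of the evidence values.
        score = sum(f.value for f in evidence) / len(evidence)
        results.append({
            "investigator": investigator,
            "score": score,
            "provenance": sorted({f.source for f in evidence}),
        })
    return sorted(results, key=lambda r: r["score"], reverse=True)

ranking = rank_investigators(EDGES, "oncology")
for row in ranking:
    print(row["investigator"], round(row["score"], 2), row["provenance"])
```

The point of the sketch is the shape of the answer: every score arrives with the list of records it was computed from, so a reviewer can check the reasoning instead of trusting it.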

Operational Coordination: Why Clinical Trials Are a Graph Problem

There is a second architectural challenge that receives less attention than it deserves: real-time operational coordination under uncertainty.

A clinical trial network is, structurally, a graph. Sites, investigators, sponsors, CROs, data systems, protocol versions, enrolment pipelines — these are nodes with relationships, dependencies, and temporal dynamics. When a site underperforms, the risk does not stay local. It propagates. Enrolment shortfalls at one site shift pressure to others. Data integrity issues at an institution affect downstream regulatory submissions. Protocol amendments ripple across investigator readiness and site activation timelines.
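The propagation described above can be sketched with a simple breadth-first walk over a dependency graph, with risk attenuated at each hop. The node names, the starting risk of 1.0, and the decay factor of 0.5 are illustrative assumptions, not values taken from any real trial.

```python
from collections import deque

# Hypothetical graph: each key lists the nodes that depend on it.
DEPENDENTS = {
    "site_basel": ["enrolment_pipeline"],
    "site_zurich": ["enrolment_pipeline"],
    "enrolment_pipeline": ["regulatory_submission"],
    "regulatory_submission": [],
}

def propagate_risk(graph, origin, initial_risk=1.0, decay=0.5):
    """Spread risk from `origin` to downstream nodes, halving it at each hop."""
    risk = {origin: initial_risk}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        passed_on = risk[node] * decay
        for downstream in graph.get(node, []):
            # Keep the highest risk seen for a node; revisit only on an increase.
            if passed_on > risk.get(downstream, 0.0):
                risk[downstream] = passed_on
                queue.append(downstream)
    return risk

exposure = propagate_risk(DEPENDENTS, "site_basel")
for node, value in exposure.items():
    print(f"{node}: {value:.2f}")
```

A shortfall at one site reaches the regulatory submission two hops later, while the unaffected site never appears in the result. Real propagation would depend on timing and capacity, which this static sketch ignores.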

Most clinical AI systems treat this as a reporting problem. The more important question is whether AI can reason about it as a coordination problem — predicting where risk is accumulating before it materialises, identifying the points in the network whose failure would produce the largest downstream impact, and recommending specific interventions with the evidence to support them.

The most advanced work in this space applies temporal graph networks to model how operational risk propagates through interconnected systems in near real-time. Rather than displaying a snapshot of current state, these systems reason about future state — using causal attribution across the graph to identify failure origins, running counterfactual simulations to understand what would have happened under different conditions, and surfacing adversarial stress scenarios that expose single points of maximum vulnerability before they become incidents.
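A temporal graph network is well beyond a short example, but one idea in this paragraph, finding the node whose failure would touch the most of the network, can be approximated statically by counting downstream reach. The sketch below is that simplification only. It is not a temporal or causal model, and the graph is hypothetical.

```python
# Hypothetical trial network: each key lists the nodes that depend on it.
NETWORK = {
    "lead_investigator": ["site_a", "site_b"],
    "site_a": ["data_lock"],
    "site_b": ["data_lock"],
    "central_lab": ["data_lock"],
    "data_lock": ["submission"],
    "submission": [],
}

def downstream_of(graph, start):
    """All nodes reachable from `start`, excluding `start` itself."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def rank_by_impact(graph):
    """Nodes ordered by how many others a failure would reach (ties by name)."""
    impact = {node: len(downstream_of(graph, node)) for node in graph}
    return sorted(impact.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_by_impact(NETWORK)
for node, reach in ranking:
    print(f"{node}: affects {reach} downstream node(s)")
```

In this toy network a single investigator who anchors two sites is the largest point of vulnerability, which is the kind of dependency cluster a snapshot dashboard tends not to show.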

This is the difference between a breathing network and a dashboard. A dashboard tells you what happened. A breathing network tells you what is about to happen, why, and what to do about it, with full auditability of the reasoning chain.

Applied to clinical operations, that means an AI system capable of reasoning across the full trial network: predicting site enrolment risk before it becomes a shortfall, identifying investigator dependency clusters that represent operational single points of failure, and surfacing the intervention options — with estimated impact — that a trial manager can act on in time to matter.

The operational intelligence layer is not separate from the trust problem. It is central to it.

PSI CRO: What Trusted Clinical AI Looks Like in Practice

The challenges discussed during the panel are already playing out in production clinical environments.

Site selection is one of the most consequential decisions in drug development, and one of the most fragmented. Historically, 40–50% of clinical trial sites underperform or fail to enrol a single patient. The underlying issue is rarely a lack of data. It is that critical knowledge (investigator history, institutional relationships, protocol fit, operational signals) is spread across systems that were never designed to talk to each other.

PSI CRO addressed this by building SYNETIC™, an AI-enabled knowledge engine that unifies structured and unstructured clinical trial data into a connected contextual layer. Rather than retrieving documents, the system reasons across investigator history, protocols, institutions, outcomes, and operational signals together — producing recommendations that are grounded, explainable, and auditable.

The result is site selection reduced from weeks to minutes, with recommendations that include not just a ranked list but the reasoning behind it: the study factors that drove the assessment, the confidence levels, the provenance of the data used.
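To show what a recommendation that carries its own reasoning could look like as data, here is a hypothetical record structure with contributing factors, a confidence level, and the source of each factor. It is not SYNETIC's actual schema, which the panel did not describe; every field name and value below is an assumption made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Factor:
    name: str
    contribution: float  # how much this factor moved the score
    source: str          # where the underlying data came from

@dataclass
class SiteRecommendation:
    site_id: str
    confidence: float
    factors: list = field(default_factory=list)

    @property
    def score(self):
        """Total score as the sum of factor contributions (illustrative)."""
        return sum(f.contribution for f in self.factors)

    def explain(self):
        """Human-readable summary: score, confidence, then factors by weight."""
        lines = [
            f"{self.site_id}: score={self.score:.2f}, "
            f"confidence={self.confidence:.2f}"
        ]
        ordered = sorted(self.factors, key=lambda f: f.contribution, reverse=True)
        for f in ordered:
            lines.append(f"  {f.name}: {f.contribution:+.2f} (source: {f.source})")
        return "\n".join(lines)

# Hypothetical recommendation for one candidate site.
recommendation = SiteRecommendation(
    site_id="site_017",
    confidence=0.8,
    factors=[
        Factor("protocol_fit", 0.25, "feasibility_survey"),
        Factor("enrolment_history", 0.5, "ctms_export"),
    ],
)

print(recommendation.explain())
```

Keeping the factors and their sources on the record itself means the ranked list can be interrogated after the fact, which is what separates an auditable recommendation from a bare score.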

That is what production-grade trusted clinical AI looks like. Not a demo. An operational system that can be interrogated, audited, and defended.

To see how PSI CRO is applying this approach to clinical trial site selection, read the full PSI CRO Case Study.

The Path Forward: Gradual Autonomy Built on Verifiable Infrastructure

One of the more grounded perspectives from the panel was that fully autonomous clinical AI is unlikely to arrive overnight. Nor should it.

The path toward greater automation in regulated industries will happen gradually, and it will be driven by verifiable context — not model size, and not automation capability alone. Human oversight remains essential. What changes over time is the depth of the contextual infrastructure supporting the human-AI collaboration: the quality of retrieval, the rigour of evaluation, the explainability of reasoning, and the auditability of every decision in the chain.

The organisations leading this transition are not the ones chasing the most impressive demo. They are the ones building the infrastructure that makes trust possible — and doing the unglamorous work of evaluation, provenance, and governance that separates production AI from experimental AI.

That is ultimately what makes clinical trials the right proving ground for enterprise AI. The regulatory environment demands a standard of verifiability that will define what responsible agentic AI looks like across industries.

The standard is high. It should be.

What’s Next?

The Context Gap: Why Frankenstacks Can’t Solve It and How Arango Does
See why stitching together graph, vector, and document systems breaks down in production — and how a unified contextual data layer changes what’s possible for regulated AI.

Watch the Video

The Definitive Guide to Agentic AI-Ready Data Architecture
If you’re building or evolving an AI data stack for clinical or regulated environments, this guide covers the architectural decisions that separate pilots from production-grade systems.

Get the Guide

See Arango in Action
See Arango in Action Talk to our team and see how Arango gives your AI agents, assistants, and apps the unified, current, and trusted business context they need to reason, decide, and act — in clinical trials and beyond.

Book a Demo
