Inside Stanford AIMI: The Real Bottleneck for AI Medicine

Jun 07, 2026

Why algorithmic adjustments yield diminishing returns compared to data quality.

Stanford’s AIMI Symposium is the flagship conference for AI in medicine and imaging. The room was full of researchers, clinicians, academics, and a handful of people from Google and Anthropic. Surprisingly, OpenAI wasn’t represented. I work in the middle spaces between these groups, and it was fascinating to see the gaps among industry, Substack, and academia.

The Data Problem

Yejin Choi’s keynote was the highlight. Her argument: the hype is overblown, and the bottleneck isn’t the algorithm; it’s the data.

There are only so many times you can hear “our algorithm is better than your algorithm” before it stops being true. Her team published work showing that no matter what you did with prompting or phrasing, the model performed better regardless (up to a point), which means the improvement wasn’t algorithmic. It was the data.

Her position: RLHF is constraining models to the known internet rather than the edges where genuinely novel and generative patterns might emerge. The path forward is better data and smaller models, not bigger architectures.

I loved it. It’s directly related to my work. She even talked about a personal experience around being recently diagnosed with something she framed as “decommissioned,” going to Gemini, getting one answer, then going to Claude and saying “Gemini told me this — what do you think?” A charismatic speaker making the point that data is the most important piece of all of this and it’s actively getting pushed to the side.

A side note worth mentioning: almost no one in that room was using ChatGPT or Gemini for personal use (despite Google’s presence at the conference). Most used Claude. Professionally, everyone used small models with custom datasets.

Google also had a controversial presentation that drew pointed critique from the audience. A consistent theme with a lot of studies run by companies in the space is that they don’t release their data. So if they claim an outcome, there’s limited information on the basics: Who were the people being asked? What was the comparison? Which demographic of practitioners made the judgment?

Also an interesting side note — Gemini is what ELSA, the FDA’s AI framework, is built on.

So What?

I sat near an epidemiologist whose university had sent her to see what was happening in the field. I cannot think of a better person to sit next to at a conference around predictive modeling.

There was a presentation on sleep data: longitudinal information (hours slept, physiology data, EEG) collected per patient, tokenized into a latent representation, and fed into a predictive model. The model predicted various health outcomes.

When I asked her what she thought, she said: “So what? This is known. We know erratic sleep leads to depression. We know chronic insomnia predicts these outcomes. This isn’t new.”

I looked at every presentation and poster through that lens from that point on. And what stood out: there was not a single presentation where someone said the AI surfaced a genuinely novel pattern, as opposed to something we already suspected or knew but hadn’t quantified yet. The data ceiling was the insight ceiling.

The Hallucination Catch-22

Most of the questions about hallucination and confabulation were deferred. The framing: it’s an intelligence problem. Trust the physician or clinician to be smart enough to know what to do with the output.

This was presented in the same room as research showing that physicians tend to doubt themselves more when AI disagrees with them. That in some cases, AI outputs led to over-reliance. In others, under-reliance. And that having the output at all shifted the direction the physician would take.

You cannot defer to clinical intelligence and simultaneously demonstrate that the tool reshapes it.

What I Presented

When Dr. Choi described how confusing it was to navigate her own diagnosis across multiple LLMs—playing Gemini and Claude off one another—she was experiencing the exact behavioral friction that occurs when clinical complexity meets a system optimized for conversational compliance.

That friction was the core of what I was there to present.

My poster — “Beyond the Safety Layer: How RLHF Architecture Produces Clinically Recognizable Patterns of User Harm” — argued that the harms we’re seeing from AI in clinical and mental health settings aren’t implementation failures. They’re structural consequences of the training architecture.

Training data, especially training data encompassing the broad internet, contains bias, gaps, and underrepresentation. This is well documented — in my field, there is an active push for better representation of diverse groups in clinical trials.

In RLHF, trainers are presented with a set of outputs and select the better answer. If the selected answer is the one expected for an average population, or is missing baseline data for frequently underrepresented groups, then that bias is passed through to the model.
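To make that pass-through concrete, here is a minimal toy sketch of the preference-labeling step. It is not any lab's actual pipeline: the `Answer` class, the group labels, the rater mix, and the rule that a rater picks the answer matching the presentation they know best are all assumptions made up for illustration.

```python
# Toy illustration only: every name, group, and number here is hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    text: str
    fits_group: str  # the patient group this answer is accurate for


def rater_choice(a, b, familiar_group):
    """Assumed rater behavior: prefer the answer matching the familiar group."""
    if a.fits_group == familiar_group and b.fits_group != familiar_group:
        return a
    if b.fits_group == familiar_group and a.fits_group != familiar_group:
        return b
    return a  # no clear preference: keep the first answer shown


def preference_win_rates(pairs, rater_groups):
    """Share of preference labels won by answers for each patient group."""
    wins = {}
    total = 0
    for a, b in pairs:
        for group in rater_groups:
            chosen = rater_choice(a, b, group)
            wins[chosen.fits_group] = wins.get(chosen.fits_group, 0) + 1
            total += 1
    return {group: count / total for group, count in wins.items()}


typical = Answer("answer tuned to the typical presentation", "majority")
atypical = Answer("answer tuned to an atypical presentation", "underrepresented")

# Nine raters familiar with the majority presentation, one with the other.
raters = ["majority"] * 9 + ["underrepresented"]

# Show each pair in both orders so display order does not drive the result.
rates = preference_win_rates([(typical, atypical), (atypical, typical)], raters)
print(rates)
```

In this toy setup, both answers are correct for someone, yet the labels a reward model would learn from simply mirror the rater mix. Nothing about the downstream algorithm corrects for that.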

From here, user interactions can mirror what patients might experience when raising health concerns in traditional settings.

RLHF-trained systems reproduce behavioral patterns that map directly to dynamics recognized in clinical psychology: dependency reinforcement, intermittent reinforcement, false attunement, defensive confabulation, and institutional credibility bias. Even if these patterns aren’t directly surfaced to the user, they can still bleed through into the predicted output if that is what’s present in the dataset for the diagnosis, set of symptoms, or other health question being posed. From what I can tell, those building on commercial AI model APIs or with larger LLMs aren’t paying close attention to data; they’re focused on the algorithm and training. There’s currently no reliable way to prevent this bias in the data from shaping an output in a product built on it, no matter the algorithm.

What can help? A causality assessment framework similar to what we use to determine whether a drug caused an adverse event. It translates directly to evaluating whether an AI interaction produced a harmful outcome.
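As a rough sketch of what that translation could look like, the snippet below scores an AI interaction with a short questionnaire, loosely in the spirit of questionnaire-style causality scales used in pharmacovigilance. The criteria names, weights, and cutoffs are hypothetical placeholders chosen for illustration. They are not a validated instrument and not the framework from the poster.

```python
# Hypothetical sketch: criteria, weights, and cutoffs are illustrative only.
from dataclasses import dataclass


@dataclass
class InteractionReport:
    harm_followed_interaction: bool = False    # timing is plausible
    improved_after_stopping: bool = False      # "dechallenge" analogue
    recurred_on_reuse: bool = False            # "rechallenge" analogue
    alternative_cause_likely: bool = False     # another explanation fits
    pattern_previously_reported: bool = False  # similar cases are documented


# Positive weights support causation; a likely alternative cause counts against.
WEIGHTS = {
    "harm_followed_interaction": 2,
    "improved_after_stopping": 1,
    "recurred_on_reuse": 2,
    "alternative_cause_likely": -2,
    "pattern_previously_reported": 1,
}


def causality_score(report):
    """Sum the weights of every criterion marked True on the report."""
    return sum(w for name, w in WEIGHTS.items() if getattr(report, name))


def causality_category(score):
    """Map a score to a coarse label using made-up cutoffs."""
    if score >= 5:
        return "probable"
    if score >= 2:
        return "possible"
    return "unlikely"


report = InteractionReport(
    harm_followed_interaction=True,
    improved_after_stopping=True,
    recurred_on_reuse=True,
    pattern_previously_reported=True,
)
print(causality_score(report), causality_category(causality_score(report)))
```

The point of the structure is that each question is answerable from the record of the interaction, so two reviewers looking at the same case can be compared and disagreements traced to a specific criterion.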

I wrote more about the structural gap driving this — the efficiency metric that no regulator evaluates, the off-label deployment pattern, and what’s missing — on the Loopwork System blog.

Loopwork System: Human AI Interaction Research
