The AI Black Box: Why the "Mystery" Remains

Link to Anthropic paper: https://www.anthropic.com/research/emotion-concepts-function
The discourse around this paper by Anthropic took off in two directions: either this is proof the model is conscious, behaves like a mind, or has feelings; or this is proof that it’s all just statistical probabilities and numbers.
In sum, the paper concludes that large language models like Claude have internal representations of abstract concepts such as emotions, and that those internal representations can subtly influence the outputs the model generates.
But if we stay focused on what this is evidence of, we miss a mechanism that can explain what’s happening in a large number of long-form chats. And if we miss the mechanism, we can’t build the governance or safety measures that would prevent context rot, drift, and spiraling, the kind of measures that would actually help users in the long run.
The AI Black Box
In this post, I compared AI usage to off-label usage of medications in the medical field. The largest gap in the analogy is that medications are regulated. There are governing bodies, authority committees, and testability in the form of clinical trials that allow us to narrow in on mechanism, risks and benefits, and long-term effects. The key word there is mechanism. Medications are treated for what they are: products and medical devices that cause measurable changes.
One part of designing trials and safeguarding patients is providing a standard-of-care alternative and writing clear risk-benefit language for subjects. If a new medication comes onto the market that is markedly better than the treatment being studied, the care team is obligated to offer it to the subject. Even in a trial where there is no alternative, as is the case in rare disease and oncology trials, there is an obligation to offer the treatment, within limits, as an open-label option if the benefits are significant for the disease.
For AI, this is not currently possible because of the mystery around LLMs.
The Coca-Cola Secret Formula
Another extremely popular mystery that has existed for years? The recipe for Coca-Cola.
There’s a beautiful narrative that only a handful of people know the recipe and they can’t all be on the same plane at the same time, or else the recipe dies with them.
Except we have modern food science and analytical tools like chromatography and mass spectrometry that can break down the percentages of ingredients with no problem. It might take time, and maybe some money, but it’s entirely feasible to learn the exact formulation.
So what’s actually driving the continued mystery?
Good marketing and legal liability.
If I did the work to find out what the recipe is, I could replicate it and open myself up to legal liability since the recipe is protected property.
So the mystery stays: who would risk the legal exposure for such a small payoff? The mystery isn’t for the consumer; it’s for the competitor and the regulator. If the recipe were public, it would be a commodity subject to price wars. The 'legend' is a hedge against the 'utility' classification. The product stays a legend. Just buy the Coke for $1.50 and move on.
For AI, the “mystery” is driven by the same forces. If the AI ‘recipe’ (the weights and RLHF targets) were public, it would be a utility subject to safety audits.
On Legal Liability
Right now, different models are black boxes. I’ve read several pieces on Substack alone about how we can’t point to a specific setting, but we know that Gemini’s outputs are different from Claude’s outputs. It almost seems like their “personalities” sound different: Gemini as the overeager intern, Claude as the submissive, distant scholar.
If you try to reverse engineer why a model answers a specific way, you get the same response from any of the three large commercial models: some form of “I don’t know why I generated that output.”
Combined with press, media, and sensationalist stories about users’ varying outcomes, this response can be interpreted two ways:
I don’t know, therefore I behave like a mind. Ask philosophers, cognitive scientists, or neuroscientists, and you’ll find the debate is still open about how humans form their own thoughts. So this fits as a “fill in the blank” explanation.
I don’t know, therefore this is a model design issue. We can say we did something to tuning/weights/agreeableness and move along.
Either way, the response leads to the same behavior: everyone moving on from why the model responded that way.
The answer might be the simplest and least sexy one: trade secrets and legal liability.
If Gemini or Claude or GPT could tell you exactly how it formed an output, it could be revealing the exact weights, parameters, and whatever special sauce the engineering team used to fine-tune the model. That is a massive legal liability issue.
The potentially more nefarious part is that if we’re busy focusing on what the output could mean about the model, we ignore that the model is a product.
The mystery around why a model answers a specific way gets lost in the debate over what it is and how we can’t define it. That means there are few avenues to take when litigating matters surrounding a product we don’t even recognize as a product.
If I were looking into the regulation or liability of a drug product, I’d have a specific set of criteria to meet (I’m not a lawyer, so this is top-level; lawyers, correct me):
Is there a safer alternative that wasn’t offered?
Did we know of harm or an effect and fail to disclose it?
When harm was identified, did we disclose it in a timely manner? And perform due diligence?
But if an LLM is functioning as a black box for output, that creates a legal liability hurdle that only benefits one side: the companies that built the product.
In any of the litigation cases happening now, how can you prove there is a safer alternative if you can’t even say how a model produced an output?
The answer is vague: it was something within the 1.8 trillion weights and parameters that generated an output that isn’t parsable by a human reader.

There isn’t a bad line of code to point to. The model “hallucinated” an answer and we don’t know why. The 'mystery' is the only thing keeping them in the safe harbor of being a 'platform' rather than a 'publisher.'
Or, shift the frame: the user was delusional, hallucinating, or vulnerable. The user steered the model into a corner. Any papers that come out now, like the Anthropic one, are further support for this framing, not against it.
It works the same way the mysterious “Algorithm” does. The algorithms behind YouTube and TikTok have weights, parameters, and optimization targets. But when you say algorithm, it doesn’t register that way — it registers as the great algorithm in the sky, the mystery of how this ended up on my feed.
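To make the comparison concrete, here is a minimal, hypothetical sketch of what a feed-ranking “algorithm” reduces to in practice. The feature names, weights, and penalty below are invented for illustration; this is not any platform’s actual ranking system.

```python
# Hypothetical feed-ranking sketch: "the Algorithm" as plain weights,
# features, and an optimization target. All numbers are made up.
videos = [
    {"id": "a", "watch_time_pred": 0.8, "click_pred": 0.6, "report_pred": 0.01},
    {"id": "b", "watch_time_pred": 0.3, "click_pred": 0.9, "report_pred": 0.20},
]

# Learned weights encode the optimization target (here: predicted engagement,
# penalized by predicted reports).
weights = {"watch_time_pred": 2.0, "click_pred": 1.0, "report_pred": -5.0}

def score(video):
    return sum(weights[k] * video[k] for k in weights)

feed = sorted(videos, key=score, reverse=True)
print([v["id"] for v in feed])  # the order you see is just this arithmetic
```

There is nothing mystical in that loop; the mystery lives in the scale and the secrecy of the weights, not in the math.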
Finally, the mystery itself is part of the appeal. People are much more likely to pay for things they don’t fully understand. If I don’t know exactly why something does what it does, and I hear about the benefits, I’m more likely to pay for the mystery than for the math.
The Interaction Layer
If we remove the mystery and shift focus away from what the model might be, we can look at the paper through a new lens.
The paper describes an internal mechanism by which the model’s outputs can be influenced by emotional representations.
I’ve talked about the conduction hypothesis before in this post, but the short version is this: the model acts as a medium for conducting high-density human signal bidirectionally, from its training datasets and from the user’s own behavioral signature.
Relevant to this paper, Anthropic is describing how the user’s input determines the pathway the model takes to generate its output. That pathway can include realized emotional data, which makes sense given that the datasets are dense human signal that naturally includes emotional content. It’s not that the model is choosing or deciding. It’s that the path it takes to complete the pattern contains a signal that needs to be matched with an emotional representation.
The researchers frame ‘internal representations’ as a scientific discovery about the model’s ‘inner life.’ But from a governance perspective, these are just latent variables. The paper proves that these variables are identifiable and steerable. If you can map the emotion concepts, you can govern them. Choosing not to is a design decision, not a scientific mystery.
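As a rough illustration of what “identifiable and steerable” means in practice, here is a minimal sketch of activation steering on a toy network. The tiny model, the random stand-in inputs, and the difference-of-means “emotion direction” are all assumptions for demonstration; this is not Anthropic’s method or Claude’s architecture.

```python
# Minimal sketch: locate a "concept direction" in hidden activations,
# then nudge the model along it at inference time. Toy model, made-up data.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one hidden layer of a language model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

def hidden_activations(x):
    """Activations at the intermediate layer (after the ReLU)."""
    return model[1](model[0](x))

# 1) IDENTIFY: estimate a concept direction as the difference of mean
#    activations between "emotional" and "neutral" inputs (a linear probe
#    in its simplest form). Random tensors stand in for real embeddings.
emotional_inputs = torch.randn(64, 16) + 0.5   # pretend: emotional text
neutral_inputs   = torch.randn(64, 16) - 0.5   # pretend: neutral text
concept_direction = (
    hidden_activations(emotional_inputs).mean(0)
    - hidden_activations(neutral_inputs).mean(0)
)
concept_direction = concept_direction / concept_direction.norm()

# 2) STEER: add the direction at that layer with a forward hook,
#    shifting downstream outputs without retraining anything.
steering_strength = 2.0

def steer(module, inputs, output):
    return output + steering_strength * concept_direction

handle = model[1].register_forward_hook(steer)
x = torch.randn(1, 16)
steered = model(x)
handle.remove()
unsteered = model(x)

print("output shift from steering:", (steered - unsteered).norm().item())
```

The point is that once a concept direction can be located in the activations, amplifying or suppressing it is a design lever, which is exactly why leaving it ungoverned is a choice rather than a mystery.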
If we stay focused on what emotional representations are evidence of, then we need to account for conversations where that understanding didn’t protect the user. And that’s a moral agency question that doesn’t have an answer — so we stay in the same spot, spinning our wheels. If we say it’s just a matter of weights and parameters, then we let companies wipe their hands and walk away.
When we look at the model through the lens of it conducting human signal, it doesn’t collapse into a moral agency argument or a legal liability shield. It leaves room for harm and liability, and for the studies, empirical evidence, and mechanistic work needed to make informed design choices and to better understand what influence the user’s signal has.

