The people who build today’s most capable artificial intelligence can describe exactly how they train it. They can write down the architecture, the optimisation procedure and the objective the system is rewarded for meeting. What they cannot do, in any complete way, is explain how a finished model arrives at many of its particular answers. The training is understood. The thing the training produces is not.
This is not a sensational claim, and it is not the same as saying nobody knows how AI works. It is a narrower and stranger point. We understand the process that grows these systems far better than we understand the systems themselves.
What is understood, and what is not
A large language model is not programmed in the ordinary sense. Engineers specify a network architecture, usually a transformer, and an objective, usually predicting the next piece of text, and then run an optimisation process over enormous amounts of data. The model’s internal settings, its billions of numerical parameters, are not written by anyone. They are found by the training process.
The result is closer to something grown than something built. The capabilities that emerge, and the internal representations the model uses to produce them, are a product of the optimisation rather than a design anyone laid out in advance. Reading a trained model’s weights tells you almost nothing directly, much as a list of the strength of every connection in a brain would not tell you what the person is thinking. The recipe is legible. The dish is not.
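The contrast between a legible recipe and an illegible result shows up even at toy scale. The sketch below is an illustration only, not how any production model is built: it assumes a tiny character-level predictor with one number per pair of characters, trained by plain gradient descent on a next-character objective. The names train_bigram and predict_next and the training string are hypothetical choices made for this example.

```python
import math
import random


def softmax(row):
    """Turn a row of raw scores into probabilities that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(text, steps=200, lr=0.5, seed=0):
    """Fit a next-character predictor by gradient descent on cross-entropy."""
    vocab = sorted(set(text))
    index = {ch: i for i, ch in enumerate(vocab)}
    pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]
    n = len(vocab)
    rng = random.Random(seed)
    # The parameters: one score per (current character, next character) pair.
    # Nobody writes these values. They start random and training adjusts them.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]
    losses = []
    for _ in range(steps):
        grad = [[0.0] * n for _ in range(n)]
        total_loss = 0.0
        for a, b in pairs:
            probs = softmax(weights[a])
            total_loss -= math.log(probs[b])
            for j in range(n):
                # Gradient of cross-entropy with respect to each score.
                grad[a][j] += probs[j] - (1.0 if j == b else 0.0)
        for i in range(n):
            for j in range(n):
                weights[i][j] -= lr * grad[i][j] / len(pairs)
        losses.append(total_loss / len(pairs))
    return vocab, weights, losses


def predict_next(vocab, weights, ch):
    """Return the character the trained model scores highest after ch."""
    row = weights[vocab.index(ch)]
    return vocab[row.index(max(row))]


# The whole recipe fits on one screen. The output is a grid of numbers.
vocab, weights, losses = train_bigram("abcabcabcabc")
print("loss at start and end:", losses[0], losses[-1])
print("after 'a' the model predicts:", predict_next(vocab, weights, "a"))
print("learned weights:", weights)
```

With nine parameters the learned grid can still be read by eye. The point of the sketch is that nothing in the procedure changes as the grid grows to billions of entries, while the ability to read it does.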
The field trying to open the box
There is a research programme aimed squarely at this gap, known as mechanistic interpretability, which tries to reverse-engineer the internal computations of neural networks into human-readable terms. Its modern form is closely associated with the researcher Chris Olah and the teams around him at Google, OpenAI and Anthropic. The early work, on image models, identified individual artificial neurons that appeared to detect specific concepts, before the approach moved to language models.
The progress since has been real and partial in equal measure. Using a technique called sparse autoencoders, researchers have pulled out interpretable internal “features” corresponding to concepts and demonstrated control over them. The most memorable demonstration came in 2024, when Anthropic produced a version of its model fixated on the Golden Gate Bridge by amplifying a single feature. In 2025, the same group introduced attribution graphs, a way of tracing the pathway a model takes from input to output through interpretable components, work that IBM researchers have connected to their own interpretability efforts. A 2026 survey in ACM Computing Surveys, Bridging the Black Box, lays out the field’s methods across neurons, circuits and whole algorithms.
What none of this yet amounts to is a full reading of a frontier model. The teams behind the circuit-tracing work are explicit that their methods capture only a fraction of a model’s computation, even on simple prompts. Some internal circuits turn out clean and interpretable. Others are tangled and resist explanation.
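The idea of amplifying a single feature can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Anthropic's code: the dictionary of feature directions here is hand-picked rather than learned (a real sparse autoencoder is trained to reconstruct a model's activations under a sparsity penalty), the activation has two dimensions rather than thousands, and the names encode, decode and steer are hypothetical.

```python
# Hand-picked feature directions in a 2-dimensional activation space.
# In a real sparse autoencoder these would be learned, not written down.
DECODER = [
    [1.0, 0.0],  # feature 0
    [0.0, 1.0],  # feature 1
    [0.6, 0.8],  # feature 2, the one amplified below
]
ENCODER = [row[:] for row in DECODER]
BIAS = [-0.5, -0.5, -0.5]  # negative bias keeps weakly matching features at zero


def encode(activation):
    """Map an activation to non-negative feature strengths (ReLU encoder)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(ENCODER, BIAS)
    ]


def decode(features):
    """Rebuild an activation as a weighted sum of feature directions."""
    dim = len(DECODER[0])
    return [
        sum(features[k] * DECODER[k][i] for k in range(len(features)))
        for i in range(dim)
    ]


def steer(activation, feature, value):
    """Pin one feature to a chosen strength and return the edited activation.

    The part of the activation the dictionary fails to reconstruct is added
    back unchanged, so only the chosen feature's contribution moves.
    """
    features = encode(activation)
    error = [x - r for x, r in zip(activation, decode(features))]
    features[feature] = value
    return [r + e for r, e in zip(decode(features), error)]


x = [1.0, 0.2]
steered = steer(x, feature=2, value=5.0)
print("feature strengths before:", encode(x))
print("feature strengths after: ", encode(steered))
```

In a real system the edited activation would be passed back into the model's remaining layers, which is what changes the text it produces. Carrying the unexplained remainder through untouched is also a reminder of the limit described above: the dictionary accounts for only part of what the activation contains.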
A familiar pattern, sharpened
Using a tool well before understanding it is not new, and it is worth resisting the temptation to treat AI as the first case. The steam engine was in widespread industrial use for decades before thermodynamics explained what it was doing. Aspirin was prescribed for most of a century before its mechanism was worked out in the 1970s. General anaesthesia is administered millions of times a year, and how it produces unconsciousness is still not fully settled.
In each case the technology was reliable enough to trust on the evidence of its results, while the explanation lagged behind. AI fits that pattern, with two differences that make the gap more pointed. The tool is general rather than specialised, applied across writing, coding, medicine, law and research at once. And it is being deployed into consequential decisions at a speed and scale that leave little time for the understanding to catch up.
How much the gap matters is itself disputed
This is where careful people disagree. Dario Amodei, the chief executive of Anthropic, argued in an essay titled The Urgency of Interpretability in 2025 that understanding what models are doing internally is a task of overriding importance, and framed the situation as a race in which capability is currently ahead of understanding. On that view, deploying systems far more capable than today’s without being able to inspect their reasoning is a serious risk.
Others are less convinced that full mechanistic understanding is either necessary or achievable. They point out that we manage many complex systems through behaviour rather than internals, and argue that rigorous testing, evaluation and red-teaming may be a more practical guardrail than a complete circuit-level account that may never arrive for systems this large. Both positions take the opacity seriously. They differ on what follows from it.
The point on which they tend to agree is the factual one underneath the debate. Capability is, for now, running ahead of explanation, and the question is whether that gap is merely uncomfortable or genuinely dangerous.
What to watch
The realistic test over the next few years is not whether anyone will fully understand a frontier model, which looks unlikely, but whether interpretability can deliver enough partial understanding to be useful: enough to catch a deceptive or unsafe pattern before deployment, or to explain a specific failure after it. Some labs have begun running interpretability checks as part of pre-release safety analysis, which is a concrete sign of the tools maturing from curiosity into practice.
The broader thing to track is the direction of the gap. If understanding advances faster than capability, the opacity becomes a temporary stage in a maturing technology. If capability keeps pulling ahead, then a growing share of what societies rely on will continue to rest on tools that work for reasons their own makers cannot yet fully give.