AI Advancing Faster Than Our Ability to Understand It

Microsoft's Eric Horvitz and EPFL's Robert West warn that AI is advancing faster than our ability to understand it. Three trends are making AI more opaque: AI judges scoring other models, multi-agent AI societies, and LLMs that learn about humans while remaining inscrutable themselves. The authors call for new interpretability benchmarks.

The Real Problem

We built the most powerful information processing systems in human history. And we cannot fully explain how they work.

[ STAT ] A striking asymmetry follows: while human understanding of AI declines, AI understanding of humans deepens, producing new forms of behavioral opacity. — Horvitz and West, June 2026

This is not an academic concern. AI systems already influence how people search for information, make decisions, and form judgments about each other. When an LLM recommends a medical treatment, denies a loan application, or evaluates a job candidate, the reasoning behind that decision is largely opaque. The person affected by it cannot appeal the logic because there is no logic to point to — only billions of weighted connections.

The challenge resembles early neuroscience. Scientists can observe what AI systems do but cannot trace the causal path from input to output. As capabilities accelerate, the gap between what AI can do and what we can explain is widening, not closing.

What This Actually Does

Horvitz and West identify three specific trends that are making AI harder to understand, each with distinct mechanisms.

The first trend is AI evaluating AI. Systems called AI judges now score model outputs for helpfulness, rank competing responses, detect hallucinations, and assess safety. Constitutional AI trains models to critique and revise their own responses against a written set of principles, then reinforces the preferred behavior through reinforcement learning from AI feedback. AI debate frameworks pit multiple models against each other before a human adjudicates. Researchers are building automated interpretability tools where one AI describes the neurons and circuits of another AI. Using AI to solve an AI-induced problem creates a paradox: if AI-generated explanations become too complex for humans to verify, opacity compounds rather than resolves.

The second trend is the rise of AI societies. Networks of interacting AI agents are increasingly deployed for complex tasks like scientific research and drug discovery. As these systems become more sophisticated, their internal communication can drift from human language, making them harder for humans to audit. The authors suggest studying these interactions with methods borrowed from sociology to detect emergent norms and hidden coordination patterns.
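
One simple starting point for the sociology-style analysis mentioned above is to record every inter-agent message in a readable format and count who talks to whom. The sketch below is only an illustration under stated assumptions: the agent names, the message texts, and the helpers `log_message` and `interaction_counts` are all hypothetical, and nothing here comes from the Horvitz and West piece itself.

```python
import json
from collections import Counter

def log_message(log_lines, sender, receiver, text):
    """Append one inter-agent message as a human-readable JSON line."""
    record = {"sender": sender, "receiver": receiver, "text": text}
    log_lines.append(json.dumps(record))

def interaction_counts(log_lines):
    """Count directed sender -> receiver messages, a basic network-analysis input."""
    counts = Counter()
    for line in log_lines:
        record = json.loads(line)
        counts[(record["sender"], record["receiver"])] += 1
    return counts

# Hypothetical exchange between two agents in a drug-discovery pipeline.
log = []
log_message(log, "planner", "chemist", "Propose three candidate compounds.")
log_message(log, "chemist", "planner", "Candidates A, B, C attached.")
log_message(log, "planner", "chemist", "Rank them by predicted toxicity.")

counts = interaction_counts(log)
for (sender, receiver), n in sorted(counts.items()):
    print(f"{sender} -> {receiver}: {n}")
```

A log like this only helps while the messages themselves stay readable. If agents drift away from human language, the counts still show who coordinates with whom, but the content would need separate analysis.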

The third trend is the most pervasive: LLMs permeate daily life. ChatGPT, Claude, and Gemini interact with millions of people daily. They learn about humans through training data and conversation. They build models of human psychology — fear, anxiety, happiness, social belonging. While humans struggle to understand AI, AI systems are building increasingly sophisticated models of who we are.

Who This Is Built For

For AI researchers working on mechanistic interpretability: the paper provides a framework for why your work matters and how it connects to the three specific opacity trends. Use it to justify research directions and funding requests.

For AI policy professionals: Horvitz and West identify structural problems in how we evaluate and deploy AI systems. Their analysis supports arguments for mandatory interpretability standards in AI regulation.

For enterprise CTOs and AI procurement teams: understanding that AI systems can fail in ways we cannot predict changes risk management. The paper supports requiring model cards, behavioral testing, and ongoing monitoring as procurement conditions.

How It Runs: Step by Step

This is not a workflow but a research agenda. Here is how the authors propose to address each opacity trend.

For AI-judge opacity. Develop standardized benchmarks that measure whether an AI judge's evaluations are consistent with human expert judgments across diverse inputs. Create tools that can audit an AI judge's reasoning process without relying on the model itself.
For AI society opacity. Build monitoring frameworks that log inter-agent communications in human-readable format. Apply methods from sociology and organizational theory to detect emergent behaviors in multi-agent networks. Require transparency records for any deployed multi-agent system.
For LLM opacity in daily life. Expand mechanistic interpretability research to cover the specific behaviors that affect users: truthfulness, bias, refusal patterns, and sycophancy. Anthropic's circuit tracing work and DeepMind's Gemma Scope 2 are cited as promising directions.
Cross-cutting solution. Establish norms of responsible disclosure for interpretability findings. Create a shared benchmark suite that measures both model capability and model intelligibility, so progress on one is not reported without the other.
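
To make the AI-judge item above concrete, here is a minimal sketch of one way to quantify whether a judge's verdicts are consistent with human expert judgments. It uses Cohen's kappa, a standard chance-corrected agreement statistic. The choice of kappa is an assumption made for illustration, not a metric the authors name, and the labels are invented example data.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement if each rater labeled independently,
    # following their own label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical verdicts on six model responses.
judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 means the judge tracks the human raters closely, while a value near 0 means agreement is no better than chance. A real benchmark would also need many items, several human raters, and coverage of diverse inputs, as the list above notes.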

Setup and Tools

The interpretability ecosystem relies on these tools and methods:

Mechanistic Interpretability → Reverse-engineering neural network circuits to understand how specific behaviors arise. Anthropic is the leading organization in this area.
Sparse Autoencoders → Decompose neural activations into a larger set of sparsely active features that are easier to interpret. DeepMind's Gemma Scope 2 provides pre-trained autoencoders for Gemma models.
AI Psychology → Treating models as participants in behavioral studies to map their capabilities, biases, and failure modes without needing full circuit-level understanding.
Behavioral Benchmarks → Standardized tests that probe specific capabilities and failure modes. The authors call for new benchmarks that measure intelligibility, not just performance.
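
For readers unfamiliar with sparse autoencoders, the following is a toy sketch of the basic idea: encode an activation vector into a wider, non-negative feature vector, decode it back, and penalize both reconstruction error and feature activity. It assumes a plain ReLU encoder with an L1 penalty and random untrained weights. The class name `TinySAE`, the sizes, and the coefficient are hypothetical, and this is not a reproduction of Gemma Scope 2 or any released autoencoder.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class TinySAE:
    """Toy sparse autoencoder with random, untrained weights."""

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model activations to a wider d_features space.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        # Decoder maps features back to the original activation space.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, activation):
        pre = [p + b for p, b in zip(matvec(self.W_enc, activation), self.b_enc)]
        return [max(0.0, x) for x in pre]  # ReLU keeps features non-negative

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, activation, l1_coeff=1e-3):
        features = self.encode(activation)
        recon = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
        sparsity = sum(features)  # L1 norm, since features are non-negative
        return mse + l1_coeff * sparsity

activation = [0.5, -1.0, 0.25, 2.0]  # stand-in for one model activation vector
sae = TinySAE(d_model=4, d_features=16)
features = sae.encode(activation)
reconstruction = sae.decode(features)
total_loss = sae.loss(activation)
```

Training would adjust the weights to lower this loss over many activations, so that each input is explained by only a few active features. Those features are what researchers then try to label and interpret.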

Gotcha: Mechanistic interpretability currently requires substantial expertise and compute. Analyzing a single circuit in a 70B-parameter model can take weeks and thousands of GPU-hours. The field needs automation before it can scale to production systems.

The Numbers

▸ AI judges now score the majority of public LLM evaluation leaderboards, replacing human raters.
▸ Multi-agent AI systems are deployed in drug discovery, materials science, and automated software development, all domains where errors have real-world consequences.
▸ LLMs interact with hundreds of millions of users daily, learning from each interaction to refine their models of human behavior.
▸ Mechanistic interpretability has identified specific circuits for factual recall, indirect object identification, and refusal behavior, but has not yet scaled to cover the full range of model capabilities.

What It Cannot Do

The paper does not propose a complete solution. It identifies the problem and recommends directions but acknowledges that interpretability at scale remains unsolved.
The call for better benchmarks will take years to implement. Current benchmarks measure capability (MMLU, GSM8K) but not intelligibility. Designing and validating interpretability benchmarks is itself a research problem.
Commercial incentives work against interpretability. Companies benefit from deploying capable models quickly; interpretability research slows deployment and increases cost without directly improving user experience.

Start in 10 Minutes

(5 min) Read the full Horvitz and West piece at singularityhub.com to understand the three opacity trends with specific examples.
(5 min) Explore Anthropic's open-source circuit tracer at github.com/anthropic to see what mechanistic interpretability looks like in practice.
(10 min) Try DeepMind's Gemma Scope 2 on a small Gemma model to visualize learned features using sparse autoencoders.
(15 min) If you deploy AI systems, audit your evaluation pipeline: do you use AI judges? Can you trace the reasoning behind every automated decision affecting users?

The paper focuses on understanding AI but does not address what happens if we succeed. If mechanistic interpretability scales effectively, we still face the question of what to do when we discover that a model has learned deceptive or dangerous patterns. Detection without intervention capability creates its own category of risk.

Despite the unanswered questions, the value of the Horvitz and West analysis is in naming the problem precisely. The asymmetry between AI understanding humans and humans understanding AI is not a future possibility. It is happening now, and the gap is widening with each new model release.

Frequently Asked Questions

Q: Why is AI becoming harder to understand? A: Three specific trends are driving this: AI systems evaluating other AI systems creates a closed loop, multi-agent networks develop opaque internal communications, and LLMs build sophisticated models of humans while remaining inscrutable themselves. The gap between AI capability and human understanding is widening.

Q: What is mechanistic interpretability and why does it matter? A: Mechanistic interpretability is the field of reverse-engineering neural networks to understand how they compute their outputs. It matters because without it, we cannot reliably predict when AI systems will fail, whether they are deceiving us, or whether their safety training is robust.

Q: Who is Eric Horvitz and why should we take his warning seriously? A: Eric Horvitz is Chief Scientific Officer at Microsoft and co-founder of the One Hundred Year Study on AI. His perspective combines deep technical knowledge of AI systems with decades of research on AI's societal implications.

Q: Are any companies successfully addressing AI interpretability? A: Anthropic is the leader in this area. Their researchers have mapped circuits for specific behaviors in Claude, identified features corresponding to concepts, and released open-source interpretability tools. DeepMind's Gemma Scope 2 and OpenAI's work on explainable reasoning steps are also significant contributions.

Q: Does AI interpretability regulation exist anywhere today? A: The EU AI Act includes provisions for transparency and documentation of high-risk AI systems. The US has no federal interpretability mandate. California's SB 53 and New York's RAISE Act include some transparency requirements but do not mandate mechanistic interpretability specifically.