In this post, we examine Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?, by Maxime Méloux, François Portet, Silviu Maniu, and Maxime Peyrard. In this work, the authors explore mechanistic interpretability: the act of assigning an algorithmic explanation to the input-output mapping learned by a neural network. All direct quotes are from the manuscript.

The authors identify two factors motivating mechanistic interpretability. Namely,

  • Observations of trained networks: “Mechanistic interpretability rests on the key assumptions that a neural network’s behavior can be explained by a simpler algorithm than the full network, and that a sparse subset of the network executes this algorithm. Previous research has given support to these assumptions: pruning studies and the lottery ticket hypothesis suggest that networks are often overparameterized, and only a fraction of neurons and connections are critical to the final performance.”

  • Human desire for narrative: “The assumption of explanatory unicity - the idea that there exists a single, unique explanation for a given phenomenon – is not only implicit in the practice of mechanistic interpretability but also rooted in human cognitive and psychological tendencies. Humans demonstrate a cognitive preference for coherent explanations that integrate disparate observations into a unified narrative. This preference aligns with the psychological need for cognitive closure… Multiple incompatible explanations disrupt coherence, leading to ambiguity and a sense of unresolved understanding.” This is in tension with the philosophy of science, wherein “explanatory pluralism acknowledges that the world is too complex to be fully described by a single comprehensive explanation.”

The authors do not mention a third benefit to mechanistic interpretability:

It is often taken for granted that an interpretable explanation can be assigned to a neural network, and even more so that this explanation is unique. While the desire for a unique explanation is very human, this paper is a proof by counterexample that such a unique explanation does not always exist. In particular, the authors train a small multi-layer perceptron (MLP) to learn a mapping that approximates an XOR on two binary inputs:

$$ \begin{array}{cc|c} x_1 & x_2 & \mathrm{XOR}(x_1,x_2) \\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ \end{array} $$

The network is small enough that the authors can exhaustively search over mechanistic interpretations of the learned mapping. Doing so reveals explanatory pluralism: multiple valid explanations coexist, each capturing a different computational abstraction of how XOR is implemented.

The paper also reviews how mechanisms are typically inferred in practice, distinguishing between what-then-where and where-then-what methodologies. They show, using the XOR example, that neither approach yields a unique mechanistic description. They further explore how this non-uniqueness depends on hyperparameters such as learning rate, network size, and task structure. I will not go into these details here; see the paper for a full treatment.

Instead, I will focus on their XOR example. The authors provide a Python notebook illustrating it here. I will modify their setup slightly by defining the network analytically rather than training it.

The Professor, the students, and the XOR

A professor trains an MLP to perform XOR. The network accepts a vector of two binary values, ${\bf x}\in\{0,1\}^2$, as input and outputs a scalar value, $y\in\mathbb{R}$. The network has a hidden layer of three neurons, ${\bf h}\in\mathbb{R}^3$. The network computes:

$$ \begin{align} {\bf h} &= \sigma(W_1{\bf x}+{\bf b}_1) \quad \in \mathbb{R}^3\\ y &= \sigma(W_2{\bf h}+b_2) \quad \in \mathbb{R}, \end{align} $$

where

$$\sigma(x) = (1+e^{-\lambda x})^{-1}$$

is a sigmoid nonlinearity and $\lambda=25$ is some large positive number. As $\lambda$ becomes large, the output of the sigmoid becomes approximately binary itself. In this limit, it becomes possible to approximate a number of logical gates with this activation function. For binary inputs and large $\lambda$:

$$ \begin{align} \sigma(x_1+x_2-1/2) &\approx \mathrm{OR}(x_1,x_2) \\ \sigma(x_1+x_2-3/2) &\approx \mathrm{AND}(x_1,x_2), \end{align} $$

and so on. We will use these definitions below.
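
To see these approximations concretely, here is a minimal numerical check (my own illustration, not from the paper's notebook), using $\lambda = 25$ as in the text:

import numpy as np

lam = 25  # the "large" lambda from the text

def sigma(x):
    # Sharp sigmoid: approximately a step function for large lam
    return 1 / (1 + np.exp(-lam * x))

# All four binary input pairs, one per position
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])

print(np.round(sigma(x1 + x2 - 0.5), 3))  # ~[0, 1, 1, 1] -> OR(x1, x2)
print(np.round(sigma(x1 + x2 - 1.5), 3))  # ~[0, 0, 0, 1] -> AND(x1, x2)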

Upon training the network to compute $y = \mathrm{XOR}(x_1,x_2)$, the professor finds that the weights converge to

$$ \begin{align} W_1 &= \begin{pmatrix}1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix} & {\bf b}_1 &= \begin{pmatrix}-1/2 \\ -1/2 \\ -3/2 \end{pmatrix} & W_2 &= \begin{pmatrix}1 & 1 & -2 \end{pmatrix} & b_2 &= -1/2. \end{align} $$

The state of the network can be summarized across all inputs as:

$$ \begin{array}{cc|ccc|c} x_1 & x_2 & h_1 & h_2 & h_3 & y \\ \hline 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 & 1 & 0 \\ \end{array} $$
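
As a quick check of the last row, propagate ${\bf x}=(1,1)$ through the network by hand:

$$ \begin{align} W_1{\bf x}+{\bf b}_1 &= \begin{pmatrix}1-\tfrac{1}{2} \\ 1-\tfrac{1}{2} \\ 2-\tfrac{3}{2}\end{pmatrix} = \begin{pmatrix}\tfrac{1}{2} \\ \tfrac{1}{2} \\ \tfrac{1}{2}\end{pmatrix} \;\Rightarrow\; {\bf h} \approx \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}, \\ W_2{\bf h}+b_2 &= 1+1-2-\tfrac{1}{2} = -\tfrac{1}{2} \;\Rightarrow\; y \approx 0. \end{align} $$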

Clearly, it implements XOR. But how is this computation facilitated? The professor gives the weights to her two graduate students, Alice and Bob, and asks them to determine whether they admit a mechanistic interpretation. A week later, both students return, each with their own interpretation of the network.

Alice

Alice notes that the hidden layer encodes two logical signals in its population activity:

$$ \begin{align} a_1({\bf x}) = h_1({\bf x})-h_3({\bf x}) &= \mathrm{AND}(x_1, \mathrm{NOT}( x_2))\\ a_2({\bf x}) = h_2({\bf x})-h_3({\bf x}) &= \mathrm{AND}(\mathrm{NOT}( x_1), x_2). \end{align} $$

She derives that the output neuron performs an OR on these two signals:

$$ \begin{align} y &= \sigma(W_2{\bf h}+b_2)\\ &= \sigma(h_1+h_2-2h_3-1/2)\\ &= \sigma(a_1 + a_2 -1/2)\\ &= \mathrm{OR}(a_1, a_2). \end{align} $$

Thus, she concludes, the network has implemented

$$ y = \mathrm{XOR}(x_1,x_2) =\mathrm{OR}(\mathrm{AND}(x_1, \mathrm{NOT}( x_2)),\mathrm{AND}(\mathrm{NOT}(x_1), x_2)). $$

It first computes ‘$x_1$ but not $x_2$’ and ‘$x_2$ but not $x_1$’, and outputs $1$ if either is true.
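
A quick way to sanity-check Alice's reading is to recompute her features directly from the idealized (large-$\lambda$) activation table above. The snippet below is my own illustration, not from the paper's notebook:

import numpy as np

# Idealized hidden activations from the table above; columns are the
# inputs (0,0), (0,1), (1,0), (1,1) and rows are h_1, h_2, h_3.
H = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 0, 0, 1]])
y = np.array([0, 1, 1, 0])

a1 = H[0] - H[2]  # [0, 0, 1, 0] -> AND(x_1, NOT(x_2))
a2 = H[1] - H[2]  # [0, 1, 0, 0] -> AND(NOT(x_1), x_2)
print(np.logical_or(a1, a2).astype(int))  # [0 1 1 0], matching y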

Bob

Bob has a different explanation. He notes that the hidden layer encodes two logical signals in its population activity:

$$ \begin{align} b_1 &= 1-h_3({\bf x}) = \mathrm{NAND}(x_1, x_2)\\ b_2 &= h_1({\bf x})+h_2({\bf x})-h_3({\bf x}) = \mathrm{OR}(x_1, x_2). \end{align} $$

He derives that the output neuron performs an AND on these two signals:

$$ \begin{align} y &= \sigma(W_2{\bf h}+b_2)\\ &= \sigma(h_1+h_2-2h_3-1/2)\\ &= \sigma(b_1+b_2 -3/2)\\ &= \mathrm{AND}(b_1, b_2). \end{align} $$

Thus, he claims, the network has implemented

$$ y = \mathrm{XOR}(x_1,x_2) =\mathrm{AND}(\mathrm{NAND}(x_1,x_2),\mathrm{OR}(x_1,x_2)). $$

It first computes $\mathrm{OR}(x_1,x_2)$ and $\mathrm{NAND}(x_1,x_2)$, and outputs $1$ if both are true.
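
The same sanity check works for Bob's reading (again my own illustration, with his features named `nand` and `either` to avoid clashing with the bias $b_2$):

import numpy as np

# Same idealized hidden activations and output as in Alice's check.
H = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 0, 0, 1]])
y = np.array([0, 1, 1, 0])

nand = 1 - H[2]              # [1, 1, 1, 0] -> NAND(x_1, x_2)
either = H[0] + H[1] - H[2]  # [0, 1, 1, 1] -> OR(x_1, x_2)
print(np.logical_and(nand, either).astype(int))  # [0 1 1 0], matching y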

Moral of the story

The students' interpretations are summarized here:

$$ \begin{array}{l|l|l} & \text{Hidden-layer features} & \text{Output gate} \\ \hline \text{Alice} & a_1 = \mathrm{AND}(x_1, \mathrm{NOT}(x_2)), \quad a_2 = \mathrm{AND}(\mathrm{NOT}(x_1), x_2) & y = \mathrm{OR}(a_1, a_2) \\ \text{Bob} & b_1 = \mathrm{NAND}(x_1, x_2), \quad b_2 = \mathrm{OR}(x_1, x_2) & y = \mathrm{AND}(b_1, b_2) \\ \end{array} $$

Both students are correct in their interpretations. And yet, the two interpretations of the mechanism are inconsistent with each other: in Alice's, the output neuron implements an OR gate; in Bob's, an AND gate. The identification of a single ‘true’ mechanism is therefore subjective, and depends on one's innate bias toward which set of features, $\{a_1, a_2, \mathrm{OR}\}$ or $\{b_1, b_2, \mathrm{AND}\}$, provides greater cognitive utility.

Humans seem to have an implicit bias for some interpretations over others. I posit that humans’ bias toward cognitive closure is actually a bias towards minimizing computational complexity: by favoring reusable building blocks that appear across many explanations—and by favoring explanations that use building blocks one already knows—humans are able to build models of the world with minimal cognitive overhead.

The moral of the story is that representing a function as a composition of modular pieces is a mathematically non-unique process. Consequently, so are the narratives we spin about how neural networks function. That is not to say that mechanistic interpretability is not useful. On the contrary, having a narrative that predicts function helps researchers keep pace with the ever-growing complexity of modern AI. It is simply that the narrative, whatever it may be, is likely one of many. In any case, interpretability remains pragmatic: it may not yield a uniquely correct narrative, but it can still reveal when a network has learned the wrong lesson.


Code for this network is given below:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class XORNet:
    """The 2-3-1 MLP with the analytically chosen weights from the text."""
    def __init__(self):
        # Hidden layer: h_1 ~ x_1, h_2 ~ x_2, h_3 ~ AND(x_1, x_2)
        self.W1 = np.array([
            [1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]
        ])
        self.b1 = np.array([-0.5, -0.5, -1.5])
        # Output layer: y ~ sigma(h_1 + h_2 - 2*h_3 - 1/2)
        self.W2 = np.array([[1.0, 1.0, -2.0]])
        self.b2 = np.array([-0.5])

    def forward(self, x, scale=25):
        # `scale` is the sigmoid gain (lambda in the text); large values
        # make the activations approximately binary.
        h = sigmoid(scale * (self.W1 @ x + self.b1[:, None]))
        y = sigmoid(scale * (self.W2 @ h + self.b2[:, None]))
        return h, y

# Evaluation: each column of `inputs` is one input vector (x_1, x_2)
net = XORNet()
inputs = np.array([
    [0, 0, 1, 1],
    [0, 1, 0, 1]
])
H, outputs = net.forward(inputs, scale=25)
expectation = inputs[0, :] ^ inputs[1, :]  # ground-truth XOR
assert np.allclose(outputs.ravel(), expectation, atol=1e-3)

# Print the resulting truth table
table = np.vstack((inputs, H, outputs)).T.round(decimals=2)
headers = ["x_1", "x_2", "h_1", "h_2", "h_3", "y"]
print("".join(f"{h:<8}" for h in headers))
print("-" * 48)
for row in table:
    print("".join(f"{val:<8.2f}" for val in row))