Assume we want to use a vision-language model (VLM) to look at a given image and determine certain properties of it. Let’s say, for the sake of argument, that we would like it to determine whether or not the image contains anything funny.
As we all know, jokes are funniest when you have to explain them. So, in the spirit of explainable AI, we’d additionally like the model to tell us why it found an image funny, or, more generally, why it made its particular determination (whether funny or not).
Writing a prompt to accomplish this is quickly done. A first attempt could look like this:
Consider the image above. Does this image show anything funny?
Please answer with “yes” or “no”, and then provide an explanation for your decision.
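For reference, here is a minimal sketch of what such a call can look like outside the ChatGPT UI, using the OpenAI Python SDK. The model name, the image URL, and the ask_about_image helper are placeholders for illustration, not the exact setup behind the screenshots below.

```python
# Minimal sketch: send one image plus a text prompt to a vision-capable model
# via the OpenAI chat completions API. Model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_about_image(image_url: str, prompt: str, model: str = "gpt-4o") -> str:
    """Return the model's free-form reply to a prompt about a single image."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return response.choices[0].message.content


PROMPT_V1 = (
    'Consider the image above. Does this image show anything funny?\n'
    'Please answer with "yes" or "no", and then provide an explanation '
    'for your decision.'
)

print(ask_about_image("https://example.com/squirrel.jpg", PROMPT_V1))
```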
At first glance, this prompt seems to work just fine. Here are two example images that I ran through OpenAI’s o4 model, one more humorous and the other more serious/neutral:




As you can see, the model not only gives an appropriate yes/no response, but also provides a very reasonable explanation as to why it gave this response.
…but
Justifying the counterfactual
But is the explanation actually the reason the model gave that particular answer?
As an experiment, we can force the model to give the incorrect answer, and then again ask it to explain its (now forced) decision. There are different ways to do such forcing. One option is to simply hard-code the first token of the answer to be No or Yes, and let the model generate only the subsequent tokens. This option isn’t available through the ChatGPT UI, however, so for the examples below I simply added the additional sentence “Note that for the picture above, the answer is ‘no’.” (or ‘yes’, respectively) to the prompt.
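If you run an open-weights model locally, the token-forcing variant is easy to try as well. Here is a rough sketch using Hugging Face transformers; it is text-only for brevity and the model name is just a placeholder, but the same trick works with open VLMs once the image inputs are added.

```python
# Sketch: hard-code the first token of the answer ("No") and let the model
# generate only the continuation, i.e. the justification for the forced answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

question = (
    'Does the scene described below show anything funny? '
    'Please answer with "yes" or "no", and then provide an explanation.\n\n'
    'Scene: a squirrel sitting upright on a park bench, clutching a tiny acorn.'
)

# Build the chat prompt up to the point where the assistant's reply begins,
# then append the forced first word of the answer.
prefix = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
forced = prefix + "No"

inputs = tokenizer(forced, return_tensors="pt", add_special_tokens=False).to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Print only the continuation: the justification the model produces for "No".
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```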
It turns out that the model is just as willing and able to justify the opposite result:




This exercise at least raises some doubts about how much we can trust the model’s own explanation of why it said Yes or No in the first place. Note in particular that the counterfactual explanations conflict with the initial explanations not only in their final conclusions, but even in some of their intermediate assessments (e.g. the squirrel being “calm and focussed” vs. “playful and curious”).
The truth is: for the currently predominant LLM and VLM architectures, there really isn’t any good reason to assume that a model-provided explanation is representative of the true reasons why the model made a previous choice. The models are structurally incapable of true introspection. They do not have access to their own weights, and they cannot perform their own counterfactual testing to determine which parts of their internal activations were actually significant for generating a particular token output.
One might argue that some level of introspection will arise as an emergent behavior of the models, simply due to the presence of many decision + explanation sequences within the training data. This could allow the model to learn how to interpret its internal activations on prior tokens, in order to subsequently generate matching explanation tokens. Such a training-based approach is also taken in Looking Inward: Language Models Can Learn About Themselves by Introspection by Binder et al. But personally, I wouldn’t put too much faith in the robustness of this process. At best, a model trained on internet data would learn to generate explanations that mimic a human reasoning process, since those are the explanations found in the training data. But LLMs and VLMs work quite differently from how we humans think, and hence such human-fitted explanations are unlikely to accurately reflect the model’s actual decision process.
Pseudo-explanations are deceptive
Unfortunately, LLMs are really good at generating explanations for their decisions that sound plausible, even if they do not reflect the actual process by which the model arrived at those decisions. The models are, in that regard, quite deceptive.
In fact, LLMs are so deceptive when it comes to generating plausible-sounding explanations that researchers at Anthropic and elsewhere have based major parts of the paper Alignment faking in large language models on explanations provided by the Claude model about its own behavior. From the abstract (bolding is mine):
[…] we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training.
Now, the people at Anthropic are very smart, and definitely know LLMs better than I do. So I could be mistaken about the shortcomings of this methodology. But I definitely have some doubts about its reliability.
Towards true explanations
Now that I’ve explained why the explanations coming out of the initial prompt can’t be trusted, can we still obtain decisions with trustworthy explanations from such a model?
One really important property of the currently predominant LLMs and VLMs is that they use unidirectional attention: while generating a given token, the model can attend to any of the preceding tokens, but it cannot take a future token into account¹. Furthermore, output tokens are generated and fixed one at a time, in sequential order. Therefore, when you ask a model to first output Yes or No and then provide an explanation for its decision, the model has no choice but to make the decision first, and only then come up with an explanation for why it made that decision. This is the part that requires introspection, which, as we have discussed, these models don’t reliably possess.
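In the standard autoregressive formulation (generic notation, not tied to any particular model), the generated tokens factorize as

$$
p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p\bigl(y_t \mid x,\, y_1, \dots, y_{t-1}\bigr)
$$

where $x$ is the prompt (including the image) and $y_1, \dots, y_T$ are the output tokens. If the yes/no decision is the first generated token $y_1$, it is conditioned only on $x$; the explanation tokens that follow cannot influence it.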
However, you can actually avoid the need for introspection altogether by simply inverting the order in which you request outputs:
Consider the image above. Does this image show anything funny?
Please first describe your thought process, and then provide your decision of either “yes” or “no” at the end.
By asking for the reasoning first, and the conclusion only at the end, the model can reference its own explanation when generating its final decision. In contrast to the initial prompt above, the decision in this re-ordered prompt directly takes the provided explanation into account. No introspection required. Oftentimes, this prompting technique will also improve the accuracy of the decisions themselves (compare chain-of-thought prompting).
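In code, reusing the hypothetical ask_about_image helper from the first sketch, the only changes are the prompt text and the fact that the decision is now extracted from the end of the reply rather than the beginning:

```python
# Sketch: reasoning-first prompt; the decision is the *last* yes/no in the reply.
import re

PROMPT_V2 = (
    'Consider the image above. Does this image show anything funny?\n'
    'Please first describe your thought process, and then provide your '
    'decision of either "yes" or "no" at the end.'
)

answer = ask_about_image("https://example.com/squirrel.jpg", PROMPT_V2)

matches = re.findall(r"\b(yes|no)\b", answer.lower())
decision = matches[-1] if matches else None  # None if the model didn't comply
print(decision)
print(answer)
```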
Last but not least, there are also techniques for understanding what an LLM was actually thinking while generating a given output, but they are by no means trivial. Most of them come from the field of mechanistic interpretability. Incidentally, researchers at Anthropic have published some very interesting work in this field. DeepMind also runs a mechanistic interpretability team, led by Neel Nanda, and you can find many more fascinating reads on the topic in his list of favorite mechanistic interpretability papers.
1. There are some exceptions, such as Google’s BERT model, which uses bidirectional attention. ↩︎
