Imagine an artificial organism living in a complex environment. The organism is not just a passive observer. It can decide which parts of its environment it wants to explore next. It might even be able to actively influence the world around it by taking various actions.

We want the organism to learn from its experiences and build an understanding of the world it lives in.

In this post, I’m going to sketch out a high-level architecture for active world model learning. I’ll then describe a thought experiment to investigate an apparent paradox that arises when trying to incentivize exploration, and give an intuition for how it can be avoided.

What it Means to Understand the World

While this is a philosophical question and there are probably many ways to answer it, I’ll focus on the following key aspects:

Understanding means having a mental model that is capable of:

  • Estimating the effects that taking a particular action would have, given a current state of the world
  • Predicting how the organism’s environment will change in the future

Such a “world model” would be an immensely valuable tool for our little organism to have. It would allow the organism to take intentional actions that change its environment to better align with its internal goals (whatever those may be). It would also enable the organism to predict and prepare for upcoming threats before they occur, acting proactively rather than merely reacting to what has already taken place.

Now that we know what a good world model looks like, how should we set up the artificial organism’s incentives to learn one? What should the loss functions of its neural networks be, and what actions should the organism take to best improve its world model over time?

Learning a World Model

A common approach for training models that capture salient features of the world is the autoencoder (Kramer 1991).

In the autoencoder paradigm, an artificial neural network is set up to encode its inputs (such as the organism’s sensory inputs at a given point in time) into a lower-dimensional internal representation. Simultaneously, another network is trained to reconstruct the sensory input from this internal representation. The loss function is given by how accurately the inputs can be reconstructed. This reconstruction loss incentivizes the network to build an internal representation that captures the most important information needed to accurately reconstruct the initial inputs. In essence, autoencoders are a form of data compression that learns to exploit regularities and structure in the input data in order to represent it more compactly.
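
To make this concrete, here’s a minimal sketch of an autoencoder in PyTorch. All of the sizes (`obs_dim`, `latent_dim`, the layer widths) and the random input batch are arbitrary stand-ins I chose for illustration, not anything prescribed by the paradigm itself:

```python
import torch
import torch.nn as nn

obs_dim, latent_dim = 256, 32  # arbitrary sizes, chosen only for illustration

# Encoder: compresses a sensory input into a lower-dimensional internal representation.
encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
# Decoder: attempts to reconstruct the sensory input from that representation.
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

def reconstruction_loss(obs):
    z = encoder(obs)                    # internal representation
    recon = decoder(z)                  # attempted reconstruction
    return ((recon - obs) ** 2).mean()  # how poorly did we reconstruct?

# One training step on a batch of (here: random stand-in) sensory inputs.
obs = torch.randn(64, obs_dim)
loss = reconstruction_loss(obs)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```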

This is a promising start, but unfortunately the bare autoencoder paradigm is entirely ignorant of two important aspects of our organism:

  • It has no notion of time, and therefore won’t help our organism predict what will happen in the future
  • It assumes a passive learning scenario, where inputs are simply fed into the training apparatus, and does not account for the fact that our organism can steer its attention and interact with the world directly.

Autoencoders learn compact representations of the world at a given moment. But they lack a notion of time, and therefore can’t predict future changes or outcomes.

Adding Time to an Autoencoder

We want to learn not just an instantaneous representation of the current sensory input, but a predictive model: one that can tell us how the world, and the organism’s internal representation of it, are likely to change in the future.

To accomplish this goal, let’s add an additional artificial neural network module to our autoencoder. This new network will attempt to predict how the model’s internal representation is going to change, given its current value. We assign a loss based on how well this network can predict a future internal state within some time horizon: the prediction error. Remember that the internal representation is already optimized to represent the instantaneous perception of the world as accurately as possible, per the autoencoder’s reconstruction loss. So by making predictions over the internal state, our organism will implicitly be able to predict its future sensory input as well.
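
Building on the sketch above (and reusing its `encoder` and `latent_dim`), such a latent predictor and its prediction error might look roughly like this. The single-step horizon is my own simplification:

```python
# Predictor: given the current internal representation, guess a future one.
predictor = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

def prediction_loss(obs_t, obs_next):
    z_t = encoder(obs_t)                    # representation of the world "now"
    z_next = encoder(obs_next).detach()     # target: the actual future representation
    z_pred = predictor(z_t)                 # the predictor's guess about that future
    return ((z_pred - z_next) ** 2).mean()  # this is the prediction error
```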

While we’re at it, we can also give our organism a form of short-term memory. Assume that this memory allows the organism to store and later recall recent observations about the world. It can use these memories to make more accurate predictions of the future. For example, memory allows our organism to establish object permanence – remembering the location and presence of different objects, even when they are currently out of view. With object permanence, the organism can better predict what it’s going to see when its own movement, or the movement of an object, makes a previously occluded object visible again. Another good use of memory is tracking the velocity and direction of motion, which helps our organism better predict the future location of moving objects.
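
One very simple realization of such a memory (purely my own illustrative choice, reusing `torch` and `latent_dim` from the sketches above) is a rolling buffer of recent internal representations, summarized into a context that a memory-aware predictor could consume alongside the current representation:

```python
from collections import deque

MEMORY_SIZE = 16  # how many recent steps the organism can recall (arbitrary)

# Short-term memory: a rolling window of recent internal representations.
memory = deque(maxlen=MEMORY_SIZE)

def remember(z):
    memory.append(z.detach())

def memory_context():
    # Summarize the memory by averaging it (a deliberately crude choice).
    if not memory:
        return torch.zeros(latent_dim)
    return torch.stack(list(memory)).mean(dim=0)
```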

Adding a prediction model and short-term memory allows the organism to make predictions about the future.

Deciding What to Do

To complete our organism’s “brain”, we also need to add a decision module. The decision module will consider the current internal representation from the autoencoder together with the current state of the short-term memory, and then decide on an action for the organism to take in accordance with the organism’s current goals.

Once the organism has taken an action, this action will be added to the organism’s short-term memory. That way, the previously introduced prediction network can use its knowledge of recent actions to better predict upcoming sensory inputs. For example, if the organism decides to move its head to the right, the prediction network can predict that the organism’s visual field of view is going to shift accordingly. If it decides to eat a piece of food, the prediction network will learn to predict that in a future world state, the organism is going to feel less hungry.
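
In code, this could mean that the predictor also receives the most recent action. The one-hot encoding and `num_actions` below are illustrative assumptions on my part, not a prescribed design:

```python
num_actions = 4  # e.g. turn left, turn right, move forward, eat (illustrative)

# Action-conditioned predictor: current representation + chosen action -> future representation.
action_predictor = nn.Sequential(
    nn.Linear(latent_dim + num_actions, 64), nn.ReLU(), nn.Linear(64, latent_dim)
)

def predict_next(z, action_idx):
    action = torch.zeros(num_actions)
    action[action_idx] = 1.0                         # one-hot encoding of the chosen action
    return action_predictor(torch.cat([z, action]))  # imagined future representation
```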

How could this decision module work?

Assume that the organism has at its disposal a utility function over its internal world representation. The utility function takes the current internal world representation and produces a single number that represents how favorable this state is for achieving the organism’s goals (not being hungry, procreating, etc.).

The decision module can now leverage the learned world prediction model in order to make good decisions. To consider one of its available actions, it will simply “pretend” that the action has already been taken (by temporarily adding that action to the short-term memory). It will then use the prediction network to understand how this action would change the internal representation of the world in the future. The decision module can then invoke the utility function to assess how favorable this predicted future state would be. After repeating this process for each available action, the action that is predicted to lead to the most favorable world state will be taken.
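
Putting the pieces together, this decision procedure amounts to a one-step lookahead search. The sketch below reuses `predict_next` from above; the `utility` function is just a placeholder for whatever goal-dependent scoring the organism has (here an arbitrary linear score, purely for illustration):

```python
goal_direction = torch.randn(latent_dim)  # placeholder for "what the organism wants"

def utility(z):
    # Map an internal world state to a single favorability number (illustrative).
    return torch.dot(z, goal_direction).item()

def choose_action(z):
    # Pretend each action was taken, predict its outcome, and pick the most favorable one.
    best_action, best_score = None, float("-inf")
    for a in range(num_actions):
        z_imagined = predict_next(z, a)  # "what would the world look like if I did a?"
        score = utility(z_imagined)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```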

A predictive world model enables the organism to reason about the likely outcomes of its actions, make smart decisions, and act proactively.

Exploration vs Chaos

Urgent needs aside, it would be beneficial for our organism to go out and explore the world. Exploration will help it improve its world model, which in turn will better equip it to make smart decisions in the future.

A related but distinct concept is the exploration/exploitation trade-off in reinforcement learning (e.g. Berger-Tal et al. 2014). Familiarity with it will help in understanding the following sections, but is not strictly required.

Exploration – An Adversarial Problem?

Something interesting happens when we try to encourage our organism to explore.

The learning objective for its world model and the objective for its decision module diverge. In fact, they become adversarial to each other. Exploration is most effective when it is directed at scenarios that are not yet well understood by the world model, as these scenarios provide the biggest potential for improvement.

Therefore, an exploration-maximizing decision module will specifically aim to seek out situations in which the world model performs worst. In other words, it will seek out situations where the world model’s reconstruction accuracy (for the present state) is lowest and/or its prediction error (for a future state) is highest. Meanwhile, the world model will do its best to adjust in order to achieve the opposite outcome.

The world model’s learning incentives are opposite to the organism’s decision incentives!

To maximize its understanding of the world, the organism has to seek out situations that maximize prediction error, while its models will aim to minimize prediction error.
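
In this naive formulation, the exploration signal handed to the decision module would simply be the world model’s own error, roughly along these lines (a sketch reusing the pieces defined above):

```python
def naive_exploration_reward(obs_t, action_idx, obs_next):
    # Naive curiosity: the bigger the world model's surprise, the better.
    with torch.no_grad():
        z_t = encoder(obs_t)
        z_next = encoder(obs_next)
        prediction_error = ((predict_next(z_t, action_idx) - z_next) ** 2).mean()
        reconstruction_error = ((decoder(z_next) - obs_next) ** 2).mean()
    return (prediction_error + reconstruction_error).item()
```

The decision module would then favor whichever action is predicted to maximize this reward, while the world model’s training keeps pushing the same quantity down – hence the adversarial dynamic.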

The Puddle of Doom, aka Chaos

Unfortunately, this simplistic formulation of the exploration objective does not quite work.

To see why, imagine that on its quest to explore and understand the world, our organism encounters a puddle with a large pile of pebbles next to it. In its curiosity, the organism picks up a pebble and throws it into the puddle. Little does our organism know that this innocent experiment is going to spell its eternal demise.

Upon throwing the first pebble, the organism observes the wild splashes of water.

Exciting! Something new that its world model isn’t yet able to predict! Witnessing the puddle’s response to the pebble’s impact, the world model quickly adjusts. It has learned what a puddle looks like when a pebble is thrown into it. Content with this great progress in world understanding, the organism throws another pebble. But wait! This time, the splashes look completely different. In fact, every future pebble toss turns out to produce completely different splashes. One time there are 39 small drops flying off in specific directions. The next time, there are 53 of them, flying off in entirely different directions and originating from slightly different starting points. For our exploration-focused decision module, this is a happy place! Every single time it decides to throw another pebble, the world model’s prediction error is high. The world model will do its best to learn, but it will never be able to reach an accurate enough prediction.

Our organism just got stuck. It will waste years and years throwing pebbles into the puddle, trying to finally break through and find that one elusive mental model that exactly predicts the resulting splashes. Its best hope is to eventually run out of pebbles and be forced to move on. Meanwhile, it faces a future of permanent ignorance of all the other, much more useful and predictable things it has yet to learn about the world, as its prediction-error-maximizing decision module will keep drawing it back to that fateful puddle over and over.

At a fundamental level, the reason for this problem in a deterministic world is chaos. At a practical level, limitations in the world model’s expressiveness also constrain how well it can fit and predict any given scenario.

Update: Since writing this post, Maven user Marius has pointed out to me that this problem is better known in the reinforcement learning community as the “noisy TV” problem. OpenAI has a nice summary of the problem in their blog post “Reinforcement learning with prediction-based rewards”.

Any sufficiently complex world will have aspects that can’t be fully predicted by a world model. Continuous exploration of these aspects in the hope of perfecting their prediction is a waste of time.

Taming Puddles with Boredom

To tackle this problem, we need to refine the exploration objective.

Rather than seeking out situations where world model accuracy is merely poor, the decision module should seek out situations where these two conditions hold (a rough code sketch follows the list):

  1. Current world model prediction and/or reconstruction accuracy is poor
  2. As the situation is being explored, world model accuracy increases at a sufficient pace.
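
A rough way to implement both conditions (my own sketch of what is often called a “learning progress” signal, not a design given in the post) is to track a short history of errors per situation and reward the organism only while the error is both large and shrinking:

```python
from collections import defaultdict, deque

# Recent prediction errors, tracked per "situation" (however the organism identifies one).
error_history = defaultdict(lambda: deque(maxlen=20))

def exploration_score(situation_id, current_error):
    # High only while the error is still large AND has been shrinking recently.
    history = error_history[situation_id]
    history.append(current_error)
    if len(history) < 2:
        return current_error                 # too little data yet: fall back to raw curiosity
    improvement = history[0] - history[-1]   # how much the error dropped over the window
    if improvement <= 0:
        return 0.0                           # no progress: this is boredom kicking in
    return current_error * improvement
```

With a score like this, the puddle initially looks attractive (large error, rapid early improvement), but once the splashes stop becoming more predictable, the improvement term collapses to zero and the organism gets bored and moves on.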

In our human experience, we might recognize the first criterion as the drive of curiosity. The second is realized through feelings of frustration and boredom: we eventually give up or get bored when an act of exploration does not yield the reward of further insights, and move on to something else.

I’d speculate that there is some lower-level equivalent to these principles in our brains as well, one that is baked into our “world model” forming circuitry and neurochemistry, and that acts even in much more primitive organisms. It appears that internally, our brains are satisfied if they can predict a few high-level properties of water splashes (such as “there will be several drops, flying off in different directions”), rather than obsessively trying to figure out the exact visual stimuli that will be received. There might well be a relation to “invariant” learning here as well: figuring out that you can build a more abstract neural representation of water splashes that is invariant under their chaotic details. Our brains appear to be willing to abandon the goal of perfect reconstruction, as long as salient properties of a situation can still be accurately predicted. This is a fascinating topic of its own, and I’ll explore it in future posts.

Closing Notes

Focusing learning and exploration efforts on scenarios that yield the highest rate of improvement is not only useful in the context of artificial organisms, robots, and other autonomous agents.

Consider, for example, an LLM being trained self-supervised on a vast body of text. The allocation of computing resources at training time is a major consideration for such models. If we could focus training resources on those pieces of text that allow for the highest rate of improvement in model performance, that could be very valuable!

Another interesting scenario is that of an AI-based, multi-turn assistant. While assisting with a user’s inquiry, the assistant first needs to obtain a sufficiently accurate understanding of the user’s intention. Only then should it take the appropriate resolution steps or suggest a solution to the user. If the user’s initial inquiry is ambiguous or missing important details, the assistant should figure out the sequence of clarifying questions that provides the fastest path to a sufficient understanding of the user’s intentions. This is comparable to the assistant trying to quickly build a “world model” of the user’s intention.

Some of the principles around exploration laid out in this post might be applicable to these fields, though the devil is doubtless in the details, and it is not immediately obvious to me how to apply them.

Last but not least, I have to admit that I haven’t done nearly the appropriate amount of literature review on this topic. Research areas such as novelty search come to mind, and might already have many of the answers. In my defense, what’s the fun in just googling the answers though?

If you’re now shouting from your chair “Dude, all of this has been documented 30 years ago by Schmidhuber et al.! Plus you got it all wrong…”, I’d like to: 1. apologize for my ignorance, and 2. politely ask that you leave a comment below, so I can read up on how greater minds than mine have thought about these problems.

Update: I was trying to make a joke by referencing Schmidhuber, but I’ve since learned that Schmidhuber has indeed covered almost this exact topic, with a very similar proposed solution, in his 1991 paper “Curious model-building control systems”! What are the odds… There have also been other solutions to the puddle problem, several of them listed in OpenAI’s “Exploration by Random Network Distillation” (Burda et al.).

Key words: world model, active learning, embodied agents, reinforcement learning, online learning, self-supervised learning, exploration-exploitation dilemma, puddle paradox
