This posting forms part of the talk ‘Intelligence and the Brain’. On its own, it provides an as-simple-as-I-can-make-it introduction to the ideas underpinning Karl Friston’s ‘Variational Free Energy’ theory, which links what the brain does to create intelligence to entropy, just as life is related to entropy.
15. Variational Free Energy
‘Variational Free Energy’, posited by Prof. Karl Friston (University College London) as recently as 2006, has been referred to as a ‘unified brain theory’ and can be seen as such in 2 ways:
- It offers an overarching theory of what the brain is actually doing.
- A number of existing partial theories can be seen as special cases of it.
To summarize the theory in just 7 words, the theory’s slogan is ‘minimization of surprise through action and perception’. To introduce the theory, I need to cover 2 strands:
- Entropy and ‘Free Energy’: concerning information theory, and
- The Bayesian Brain: concerning probability.
Let’s look at the first of the 2 strands: entropy. Entropy comes in 2 flavours:
- ‘Shannon entropy’, named after Claude Shannon, who invented the concept in 1948 along with Information Theory (the foundation of the signal processing that lets you get such good mobile or broadband throughput from such a small battery, or from the poor bit of ‘wet string’ that is the phone line into your house).
- Classical entropy, part of thermodynamics, developed by various 19th-Century physicists (not least Ludwig Boltzmann, around 1877), ultimately motivated by how to build better steam engines.
In Information Theory, entropy is a measure of ‘unpredictability’ or ‘information content’. If I have a 4 Megabyte file that is just the 4 letters ‘blah’ repeated a million times, there is much less than 4 Megabytes’ worth of information there. The message “’blah’ repeated a million times” takes just 31 characters (bytes). If you zipped up the 4 Megabyte file, it would shrink to a tiny fraction of its size. Zipping up (compressing) files is one reasonable way to gauge the amount of information contained in a file. Or, if the file is audio, use MP3 compression instead. ‘Unpredictability’ is related to ‘surprise’ (as in the slogan ‘minimization of surprise through action and perception’) and surprise can be defined mathematically:
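That compression claim can be checked directly with Python’s standard zlib module (‘blah’ repeated a million times is 4,000,000 bytes):

```python
import zlib

# The 4 letters 'blah' repeated a million times: 4,000,000 bytes.
data = b"blah" * 1_000_000

# Compress it; highly repetitive data has almost no information content.
compressed = zlib.compress(data, level=9)

print(len(data))        # 4000000
print(len(compressed))  # a tiny fraction of the original size
```

The compressed size is nowhere near the 31 bytes of the English description, because real compressors carry format overhead, but it makes the point: the file’s size vastly overstates its information content.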
‘surprise’ = -log(P)
where P=probability. So,
- If I toss a coin and it comes up heads, there’s not much surprise: -log2(0.5) = 1 bit.
- Whereas if I provide you with some numbers that turn out to be next week’s winning 1-in-a-million lottery numbers, you will be very surprised: -log2(1/1000000) ≈ 19.9 bits.
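These two examples can be computed directly (a minimal sketch; using log base 2, so surprise is measured in bits):

```python
import math

def surprise(p: float) -> float:
    """Shannon surprise (self-information) in bits: -log2(P)."""
    return -math.log2(p)

print(surprise(0.5))            # coin toss coming up heads: 1.0 bit
print(surprise(1 / 1_000_000))  # winning the lottery: about 19.9 bits
```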
Thermodynamic entropy is commonly described as being a measure of disorder but is perhaps better understood in terms of ‘energy dispersal’. In abstract terms:
entropy, S = k.log(W)
…where W is the number of microstates (possibilities) and k is Boltzmann’s constant.
To provide extreme (cosmic) examples of ‘energy dispersal’:
- At the big bang, everything in the universe was together, hence S is low.
- At the eventual ‘heat death of the universe’, where the universe is expanding but all the stars have died out, S will be very high.
To provide a more down-to-earth example, if we start with a box with different compartments in which a particular gas is in just one compartment (a state of low entropy) and then open the doors between the compartments, when we come back an hour later, we will find that the gas has dispersed to all compartments (a state of high entropy).
This dispersal – ordinary diffusion, driven by the random ‘Brownian motion’ of the gas molecules – is an example of that fundamental law of physics, the second law of thermodynamics. There is a tendency towards disorder (increased entropy). We would be very surprised if we started with the high-entropy state and came back an hour later to find the box in a low-entropy state. We would suspect interference by some intelligent creature. In theoretical physics, there is a thought experiment involving one such intelligent creature – ‘Maxwell’s Demon’ – which opens and closes the compartment doors at will in order to trap all the gas in a single compartment.
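The box example can be simulated in a few lines. This is a toy sketch (4 compartments, 1000 particles, a lazy random walk standing in for molecular motion): the Shannon entropy of the occupancy distribution starts at zero and climbs towards its maximum once the ‘doors’ are opened.

```python
import math
import random

random.seed(0)

N_COMPARTMENTS = 4
N_PARTICLES = 1000

def entropy_bits(positions):
    """Shannon entropy (in bits) of the particles' compartment occupancy."""
    h = 0.0
    for c in range(N_COMPARTMENTS):
        p = positions.count(c) / len(positions)
        if p > 0:
            h -= p * math.log2(p)
    return h

# Low-entropy start: all the gas in compartment 0.
positions = [0] * N_PARTICLES
print(entropy_bits(positions))  # 0.0

# 'Open the doors': each step, every particle hops left, right, or stays.
for _ in range(200):
    positions = [(pos + random.choice([-1, 0, 1])) % N_COMPARTMENTS
                 for pos in positions]

print(entropy_bits(positions))  # close to the maximum, log2(4) = 2 bits
```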
Information Theory (or Shannon) entropy was originally formulated by Shannon in an analogous way to thermodynamic entropy, but it has subsequently been shown that the 2 concepts are in fact related in a fundamental way. Because Information Theory entropy is even more abstract than thermodynamic entropy, I am going to relate things to thermodynamic entropy in what follows here, but note that Friston’s ‘Free Energy’ theory concerns the Information Theory variety. ‘Free Energy’ is a thermodynamic concept very similar to entropy that was subsequently taken across to Information Theory. As far as this talk is concerned, we need not make any distinction.
17. Thermodynamic Entropy and Life
In a series of public lectures in Dublin in 1943, the physicist Erwin Schrödinger famously asked ‘What is Life?’ and the answer he gave is that living things maintain their own order at the expense of their surroundings. They do not depend on some magical law that counteracts the second law of thermodynamics. They just ‘export’ their own disorder. Schrödinger is, of course, famous for thought experiments in which cats are put in boxes. Imagine if we put a cat in a box, not with some radioactive apparatus but with a mouse. When we look later, we would expect to find the cat, a mouse carcase and some cat faeces. The cat has maintained its own order by transforming the mouse into something less ordered. The order that living things maintain by exporting disorder in this way has become known as ‘negentropy’.
An example near the other end of the biological scale is the sodium-potassium ion pump. Thousands of these small biological machines sit in the membranes of neurons and other cells. Each collects 2 Potassium ions from outside the cell and 3 Sodium ions from inside the cell and swaps them around, bringing the Potassium in and sending the Sodium out. (This is a significant component in creating a voltage across the cell membrane – the ‘membrane potential’ – that allows neurons to fire.) I want to draw a comparison here between the sodium-potassium ion pump and Maxwell’s Demon: imagine starting off with 2 compartments in a box containing a mixture of gasses, with Maxwell’s Demon sorting the molecules so that eventually they are separated out. These little machines are locally working against the natural tendency towards disorder (but they need energy to operate, hence, on a larger scale, the second law of thermodynamics is not violated).
(For a stunning tour of the sodium-potassium ion pump, Click here.)
18. The Bayesian Brain
To understand Bayesian inference, we need to understand Bayesian probability as opposed to the classical interpretation of probability.
The classical interpretation is sometimes called ‘frequentist’. Probability represents a ‘propensity’, based on how things turn out ‘in the long run’. So, as we were taught in school, we can employ such probabilities when we find ourselves picking a red or black ball out of a bag at random when we conveniently know there are 100 red and 300 black balls in the bag.
In contrast, the Bayesian interpretation is a form of ‘subjective probability’ in which the probability represents a degree of belief. Thus a probability of ‘1-in-a-million’ means ‘not likely!’ rather than ‘I have conducted experiments on this and replayed this scenario billions of times and I find that the chance of x happening is 0.000001’.
Next, we need to understand Bayesian inference. Inference is about deriving conclusions from assumptions. David Hume famously considered the philosophical problems of inferring that the sun will rise tomorrow because it has risen every day before that. And Bertrand Russell famously then gave the example of the chicken that infers that the approaching farmer is bringing food because that is what has happened every day beforehand – but today is the day the farmer instead just picks up the chicken and wrings its neck.
We do not need to concern ourselves with the maths here but Bayesian inference is based on Bayes theorem:
P(H|D).P(D) = P(D|H).P(H)
which can be expressed as
P(H|D) ∝ P(D|H).P(H)
which is interpreted as
posterior ∝ likelihood . prior
- We start with a prior degree of belief.
- New evidence comes along.
- We then calculate the new (posterior) degree of belief, based on our previous degree of belief and the new evidence. This new degree of belief can be more or less than it was before, depending on the evidence.
We modify our predictions as a result of new information. And in using Bayes theorem to do it, this modification is optimal – which sometimes gets equated with being ‘rational’.
A real-world application of Bayesian inference is in spam filters for e-mail. When you receive an e-mail, the spam filter decides whether it will be put into your inbox or junk folder (recall that the etymology of the word ‘intelligence’ is ‘to choose between’). When you move an e-mail from the junk folder to the inbox, you are telling it that it got it wrong. At this point, it will try to learn how to choose between junk and non-junk better, given this new information. Ideally, it would look afresh at all the e-mails that it has ever received and try to work out how best to discriminate junk mails. But it cannot look back at all those e-mails – many of the inbox e-mails and virtually all of the junk e-mails have been deleted. The data has been thrown away. All it has to go on is the statistics about those e-mails that it has collected. And it is these statistics that are updated in a Bayes-optimal fashion such that, after the early days, it gets things right the vast majority of the time. It will adapt itself to the environment of your e-mails – even if you’re working in the banking sector of Nigeria.
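A toy sketch of the kind of statistics such a filter might keep: per-word counts under each label, updated incrementally (the e-mails themselves having been thrown away), then combined via Bayes theorem in the ‘naive Bayes’ style. The word lists and counts here are invented for illustration, not taken from any real filter.

```python
from collections import Counter

class ToySpamFilter:
    """Keeps only summary statistics (word counts per label), not the e-mails."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.label_counts = {"spam": 0, "ham": 0}

    def learn(self, words, label):
        # Called when the user files (or re-files) an e-mail. The e-mail can
        # then be deleted - only these statistics survive.
        self.word_counts[label].update(words)
        self.label_counts[label] += 1

    def p_spam(self, words):
        # Naive-Bayes combination: P(label) times the product of P(word|label),
        # with add-one smoothing so unseen words don't zero everything out.
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.word_counts[label].values())
            score = self.label_counts[label]
            for w in words:
                score *= (self.word_counts[label][w] + 1) / (total + 2)
            scores[label] = score
        return scores["spam"] / (scores["spam"] + scores["ham"])

f = ToySpamFilter()
f.learn(["win", "lottery", "now"], "spam")
f.learn(["meeting", "agenda", "monday"], "ham")
print(f.p_spam(["lottery", "win"]))     # well above 0.5: looks like junk
print(f.p_spam(["meeting", "monday"]))  # well below 0.5: looks legitimate
```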
And neuroscientists are increasingly understanding the way the brain works in Bayesian terms.
(There are many more-thorough explanations of Bayesian inference available on the net – generally about an hour long but stop by Daniel Wolpert’s 20-min ‘The Real Reason for Brains’ TED talk on the way.)
19. Entropy and the Bayesian Brain
In order to provide a very basic explanation of the notion:
‘Minimization of surprise through action and perception’
…I am going to look at a rather contrived example: answering a question on the game show ‘Who Wants to be a Millionaire?’.
Firstly, imagine if we were forced to give an answer before the question is even asked! We would be rather confounded. There is no information! All 4 possible answers are equally likely. It would just be a random guess:
- A: 25%
- B: 25%
- C: 25%
- D: 25%
But we are presented with the question:
‘What is the Capital of Australia?’
and, as a result, we have some expectations:
- Sydney 35%
- Melbourne 10%
- Adelaide 5%
- Brisbane 5%
- Perth 2%
- Banana 0%
- 350ml 0%
When presented with the 4 possible answers, our options are narrowed down and our expectations change:
- A: Melbourne 12%
- B: Sydney 38%
- C: Canberra 7%
- D: Brisbane 5%
A change in stimulus causes a change in expectation. Perception is an active process (recall Richard Gregory’s ‘perception as hypothesis’).
We then decide to go ‘50:50’. This leads to a big surprise – both of the most likely candidates have been removed:
- A: Melbourne 0%
- B: Sydney 0%
- C: Canberra 30%
- D: Brisbane 15%
There is a big difference between the previous (prior) and current (posterior) expectations. This large prediction error represents a significant information gain. (The difference between prior and posterior expectations is measured by the ‘relative entropy’ or, more impressively, the Kullback-Leibler divergence.)
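The prediction error can be put into numbers. A minimal sketch, using the percentages from the two stages above (normalized, since the deliberately rough figures do not sum to 100), computing the Kullback-Leibler divergence of the posterior from the prior in bits:

```python
import math

def kl_divergence_bits(posterior, prior):
    """Kullback-Leibler divergence D(posterior || prior), in bits."""
    return sum(p * math.log2(p / q)
               for p, q in zip(posterior, prior) if p > 0)

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

# Expectations over (Melbourne, Sydney, Canberra, Brisbane).
prior = normalize([12, 38, 7, 5])      # after the 4 answers were shown
posterior = normalize([0, 0, 30, 15])  # after 50:50 removes the favourites

print(kl_divergence_bits(posterior, prior))  # roughly 2.4 bits: a big surprise
```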
As an alternative to the Popperian idea of imagining actions in our head so that the bad ones may ‘die in our stead’, action out in the environment may be seen as ‘performing the experiment to optimize the model’.
After another action, ‘Ask the Audience’, our inclination that Canberra is the right answer is confirmed, leading us to choose answer C:
- A: Melbourne 0%
- B: Sydney 0%
- C: Canberra 90%
- D: Brisbane 5%
With so much expectation concentrated in one ‘compartment’, this is analogous to the low-entropy state of the box before the compartment doors were opened. Thus, a combination of perception and action has changed our expectations from a high-entropy distribution to a low-entropy one. I have tried to provide as simple an explanation as possible of the slogan:
‘Minimization of surprise through action and perception’
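The falling entropy across the four stages of the quiz can be computed directly (a minimal sketch; the belief weights are normalized before the entropy is taken, since the rough percentages do not sum to 100):

```python
import math

def entropy_bits(weights):
    """Shannon entropy (in bits) of a belief distribution."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)

stages = [
    ("before the question",    [25, 25, 25, 25]),
    ("four answers shown",     [12, 38, 7, 5]),
    ("after 50:50",            [0, 0, 30, 15]),
    ("after Ask the Audience", [0, 0, 90, 5]),
]

for name, weights in stages:
    print(f"{name}: {entropy_bits(weights):.2f} bits")
# Entropy falls stage by stage, from the maximum of 2 bits towards 0.
```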
There seems to be an interesting relationship here. Just as
life is about counteracting thermodynamic entropy,
intelligence is about counteracting information theory entropy.
The acceptance of intelligence in the form of ‘artificial intelligence’ divorces intelligence from life. But the understanding presented above shows why we should not be surprised that intelligence has arisen from life.