The Bayesian Origins of Neural Nets

[This post is on the ‘Artificial Intelligence’ strand of this blogsite]

TL;DR: Just skip down to the big picture and read the ‘The End Result’ section.

Introduction

As part of my long journey towards understanding Karl Friston’s ‘Free Energy’ theory of the brain, I stumbled upon Oleg Solopchuk’s excellent “Tutorial on Active Inference” medium.com article. At some point I will give it full justice by reading in detail; even as a tutorial, it is not a light read. For example:

“So by minimizing Free Energy, the arbitrary distribution q(s) gets closer to the posterior p(s|o), and if they match, KL term becomes 0 and Free Energy is exactly equal to surprise. So the more we minimize Free Energy by wiggling q(s), the more it becomes closer to surprise (since it’s an upper bound), and by minimizing Free Energy by wiggling parameters of p(o,s) we can minimize surprise even further.”

This is typical of impenetrable Fristonian text. The problem here is the complexity of the underlying ideas, not the Herculean task of trying to explain it. Trying to understand Friston is notoriously difficult. Reading the deluge of complex math strung together with long sequences of ordinary words forming nonsensical sentences, it becomes funny rather than profound (see the parody Twitter account @FarlKriston).

In ‘God Help Us, Let’s Try To Understand Friston On Free Energy’, Scott Alexander indirectly quotes someone’s Neuropsychoanalysis paper (without citing):

“At Columbia’s psychiatry department, I recently led a journal club for 15 PET and fMRI researchers, PhDs and MDs all, with well over $10 million in NIH grants between us, and we tried to understand Friston’s 2010 Nature Reviews Neuroscience paper – for an hour and a half. There was a lot of mathematical knowledge in the room: three statisticians, two physicists, a physical chemist, a nuclear physicist, and a large group of neuroimagers – but apparently we didn’t have what it took. I met with a Princeton physicist, a Stanford neurophysiologist, a Cold Springs Harbor neurobiologist to discuss the paper. Again blanks, one and all.”

He goes on:

“Normally this is the point at which I give up and say “screw it”. But almost all the most interesting neuroscience of the past decade involves this guy in one way or another. He’s the most-cited living neuroscientist, invented large parts of modern brain imaging, … His Am I Autistic – An Intellectual Autobiography short essay, written in a weirdly lucid style and describing hijinks like deriving the Schrodinger equation for fun in school, is as consistent with genius as anything I’ve ever read.”

Oleg’s article is a big step-up. He strongly recommends Rafal Bogacz’s tutorial and is right to do so – it is another big step up the long ladder up to the lofty heights of a real understanding of the ‘Free Energy’ theory of the brain.

My post here is an overview of Bogacz’s tutorial – a rung or two below that on the ladder. In increasing levels of technicality (and general impenetrability), some of the rungs are:

Previous posts by me are near the bottom: ‘Entropy, Intelligence and Life’ and ‘Free Energy: Hierarchical Message Passing’ (both as part of the wider ‘Intelligence and the Brain’ talk).
This post.
Oleg Solopchuk’s “Intuitions on predictive coding and the free energy principle” excellent medium.com post (“This non-technical post is on theoretical neuroscience and aims to give simple visual intuitions without entering into the underlying math”)
Oleg Solopchuk’s “Tutorial on Active Inference” medium.com article.
Rafal Bogacz’s tutorial.
“An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity” by James C. R. Whittington and Rafal Bogacz naturally follows on from the above,
The many original papers by Karl Friston himself, such as ‘The free-energy principle: a unified brain theory?’.

(Emphasize: These are only some of the rungs.)

At the top of the ladder, I suspect there are less than 100 people in the world who really understand. At the bottom rung, my aim is to provide an understanding potentially to hundreds of millions of people [hmmm: number of blog visitors might need to increase a bit though]:

old enough to have completed high-school in order to have been exposed to sufficient math (differentiation, probability, …).
young enough to be exposed to programming at school and able to code (preferably in Python, the most widely taught language in schools).

Bogacz’s paper includes ‘Try this Yourself’ suggestions for going through the math equations to get a proper understanding. He also has programming exercises, with Matlab answers in the appendix. I have previously said, this is where real understanding comes from – being able to replicate the behaviour in a simulation:

Andy Clark tells a tale of how, years ago, the philosopher Dan Dennett was talking to an eminent professor of paleontology (who he does not name). The professor complained that his students were cheating at stratigraphy by just copying out diagrams from books. Dennett responded that he should get them to play a computer game. If you want to test their understanding of stratigraphy, you don’t ask the question ‘describe how such-and-such a layer is formed’; you provide them with a game in which various knobs cause various events, such as: Deposition of sediment, erosion, intrusion of lava, and controlled fracture.

If the students can play the game by twiddling the knobs in a correct order and with the correct intensity to create the required layers, they have demonstrated that they really do understand the process and can be marked accordingly.

Note: The section numbers in the following sections map across to those in the Bogacz paper.

2: Simplest example of perception

Sense inputs are noisy and non-linear: there is a difference between how things actually are in the outside environment and how they have been sensed. (It might be useful to think of this in Kantian terms: the noumenal and the phenomenal respectively). My example: sounds made by another, with background noise as well, are converted to irregular firings of multiple neurons, each neuron tuned to respond to a particular frequency, not necessarily spaced at regular frequency intervals.

Bogacz’s example is of light intensity being indicative of object size. At one instant in time, the light intensity might indicate an object of size s₁. But the probability of there actually being an object of size s₁ might be very low. It might be more likely that what has been observed is an object of size s₂, with a particular level of noise. Bayes theorem can help us in this circumstance to make best guesses about the size of the object.

Aside: Probabilities and Bayes Theorem

(See Markov Blankets and the Goat in the Gameshow for previous introductions to Bayes theorem.)

Probabilities may be either ‘marginal’, ‘joint’ or ‘conditional’:

Marginal probability: unconditional P(A) = P(playing pack card is a heart) is just ¼. P(card is a 4) is just ¹/₁₃.
Joint probability: P(B) = P(4 of hearts) = P(A ∩ B) = ¼ x ¹/₁₃ = ¹/₅₂.
Conditional probability: P(A|B) is the probability of event A occurring, given that event B occurs. Example: P(king)=¹/₁₃ and P(face-card)=³/₁₃ but P(king | face-card)=¹/₃.

We can relate different types of probability as follows:

P_conditional(H|E) = P_joint(H ∩ E)/P_marginal(E).

And noting that joint probabilities are ‘commutative’:

P_joint(H ∩ E) = P_joint(E ∩ H)

Then:

P_joint(H ∩ E) = P_conditional(H|E).P_marginal(E)
P_joint(E ∩ H) = P_conditional(E|H).P_marginal(H)

hence Bayes Theorem:

P_conditional(H|E).P_marginal(E) = P_conditional(E|H).P_marginal(H)

or rearranging to the more familiar form:

P_conditional(H|E) = P_conditional(E|H).P_marginal(H)/P_marginal(E)

which expresses a ‘posterior’ in terms of the ‘likelihood’, ‘prior’ and ‘evidence’ (respectively):

P(H|E) = P(E|H).P(H)/P(E)

2.1: Exact solution

In Bayesian terms in this instance:

the ‘hypothesis’ is that the size is a particular value.
The ‘evidence’ is a particular light intensity.
The ‘prior’ is the probability before having being given the evidence.
The ‘posterior’ is the probability after having being given the evidence.

It is important to remember that the priors and hypotheses are probability distributions. These are represented by multiple probabilities across the full range of all possible values. The probability distribution for a die is across 6 possible values – each with probability ¹/₆. We may over time learn that the die is unfair and provide a different distribution.

2.2: Finding the most likely feature value

It is impractical to store such distributions in our heads. Distributions will tend to go from low at extremes to a peak somewhere. We can approximate a distribution as a Gaussian function with just 2 numbers: a mean (average) and a variance (how spread out).

The size estimate will change over time, dependent upon the changing environment plus noise. The best guess of the size is at the peak of the probability distribution. We can find the location of that peak using gradient descent: Note: gradient descent (specifically, stochastic gradient descent) is a standard technique in training neural nets. For this, we use differentiation with respect to time to hunt for the peak:

If the slope at our current guess is positive, it must be lower to the left hence we shift left by subtracting something from our guess value (proportionate to the slope).
If the slope at our current guess is negative, it must be lower to the right hence we increase our guess value.

Iterating, we home in to the best guess.

2.3: A possible neural implementation

‘Prediction errors’ are an important concept in understanding ‘Predictive Processing’. The gradient descent math is simplified by using two intermediate terms which are prediction errors:

how much the light intensity differs from that expected at a particular size.
how much the inferred size differs from prior expectations.

The equations can be converted to cyclic directed graphs: ‘bubble diagrams’ where each bubble represents a neuron, or vector of neurons.

Because of this relationship between the equations and neurons, we can associate the neurons with some original Bayesian purpose – such as representing the mean or the variance of a particular expected probability distribution.

I’m not going to show step-by-step bubble diagrams for each section here (you can refer to the original paper from those) but just show an overall bubble diagram at the end which brings all the ideas together.

2.4. Learning model parameters and 2.5. Learning the relationship between variables

As time goes by, we need to update our expectations based on previous experience. Differentiating with respect to the model parameters (the mean and variance) rather than time shows us how to update those mean and variance values.

The differentials can be expressed solely in terms of the neuronal values and a single parameter. This gives us something that maps very well to real neuron ‘learning’: Hebbian learning where ‘neurons that fire together wire together’ rather than relying on mechanisms like in back-propagation in artificial neural networks that are not biologically plausible.

[An exercise for the future: It would be nice to compare what we have here with online implementations of Gaussian Mixture Models, where data is fitted to one of a number of Gaussian curves, with the means and variances of those Gaussians being adjusted on-the-fly.]

3: Free-energy

Here I just skip over the math and discussion around (Friston’s) Free Energy concepts here. That is for another time. Here, I am just interested in getting from Bayes theorem to multi-layer neural nets.

4: Scaling up the model of perception

Here, sections 4 and 5 just explain development of the framework into a full network of neurons. You can ignore these sections if you want and jump straight to ‘The End Result’.

4.1. Increasing the dimension of sensory input

Scaling up from a network that tracks the distribution of just one parameter, to one with many parameters just means converting all the operators to vectors.

4.2. Introducing hierarchy

The bubble diagrams have been drawn with inputs coming in from the left and constants coming in from the right. This single-stage can be easily scaled up so that inputs cascade between stages from left to right, and the signals from the right being feedback signals from the stage above. This layering is what is found in the 6-layer neocortex of real brains.

Having already made the transformation to vectors in section 4.1 to handle multiple neurons per layer, there is a fully-connected matrix of connections between the two different types of neurons (ε and φ) within each layer.

5. Local plasticity

Scaling up to multiple neurons per layer has an unfortunate consequence of there needing to be a matrix inversion to handle learning. This means that every neuron relies upon every other neuron in order to learn which is biologically implausible.

To work around this, extra neurons are added (e neurons in the diagrams, fully connected to the ε neurons) with the benefit of having each connection’s updating dependent only on the (local) neurons (nodes) at the two ends of the connection i.e. the learning is Hebbian.

Overall neural architecture

The End Result

The end result of Bogacz’s tutorial journey is a network of neurons as shown above (not shown in so complete a form in the original paper).

The diagram above shows a 4-layer neural network, with 2 inputs per layer. There is an input layer (green), 3 internal layers (blue, red and orange) and a top stub (purple). This is the result of a journey that started with Bayesian estimation of noisy inputs. Those original Bayesian distributions with mean v_p and variance ∑, with a non-linear input function u=g(v)=θ.h(v) and producing a peak value φ, have been through many transformations, with simplifications through the introduction of ε error terms, into what looks very much like a multilayer neural network. That network still shows signs of its origins in referencing the ∑, θ, ε, v_p and φ terms.

Each layer of the network is fully-interconnected to the layer above (towards the purple) with neural weights θ. A φ neuron has an (upstream) output that is a non-linear function of a weighted sum of lower-level neuron outputs. The θ terms are the weights.

In neural nets:

that non-linear function is often the sigmoid function and
the weights are adjusted during back-propagation based on the weights and the derivative of the non-linear function.

Here, the weights are indeed the same in both directions (ε forward to v_p and φ backwards to ε) but the function and its derivative are the other way around.

The e neurons (linear nodes) have been added to help compute the prediction error and to learn the variance of the corresponding feature with a local Hebbian plasticity rule. Note that since all the e neurons are full-connected to the ε neurons, the ∑ weights indicate co-variances.

The Hebbian rule is that the weight change is based on the pre- (ε) and post- (φ) synaptic neurons, and only them (“Neurons that fire together, wire together”). This is true for both:

θ weights between ε and φ: The weights are updated based on the gradient ∂F/∂θ = ε.h(φ)
∑ weights between ε and e: The variance weights are updated according to the rule Δ∑ = α(ε.e − 1) where α is the learning rate parameter.

In the former case, the updates are proportional to the product of the pre- (ε) and post- (φ) synaptic neuron states. In the latter case, this is nearly so.

Python Code Solutions to the Exercises

The Bogacz paper includes Matlab code solutions to the exercises (numbered 1 thru 5, excepting exercise 4 which is not a programming exercise). As you are unlikely to have Matlab to run the code, you could install Octave instead which should run, perhaps with some minor modifications.

Here, I provide Python alternatives (exercises 1, 2 and 3 have been written from scratch. Exercise 5 has been lazily recoded from the paper’s Matlab solution). As with Python code in previous posts, if you don’t know how to install and run Python code, the simplest thing to do is copy and paste the code into an online Python 3 interpreter, such as https://repl.it/@jef12345/python-361 (and click run).

All exercises have been combined into a single piece of code. Just set the exercise_number variable to the number you want.

#! /usr/bin/env python

#import os, sys, random, time
import numpy as np
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import math

exercise_number = 5

if exercise_number <= 3: # Common between exercises
    def function_g(size):
      average_light_intensity = size*size
      return average_light_intensity

    def function_g_tick(size): # derivative of g w.r.t. phi (peak size probability)
      derivative = 2*size
      return derivative

    def function_f(x, mean, variance):
      std_deviation = math.sqrt(variance)
      return norm.pdf(x, mean, std_deviation)

    # Bayes theorem
    # P(H|E)= P(E|H).P(H)/P(E)
    # H:        hypothesis
    # E:        evidence
    # P(H):     prior
    # P(H | E): posterior
    # P(E | H): likelihood
    # P(E):     evidence
    def bayes_posterior(prior, likelihood, evidence):
           return prior * likelihood / evidence 

    # observed light intensity
    u = 2.0
    sigma_u = 1.0 # variance of the light intensity
    # prior: experienced inputs over the +/-1 std_deviation interval 2...4:
    v_p = 3.0
    sigma_p = 1.0

if exercise_number == 1:
    print('#########################################')
    print('Exercise 1')
    print('#########################################')

    integral = 0 # scalar integral
    steps = 500
    v           = np.zeros(steps) # actual size
    p_v         = np.zeros(steps)
    p_u_given_v = np.zeros(steps)
    p_v_given_u = np.zeros(steps)
    for i in range(steps):
      v[i] = (i+1)/100 # size 0.01...5
      p_v_given_u[i] = (i+1)/100
      p_v[i]         = function_f(v[i], mean=v_p, variance=sigma_p)
      p_u_given_v[i] = function_f(u, mean=function_g(v[i]), variance=sigma_u)
      integral += p_v[i] * p_u_given_v[i] # integral( p_v * p_u_given_v, dv)

    print("integral = %10.5f" % integral)
    # integral should be 1 but it is actually larger
    p_u = 1/integral # scale results to ensure that the posterior probabilities of all sizes p(v|u) integrate to 1

    for i in range(steps):
      p_v_given_u[i] = bayes_posterior(prior=p_v[i], likelihood=p_u_given_v[i], evidence=p_u)
      #print("%P(v)=10.5f P(u|v)=10.5f P(u)=10.5f " % (p_v[i], p_u_given_v[i], p_u))
      temp = p_v[i] * p_u_given_v[i] / p_u
      #p_v_given_u[i] = temp
      #print('%10.3f   %10.3f' % (x_array[n], y_array[n]))

if exercise_number == 2:
    print('#########################################')
    print('Exercise 2')
    print('#########################################')

    def dF_by_dphi(t, phi):
      first_term  = (v_p - phi)/sigma_p
      second_term = (u - function_g(phi))/sigma_u * function_g_tick(phi)/sigma_u
      print("dF_by_dphi: t=%3d phi=%10.5f term1=%10.5f term2=%10.5f" % (t, phi, first_term, second_term))
      return first_term + second_term

    delta_t = 0.01

    tsteps = 500
    phi    = np.zeros(tsteps)
    phi[0] = v_p # start with the mean of the prior
    for t in range(tsteps-1):
        phi[t+1] = phi[t] + delta_t * dF_by_dphi(t, phi[t])
        #print("t=%3d phi[t]=%10.5f" % (t+1, phi[t+1]))

if exercise_number == 3:
    print('#########################################')
    print('Exercise 3')
    print('#########################################')

    delta_t = 0.01

    tsteps = 500
    phi    = np.zeros(tsteps)
    epsilon_u    = np.zeros(tsteps)
    epsilon_p    = np.zeros(tsteps)
    phi[0] = v_p # start with the mean of the prior
    epsilon_u[0] = 0
    epsilon_p[0] = 0
    for t in range(tsteps-1):
        # From the equations...
        # Errors:
        #epsilon_u = (u - function_g(phi))/sigma_u #(11)
        #epsilon_p = (phi - v_p)/sigma_p #(10)
        #dF_by_dphi = -epsilon_p + epsilon_u * function_g_tick(phi)
        #phi = phi + dF_by_dphi * delta_t

        # From above, we can derive Figure 2 relationships...
        epsilon_u_dot = u   - function_g(phi[t]) - sigma_u * epsilon_u[t]  #(14)
        epsilon_u[t+1] = epsilon_u[t] + epsilon_u_dot * delta_t

        epsilon_p_dot = phi[t] - v_p             - sigma_p * epsilon_p[t]  #(13)
        epsilon_p[t+1] = epsilon_p[t] + epsilon_p_dot * delta_t

        dF_by_dphi = -epsilon_p[t] + epsilon_u[t] * function_g_tick(phi[t])
        phi[t+1] = phi[t] + dF_by_dphi * delta_t
    print("t=%3d phi[t]=%10.5f e_p[t]=%10.5f e_u[t]=%10.5f" % (tsteps-1, phi[tsteps-1], epsilon_p[tsteps-1], epsilon_u[tsteps-1]))

if exercise_number == 5:
    print('#########################################')
    print('Exercise 5')
    print('#########################################')

    mean_phi  = 5;    # mean of input from the current level
    Sigma_phi = 2;    # variance of input from the current level
    phi_above = 5;    # input from the level above
    NUM_STEPS = 2000
    DT        = 0.01  # integration step
    MAXT      = NUM_STEPS * DT    # maximum time considered
    print("Simulate from t=0 to t=%d" % MAXT)

    error     = np.zeros(NUM_STEPS+1)
    e         = np.zeros(NUM_STEPS+1)
    TRIALS    = 1000  # number of simulated trials
    LRATE     = 0.01  # learning rate
    Sigma     = np.zeros(TRIALS+1)
    Sigma[0]  = 1     # initializing the value of weight
    for trial in range(1, TRIALS+1):
        error[0]  = 0 # initializing the prediction error
        e[0]      = 0 # initializing the interneuron
        phi = mean_phi + math.sqrt(Sigma_phi) * np.random.normal()
        for i in range(1, NUM_STEPS+1):
            error[i] = error[i-1] + DT * (phi - phi_above - e[i-1])
            e[i]      = e[i-1]    + DT * (Sigma[trial-1] * error[i-1] - e[i-1])
        Sigma[trial] = Sigma[trial-1] + LRATE * (error[NUM_STEPS]*e[NUM_STEPS] - 1)

print('Plotting...')
matplotlib.style.use('default') # or 'seaborn'
# initialise graph
fig1, ax1 = plt.subplots(figsize=(6, 3))    # size of graph window

if exercise_number == 1:
    ax1.set_title('Exercise 1'.format('default'), color='C0')
    # Distribution plots
    ax1.plot(range(steps), p_v,         'b', label='P(v)',      gid = 'AX1_Pv')
    ax1.plot(range(steps), p_u_given_v, 'y', label='P(u | v)',  gid = 'AX1_Pu_given_v')
    ax1.plot(range(steps), p_v_given_u, 'g', label='P(v | u)',  gid = 'AX1_Pv_given_u')

    # Graph niceties: ranges and legend
    ax1.set_xlim(left=0, right=500)
    ax1.set_ylim(bottom=0.0, top=1.0)

if exercise_number == 2:
    ax1.set_title('Exercise 2'.format('default'), color='C0')
    # Distribution plots
    ax1.plot(range(tsteps), phi,         'b', label='phi',      gid = 'AX1_phi')

    # Graph niceties: ranges and legend
    ax1.set_xlim(left=0, right=tsteps)
    ax1.set_ylim(bottom=-0.1, top=5.0)

if exercise_number == 3:
    ax1.set_title('Exercise 3'.format('default'), color='C0')
    # Distribution plots
    ax1.plot(range(tsteps), phi,       'b', label='phi',      gid = 'AX1_phi')
    ax1.plot(range(tsteps), epsilon_u, 'y', label='epsilon_u',  gid = 'AX1_epsilon_u')
    ax1.plot(range(tsteps), epsilon_p, 'g', label='epsilon_p',  gid = 'AX1_epsilon_p')

    # Graph niceties: ranges and legend
    ax1.set_xlim(left=0, right=tsteps)
    ax1.set_ylim(bottom=-2, top=5.0)

if exercise_number == 5:
    ax1.set_title('Exercise 5'.format('default'), color='C0')
    # Distribution plots
    ax1.plot(range(TRIALS+1), Sigma,     'b', label='Sigma',      gid = 'AX1_Sigma')
    # Graph niceties: ranges and legend
    ax1.set_xlim(left=0, right=TRIALS)
    ax1.set_ylim(bottom=0, top=3)

ax1.legend()
plt.show()