## Space…

In his 1957 book ‘Cosmic View : The Universe in 40 Jumps’, the Dutch educator Kees Boeke demonstrated the vastness of scale in the Universe. The book starts with a picture of a girl on a chair with a cat on her lap. It then zooms out by a factor of 10 to show her in the courtyard of Boeke’s school (incongruously showing a whale lying on its side!), jumping from objects at a human scale (1; 1 metre) to 10 metres. This zooming out continues page by page, to show the fields and houses around the school (100m), in the village of Bilthoven (1km), in the region around the city of Utrecht (10km) in the Netherlands (100km), within Europe (1000km) on planet Earth (10,000km). It continues page by page, bringing in the Moon, the whole of the Solar System and on to the Milky Way galaxy and every galaxy in the Universe (1025m). It then returns to the girl and the scaling goes in the other direction, focusing in on her hand (10cm) to see a mosquito (1cm), a water mite (1mm), bacteria (100μm), Smallpox virus (10μm), bacteriophage (1μm), flagellum (100nm), salt crystal (10nm) and atoms (1Å = 1 Angstrom = 10-10m) to finally stop at the nucleus of a Sodium atom (10-13m).

All 40 stages are shown in my collation above but each image can bee seen fully in the PDF of the book.

## … and Time

Boeke’s book shows the vastness of scales of space; an equivalent book that showed the vastness of time scales would be more difficult. His book could have restricted itself to plain physics, with the reference point being a rock, say, rather than the girl, but this would have been far less interesting. Instead, it shows a range of biology, from viruses up to a whale.

My graphic below attempts to show both space and time, and additionally showing ‘complexity’ on a third axis, going up from Physics, through Biology up to human culture. A grid is provided so that I can refer to regions by grid reference. Each square shows an increase of one order of magnitude in time and space. So the square at the very bottom (location red-Z3) represents a shift in space from 10-18 metres to 10-17m when heading one square towards the top-left and represents a shift in time from 10-24 seconds to 10-23s heading one square towards the top-right. Overall, the red surface represents scales of time and space from 10-18m (quarks, leptons and bosons) and 10-24 seconds to 1027m and 1018s (the size and age of the Universe). The lower limits are much larger than the fundamental smallest possible units – Planck units – of 10-35s and 10-43s but extending that far wouldn’t provide any more illumination. At the other end of the scale, extending beyond the current age and size of the Universe is nonsensical if you subscribe to the ‘heat death’ fate of the universe rather than a ‘Big Crunch’. The precise choice of the limits given here is arguable and the same applies for the other colored layers in the graphic, in what follows.

The ‘origin’ of the chart (location green-H10) is at the reference point of:

• distance = 1 metre,
• time = 1 second

at which a human (such as the girl in the book) is located. Moving along the axes:

• towards E18 increases the timescale.
• towards L2 decreases the timescale.
• towards A8 increases distances.
• towards S13 decreases distances.

The z axis (1 metre, 1 second), going from red-L10 to grey-E10 represents an increase in complexity.

## Complexity: From Quark to Muster Mark

To explain the picture, let’s start with the red Physics rectangle. Quantum mechanics explains how the behavior of fundamental particles (quarks, leptons and bosons) leads to the emergence of the structure of atoms and how they bond. That is, Chemistry emerges (dark-orange-S6), starting at a larger scale of time and space. More statistical mechanics, namely thermodynamics, then accounts for behavior up to the scale of large stars, 100 times larger than the Sun (dark-orange-E14). Large atoms then take us into the emergence of life and the realm of Biology:

1. Molecular Biology is at the lowest level (orange-R6 to E14), then going up through
2. cells (yellow),
3. tissue and organs (light-green),
4. organisms (green, with humans particularly in mind),
5. immediate family of organisms (cyan), and
6. groups of organisms (light-blue), and
7. large populations of organisms i.e. societies (blue).

This can be seen more clearly below: I have attempted to give labels to each level corresponding to the study discipline at that level, those being:

1. Molecular Biology
2. Cell Biology
3. Anatomy
4. Physiology
5. Ethology
6. Anthropology, and
7. Sociology

These 7 biological levels are the same ones as given in the ‘Hierarchical Mechanical Mind’ for the biopsychosocial ‘Period Table of the Human Sciences’ (extending beyond the ‘Period Table of the Human Sciences’ to encroach upon the Humanities), as advocated by Gerhard Medicus.

Understandably, the hierarchy in my diagram is similar to that in others. For example, Rupert Riedl, via Gerhard Medicus’s powerpoint: Relating my graphic to the ‘story’ given in Boeke’s book, the book takes a linear path from the girl at location green-H10 up to the whole universe at location red-D13 and then from the girl down to atoms at around location red-U5.

Across the biological layers (‘Molecular Biology’ to ‘Ethology’) evolution applies, tying together the DNA of the genotype (Molecular Biology) to the ‘extended phenotype’ of the organism, possibly including its immediate environment and even right up to the Economics level.

Note I have put Economics above Sociology. Perhaps you would argue to put it at the same level. But by putting above, it does not interfere with the seven Life-and-Humanities levels.

Stepping from one level up to the next may be the result of higher-level order arising from lower-level order. But in a number of instances, order arises from disorder. For example, there are the familiar examples of statistical mechanics:

• Physics: quantum mechanics, and
• Chemistry: thermodynamics.

But there are two other examples here where higher-level order arises as a result of the quasi-random behaviour of low-level ‘agents’ (as a result of the huge number of them causing statistics to be well-behaved):

• Evolution, and
• the Economics of the free market.

Up to the level of the organism, the smallest relevant scales in time and space continue to get larger with each step up from one layer to the next, as you expect where higher-level behavior arises from the lower-down through statistical mechanics. (Hindsight: The economics layer should have been drawn set back from Sociology.)

The entire edifice of physical, life and social sciences that has built up is then able to support The Arts – not least support for the writing of literature such as James Joyce’s ‘Finnegan’s Wake’. Complexity has risen from Three Quarks to ‘Muster Mark’!

## Cranes lifting Cranes

The complexity has not been achieved in a single step. Self-organization at many levels has occurred to lift the Universe up to ourselves and what we can produce. The Sciences can provide a linked story of explanation that does not need any external intervention.

Part of this story is examined in Dan Dennett’s 1995 book ‘Darwin’s Dangerous Idea’, in which he talks of ‘skyhooks’ and cranes for constructing complexity, there being no such thing as a hook that magically hangs down from the sky. But there are cranes that allow building upwards whilst being grounded upon lower levels.

So, we should not imagine one crane to do the heavy lifting but many – of cranes lifting cranes. And we don’t need to just imagine this – we can actually see the analogy…

## Postscript 1

An indication of the motivation behind this exploration of Time, Space and Complexity…

Below shows Table 1 ‘Spatiotemporal scales and examples’ from ‘Parcels and particles: Markov blankets in the brain’ by Friston, Fagerholm, Zarghami, Parr, Hipólito, Magrou and Razi. ## Postscript 2

Thanks to Sergey Karelov (https://medium.com/@sergey_57776) for his endorsement of this post at ‘There was no plan for the creation of the world, but there was an ingenious plan’.

His post adds some good references

And…

[Although this post concerns Artificial Intelligence, it is in the Neuroscience: Predictive Mind strand of this blogsite (see drop-down tabs) because of its context]

This post follows on from the previous, on Karl Friston’s Dynamic Causal Modelling of COVID-19. The modelling uses the SPM (Statistical Parametric Mapping) ‘toolbox’ (software library) for MATLAB, originally created by Friston for brain imaging analysis, which is freely available at https://www.fil.ion.ucl.ac.uk/spm. The previous post looked at the DEM algorithm applications within the same spm12\toolbox\DEM toolbox directory where there are other programs concerning Variational Bayesian methods and Friston’s Free Energy Principle. At the heart of the most of these programs is either the spm_DEM function, or spm_ADEM (‘DEM with active inversion’).

Whilst SPM is freely available, Matlab is not. The near-alternative, Octave, does not work with SPM. So whilst I have had access a Matlab licence, I have logged results to text files so that I can examine the simulation results in Python – which is freely available, of course. This is a stepping stone towards understanding the Dynamic Expectation-Maximization algorithm (see reference ‘DEM: A variational treatment of dynamic systems’), that lies at the heart of these functions. The Dynamic Expectation-Minimization algorithm is an extension of the Expectation-Maximization algorithm that I have looked at previously in Gaussian Mixture Models and in K-means.

All-in-all this is a rather disappointing post! – and in so many ways:

• The code is very difficult to understand, because of the complex data structures.
• Logging the results does not reveal as much as I had hoped.
• There is only 1 simulation with a high level of activity (ADEM_Learning) to get much insight from.
• That simulation purports to have reinforcement learning-like behaviour (more on this in a later post) but it only deals with behaviour after
• Lastly, the pretty pictures at the end are the closest I’ll get to holiday snaps this year.

It does at least valuable clues to help understand the academic papers that use the spm_ADEM (most significantly in this and this), by showing use the nature (size, shape, content) of the variables that can be tied to the formulae in those papers. And they provide the basis for visualizing  any future Python-coded DEM algorithms.

The data and scripts provided (now on GitHub) allow you to extract information (possibly different information) from these simulation and plot in whatever way you want, without needing to run anything other than Python.

## The Dynamic Expectation-Minimization Algorithm

I only look at the spm_ADEM (‘DEM with active inversion’) function. The code that calls it in the various simulations looks innocent enough:

but the DEM variable is a huge structure – of arrays of structures of arrays. Some of the arrays are ‘cell arrays’ (each item in the array can have a different data type) and many arrays are ‘sparse’ (in which most values are zero so it is more efficient space-wise to store as a list). Many items within DEM are setup before the call and are not modified (i.e. they are input parameters). Many items are initially empty structures that are filled in (i.e. they are outputs). Many are both used and set (i.e. effectively: passed by reference rather than value).

The result is code that is very difficult for an outsider to understand when you don’t understand what the function itself is doing.

Thus, I have endeavored to see out what is going on inside DEM, resulting in the plots below. I have:

1. Updated spm12/toolbox/DEM/spm_ADEM.m to record (‘dump) most of the variables to a file after they have been assigned.
2. Ran this on most of the simulations for which graphical output is capture in the previous post.
3. Discovered that there is very little activity in most of the files and that ADEM_Learning simulation produces by far the largest ‘dump file’.
4. Examined these files in Linux (actually in Cygwin = a Linux terminal on a PC) to see where there is information worth plotting.
5. Post-processed the ADEM_Learning dump file to reduce its size by removing uninteresting information.
6. Running a Python program plot_adem.py to produce the graphs shown below.

I had also dumped top-level variables in a simulation e.g. the variables within spm12/toolbox/DEM/ADEM_Learning.m but there was very little activity there.

I have finally embraced GitHub as a repository for my code and data. You can access it here:

You don’t need a GitHub account to access the information. You can download the files as a Zip file but you can even avoid this – there are not so many files that you cannot download them separately.

This means that you can recreate the graphs shown, rotate the 3-D graphs yourself to see the data better, or plot the data differently without installing Matlab and SPM12.

The code works on the ADEM_Learning dumpfile (post-processed) but can be modified easily enough to work on the other (much smaller, less interesting) dumpfiles, only 3 of which have been uploaded so far. The ADEM_Learning dumpfile produced by Matlab (i.e. un-processed) is 1.7Gbytes!

The ‘ADEM_learning’ program, which has the relatively large level of activity, is described as follows in its comments banner:

“Value learning demo using the mountain car problem. This demo questions the need for reinforcement learning and related paradigms from machine-learning, when trying to optimize the behavior of an agent.  We show that it is fairly simple to teach an agent complicated and adaptive behaviors under the free energy principle.  This principle suggests that agents adjust their internal states and sampling of the environment to minimize their free energy.  In this context, free energy represents a bound on the probability of being in a particular state, given the nature of the agent, or more specifically the model of the environment an agent entails.  We show that such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. The result is a behavioral policy that reproduces exactly the policies that are optimized by reinforcement learning and dynamic programming. Critically, at no point do we need to invoke the notion of reward, value or utility.”

To repeat the snapshot outputs from the simulation… An additional disappointment is that it seems that the exploration of the environment (phase space: a 1-dimensional ‘position’ on the x axis and its velocity on the y-axis) has already been done. See far right picture (‘click car for movie’) for the altitude-versus-x-position curve. Starting at the bottom of the curve, the objective is to get to the stable (zero-gradient) position at the green dot (i.e. without rolling back down the mountain).

All that is left is to find a suitable trajectory in phase space: the mountain car must go backwards uphill to the left (smaller x position) to then gain enough momentum to get up the hill on the right (larger x position) to the stable point (green circle, above) – see how the red curve bottom-right spirals back and forth (x-axis position going backwards-left and forwards-right until the large arc upwards-faster and forwards-right to then spiral into a new stable position, higher up the mountain.

## A Cursory Look over the Code

spm_ADEM implements a variational Bayes (VB) scheme under the Laplace approximation to the conditional densities of states (u), parameters (p) and hyperparameters (h) of any analytic nonlinear hierarchical dynamic model, with additive Gaussian innovations.

It comprises three variational steps (D,E and M) that update the conditional moments of u, p and h respectively

• D: qu.u = max <L>q(p,h)
• E: qp.p = max <L>q(u,h)
• M: qh.h = max <L>q(u,p)

where qu.u corresponds to the conditional expectation of hidden states x and causal states v and so on.  L is the ln p(y,u,p,h|M) under the model M. The conditional covariances obtain analytically from the curvature of L with respect to u, p and h.

The D-step is embedded in the E-step because q(u) changes with each sequential observation.  The dynamical model is transformed into a static model using temporal derivatives at each time point.  Continuity of the conditional trajectories q(u,t) is assured by a continuous ascent of F(t) in generalised co-ordinates.  This means DEM can deconvolve online and represents an alternative to Kalman filtering or alternative Bayesian update procedures.

Note: F is the Free Energy.

At the highest level of description, the code:

1. Makes local copies of the DEM inputs [I am not familiar with Matlab: are these copies of the data or copies of the references?].
2. Performs some vector calculus operations (possibly this includes the ‘active inversion’?)
3. Iterations of the D-step, the E-step and the M-step in turn.
4. Copying results to the DEM

Here are the main functions that spm_ADEM call (some are Matlab; most are from SPM12):

• deal: for cross-assigning cell arrays and structure arrays (analogous to dealing cards in turn).
• kron: The Kronecker Tensor Product e.g. if two matrices A and B are both 2×2 matrices (containing elements ai and bj), the product K is a 4×4 array which containing every combination of ai x bj.
• spm_ADEM_set: performs checks on the DEM
• spm_cat: convert cell array into a matrix.
• spm_DEM_eval: [E,dE,f,g] = spm_DEM_eval(M,qu,qp): evaluates state equations and derivatives for DEM schemes
• spm_DEM_qU: display conditional estimates
• spm_DEM_z: ‘creates hierarchical innovations for generating data’: adds noise.
• spm_dx: [dx] = spm_dx(dfdx,f,t): ‘Integration of a dynamic system using local linearization’
• spm_inv: inverse for ill-conditioned matrices.
• spm_logdet: logarithm of the Determinant of a matrix.
• spm_speye: creates an Identity Matrix (zeros except 1s on the leading diagonal).
• spm_svd: a Singular Value Decomposition function that uses sparse arrays. is a generalization of Eigendecomposition which represents a matrix in terms of its eigenvalues and eigenvectors.
• spm_unvec: Unvectorize an array.
• spm_vec: ‘vectorize’ an array [?]
• trace: sum of elements on the leading diagonal

Some of this may stir faint recollections of matrix algebra from long ago.

The Jacobian matrix is a matrix contain the partial derivatives of every element of vector function F w.r.t every element of vector X. The element of the matrix J in row i, column j, Jij = ∂Fi/∂Xj. Whilst we are introducing matrix terminology, the Hessian matrix contains all the partial second-derivatives ∂Fi/∂X of a vector function and is the Jacobian of the gradient of the function: H(f(x)) = J(∇f(x)).

Priors and posteriors are terms (encountered previously) concerning Bayesian probability. ReML is the ‘Restricted Maximum Likelihood’. The mean (average) is the first moment and variance is the second moment.  When dealing with vectors (of size n), the mean is a vector of size n, and the covariance is an n x n vector of variances. Curvature concerns the second derivatives.

The variable naming convention used for first and second derivatives is that, for example, dy/dx and d2y/dx2 are the represented by variables dydx and dydxx respectively.

Functions added by me are disp_scalar, disp_vec, disp_array and disp_variable, the last one superceding the previous ones in being generic and being able to cope with all data types. They just print out variable values to the dump file.

## The Plots

The title of each plot gives some information about the information plotted (as far as my current understanding lets me) and indicates the variable on the y-axis (two-dimensional graphs) or z-axis (two-dimensional graphs). Axes labelled t refer to the DEM iteration number, and points are colored according to the spectrum from red (t=1) to blue (t=16). In some cases, only some t values are displayed, to help see what is going on. All the simulations use a fixed 16 iterations rather than detecting convergence but convergence typically occurs in 2 or 3 iterations. In some cases, not all t values are displayed for clarity.

Other axes are called X[:n] or Y[:n] where n indicates the size of the vector. For many X, n=512. For some X and Y, there are only a small number of integer datapoints but these are still plotted on a continuous axis. At this stage, there is no understanding of what these variables actually are!    ## Using SPM for Dynamic Causal Modelling of COVID-19

[Although this post concerns Artificial Intelligence, it is in the Neuroscience: Predictive Mind strand of this blogsite (see drop-down tabs) because of its context]

Recently, Karl Friston et al released the paper ‘Dynamic causal modelling of COVID-19’ which applies the variational Bayesian methods they normally apply to neuroscience to the coronavirus pandemic, using the country-by-country coronavirus data collected by Johns Hopkins University. See also the video here.

This was reported on in the Guardian newspaper, with a subsequent UnHerd LockdownTV interview:

Friston is a member of the ‘Independent SAGE’ committee – an alternative to the UK government’s  ‘Scientific Advisory Group for Emergencies’ (SAGE) committee, making open recommendations to the government: report here (PDF) with a video of the press conference.

The modelling in the paper uses the SPM (Statistical Parametric Mapping) software originally created by Friston for brain imaging analysis (of fMRIPETSPECTEEG and MEG) and this was subsequently made freely available at https://www.fil.ion.ucl.ac.uk/spm  (the Functional Imaging Laboratory at the Wellcome Centre for Human Neuroimaging, University College London). It is a toolbox for MATLAB.

The COVID19 modelling is now included within SPM as one of the many demos. Anyone can run this modelling, recreating the figures presented in the paper.

The demo is within the spm12\toolbox\DEM (Dynamical Expectation Maximization) directory. DEM is an extension of the Expectation-Maximization algorithm (referring back to previous posts, both K-means and Gaussian Mixture Models Machine Learning methods are based around the Expectation-Maximization algorithm). See ‘DEM: A variational treatment of dynamic systems’, for more information on DEM algorithms. The directory contains many other example demos concerning Variational Bayesian methods and the Free Energy Principle.

Here I:

• Describe how to re-create the results of the COVID-19 and other demos.
• Provide summaries of many of the demos, just extracted from banner comments of the demo scripts.
• Show many of the figures produced from the demos.

It provides a reference overview. It:

• showcases SPM for dynamical modelling (i.e. beyond its original purpose of being used for analysing brain imaging data).
• includes the birdsong, mountain car problem and occulomotor examples described in so many of Friston’s Free Energy Principle papers.
• shows the wealth of additional biological scenarios that the Free Energy Principle has been applied to.

But it doesn’t capture the animations and sounds that running the actual scripts does.

## Installation Options

Since SPM is a toolbox for MATLAB, we need to be able to run MATLAB scripts (programs). There are three options here:

1. Using Mathworks’ MATLAB with the SPM12 library. This is the preferred choice but Matlab will require a licence so that may preclude it,
2. Octave with the SPM12 library. Octave is an open source equivalent to MATLAB. See SPM/Octave wikibook page for more info.
3. Standalone SPM, which is built using MATLAB Compiler which does need a MATLAB licence (see limitations). See SPM/Standalone wikibook page for more info.

Unfortunately:

• Octave doesn’t completely work with SPM yet. But access to the Matlab scripts gives you a starting point to debug.
• I haven’t got the Standalone SPM working yet (watch this space for when my Matlab licence expires). Additionally, the COVID19 demo is not yet included in the Standalone deliverable.

The Latest version is SPM12. Access to older versions of SPM may be useful e.g. SPM8. I could only find the source for some fundamental script algorithms elsewhere e.g. the fundamental DEM script spm_ADEM is on neurodebian’s github.

## Installing Standalone SPM

Watch this space! (But I think you will need a Matlab licence to create the executable. Perhaps there is an executable to download from somewhere.)

# Glossary

For the descriptions of the non-COVID demo examples, here is a glossary on much of the biological terminology and some of the more advanced math/statistics. Normal terminology within the Free Energy Principle (e.g. Markov blanket, Kullback-leibler divergence) is not included.

• active inference: use of an internal generative model to predict incoming sensory data and improving the fit between this model and data by acting on the world which then changes the sensory data such that it is more consistent with the model.
• adiabatic: transfers energy to the surroundings only as work, not as heat.
• affordance: action possibilities of the environment
• allostatic: the process of maintaining homeostasis through the adaptive change of the organism’s internal environment
• attractor: see Lorenz attractor
• autovitiate: self-damage
• eigenmodes, eigenvalues, etc: in linear algebra and geometry, concerning transformations
• EKF: Extended Kalman filter – a non-linear Kalman filter
• empirical Bayes: statistical inference in which the prior distribution is estimated from the data (unlike standard Bayesian methods).
• entrainment: the processof making something have the same pattern or rhythm as something else
• epistemic foraging: searching for information
• ergodic: having the same statistical behavior averaged over time as over the system’sentire possible state space
• Event-related potential (ERP): a brain signal response to sensory input.
• exogenous: outside of the body of an organism
• ethological: concerning the study of animal behaviour
• F statistics: Statistics concerning the F-distribution family of probability distributions that, depending on parameters, can look like a Normal distribution or a Poisson distribution.
• flexion: increased bending of a limb, in contrast to ‘extension’ being the straightening.
• Fokker Planck equation: in statistical mechanics, a differential equation describing the time evolution of a probability density function.
• Generalised Linear Model (GLM): in statistics, generalizing linear regression to cover non-Normal probability distributions.
• Gabor patch: a series of black and white bars at any orientation for visual experiments
• haemodynamic: concerning blood flow / circulation.
• heteroclinic orbit: a path in phase space which joins two different equilibrium points
• infomax principle: optimization principle for artificial neural networks that maximizes the average Shannon mutual information between input and output
• Jacobian: in vector calculus, a matrix of partial derivatives
• Kalman filter: calculates Bayesian joint probability distribution estimates of the true values from a noisy series of measurements over time.
• Lagrangian: in mathematical optimization, a function used to solve constrained minimization problems. Used in the calculus of variations.
• Laplace assumption: assuming a Gaussian distribution. assuming a Gaussian distribution
• Lorenz attractor: in dynamical systems, a point in state space that a system will tend to evolve towards
• Lotka-Volterra dynamics: (also known as the predator–prey equations): involving nonlinear differential equations that are frequently used to describe the dynamics of biological systems in which two species interact
• Lyapunov function: a function that may be used to prove the stability of an equilibrium of an ordinary differential equation.
• Markov Decision Process (MDP): a discrete-time stochastic (partly-random) control process for optimizing actions of an agent in an environment.
• Mismatched negativity (MMN): the component of a brain signal (ERP) to something atypical in an input sequence
• nosology: the study of the classification of diseases.
• occulomotor: motion of the eye
• occlusion: hiding in order to study how people use visionto anticipate the best course of action
• Ornstein-Uhlenbeck process: a stochastic Gaussian Markov process like a random walk in continuous time in which there is an attraction back towards a central location.
• parametric empirical Bayes: empirical Bayes in which the likelihood and prior are defined in simple parametric forms e.g. Normal distribution (mean and variance).
• parcellation scheme: defining distinct partitions in the brain.
• perseveration: pathological persistent repetition of word, gesture, or act.
• principle of minimum redundancy: Horace Barlow’s ‘efficient coding hypothesis’ that aims to minimize Information Theory ‘redundancy’ (‘wasted space’ in transmission).
• proprioceptive: of the awareness of the position of the various parts of the agent’s body .
• pursuit: (slower) smooth eye movement
• P300: a brain signal response (ERP) occurring not because of a stimulus but to a person’s reaction to it (decision).
• Restricted Maximum Likelihood (ReML): in statistics, a form of maximum likelihood estimation in which the likelihood function is calculated from a transformed set of data to produce unbiased estimates of variance.
• salience: attentional mechanism, for items to stand out
• signal detection theory: ability to differentiate between stimulus and noise.
• SSE: sum of squared error
• syrinx: the vocal organ of birds
• Wiener process: Brownian motion

# List of Demos

Below are descriptions of many of the DEM demos (‘DEM’ as in ‘Dynamic Expectation Maximization’, not as in ‘DEMo’) in SPM12 and a capture of much of the raphical output from them. The list below is the order they are presented with an attempt made to group these logically. To run each demo, just type the demo name since the demo script is .m file e.g.:

>> DEM_COVID

The demos are…

COVID-19 Simulations:

• DEM_COVID

• ADEM_occlusion (Slow pursuit and occlusion under active inference)
• ADEM_pursuit (Slow pursuit under active inference)
• DEM_demo_Gabor (generative model predicts Gabor patch moving in visual field)

Motor:

• ADEM_writing (prediction errors to movement trajectories)

The Mountain Car Problem:

• ADEM_learning (Value learning using the mountain car problem)

Bird Song:

• DEM_demo_song_inference (Perceptual categorisation of bird songs)
• DEM_demo_song_omission (deconvolving dynamic changes in its control parameters)
• DEM_demo_song_priors (simulating local field potential)
• DEM_demo_SOC (self organised criticality)
• DEM_demo_MMN (simulate chirps and mismatch negativity)

The Free Energy Principle:

• FEP_fluctuations (the Bayesian perspective)
• FEP_Manifold (emergence of life in terms of active inference)
• FEP_MB_demo (hierarchical decomposition of Markov blankets)
• FEP_physics (different perspectives: quantum, statistical and classical mechanics)
• FEP_self_entropy (self organisation in terms of the entropy of blanket states)
• DEM_morphogenesis (self-assembly under active inference)
• DEMO_Lindley_paradox (Bayes factors and classical p-values)

Markov Decision Processes:

• DEM_MDP_decision (agent moves from undecided state to definitive choice)
• DEMO_MDP_questions (2-level simulation of reading with epistemic foraging)
• DEMO_MDP_voice (simulating reading for deep temporal generative models)
• DEMO_MDP_maze (continuous and discrete state space modelling)
• MDP_DEM_Mixed_Models_Movement (examination of the biceps reflex)

Dynamic Causal Modelling:

• DEMO_DCM_MB (Markov blankets)
• DEMO_DCM_PEB (group DCM for electrophysiology)
• DEMO_BMR_PEB (Empirical Bayes and Bayesian model reduction)

Prognosis and Prediction:

• DEM_demo_ontology (generative model for computational nosology)

Miscellaneous / Basic:

• KLDemo (Illustration of information gains with Bayesian fusion)
• ADEM_lorenz (learning the parameters of a Lorenz attractor)
• DEMO_BAYES_FACTORS (Bayes factors and classical p-values)
• DEM_FEP_Least_Action
• DEM_self_entropy (self organisation in terms of entropy of blanket states)
• DEM_path_integrals (approximations to path integrals – Fokker Planck)
• DEM_demo_GF_and_KF (generalised and Kalman filtering)
• DEM_demo_OU (linear deconvolution – not actually Ornstein-Uhlenbeck process!)
• DEM_demo_GLM (comparing DEM and ReML under a general linear model)
• DFP_demo_double_well (comparing Variational filtering with particle filtering)

# COVID-19 Simulations

## DEM_COVID

The simulation requires COVID-19 data from Johns Hopkins University CSSE (Center for Systems Science and Engineering). Go to https://github.com/CSSEGISandData/COVID-19. If you have a github account, you can go to ‘clone or download’ to download the whole database as COVID-19-master.zip. But you only actually need 3 files within https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series:

• csv
• csv
• csv

Setup an account if you don’t have one! Or if you don’t want to or don’t want to download the whole database, go to the specific files and right-click on ‘raw’ and ‘save-as’ (use .csv file extension). Simply (and dirtily) save the files in the SPM12/toolbox/DEM directory.

DEM_COVID generates all of the figures in the COVID-19 paper (with minor differences) Figures 3, 4, 5 and 6 Figures 7, 8, 9 and 10 Figures 11, 12, 13 and an additional figure

# Vision: Occlusion, Pursuit and Saccades

## ADEM_occlusion (Slow pursuit and occlusion under active inference)

This demo illustrates slow pursuit in the context of visual occlusion. We start with a simulation of canonical slow pursuit of a visual target with sine wave motion. Crucially, the generative model is equipped with a simple empirical prior encoding the hidden motion of the target (using a linear oscillator, whose frequency is determined by a hidden cause).  We then examine failures of tracking and anticipation during occlusion and when the target re-emerges from behind the occluder. We look at a simulation in which the precision of the oscillator dynamics modelling long-term behaviour of the target is reduced (cf., neuromodulatory deficits in cortical areas encoding biological motion). This has an effect of producing a failure of pursuit, resulting in a catch-up saccade on target reappearance. The suppression of prior precision can however have beneficial effects when motion is itself unpredicted (as shown with differential pursuit performance under a reversal of the trajectory towards the end of motion). Finally, we look at how prior beliefs are acquired during exposure to the target – in terms of cumulative inference on the hidden causes encoding the frequency of periodic motion. This can be regarded as a high order form of evidence accumulation. Importantly, this (experience-dependent) inference is markedly impaired by the simulated lesion to precision above. In other words, a single failure of inference in modelling the motion of hidden states can have secondary consequences – such as a failure to even register and remember regularities. All these simulations are based upon active inference; with the prior belief that the centre of gaze is attracted to the same point responsible for target motion.  This demo illustrates oculomotor following and slow pursuit. The focus here is on oculomotor delays and their compensation in generalised coordinates of motion. This is illustrates using a ‘sweep’ of motion to examine the effects of motor delays, sensory delays and their interaction under active inference. We then move on to oculomotor following of sine wave motion, in which the trajectory entrains following under compensated dynamics. This entrainment can be destroyed by rectifying the sine wave which calls for a more realistic (hierarchical) model of motion that registers its phase and anticipates the next onset of motion (cf movement behind an occluder). These simulations depend delicately on delays and precisions (gains) that are chosen to make oculomotor following under uncertainty relatively slow. The dependency on visual uncertainty (contrast) is illustrated by changing the precision of the generalised motion of the sensory target.  ## ADEM_pursuit (Slow pursuit under active inference)

This demo illustrates slow pursuit eye movements under active inference.  Its focus is on frames of references and the entrainment of gaze-direction by the motion of a visual target. The generative process (and model) is based upon the itinerant trajectory of a target (in Cartesian coordinates) produced by Lotka-Volterra dynamics. The agent expects its sampling (in polar coordinates) to be centred on the target. Here, the agent is equipped with a model of the trajectory and the oculomotor plant. This means it represents both the location of the target and the mapping from target location (in relation to a fixation point) to egocentric polar coordinates. We simulate behavioural (saccadic) and electrophysiological (ERP) responses to expected and unexpected changes in the direction of a target moving on the unit circle. The agent expects the target to reverse its direction during the trajectory but when this reversal is omitted (and the target persists in a clockwise direction) violation responses are emitted. This demo illustrates exploration or visual search in terms of optimality principles based on straightforward ergodic or allostatic principles. In other words, to maintain the constancy of our external milieu, it is sufficient to expose ourselves to predicted and predictable stimuli. Being able to predict what is currently seen also enables us to predict fictive sensations that we will experience from another viewpoint. This provides a principled way in which to explore and sample the world for example with visual searches using saccadic eye movements. These theoretical considerations are remarkably consistent with a number of compelling heuristics; most notably the Infomax principle or the principle of minimum redundancy, signal detection theory and recent formulations of salience in terms of Bayesian surprise. The example here uses saliency (the posterior precision associated with fictive sampling of sensory data) to simulate saccadic eye movements under active inference. DEM demo for active inference (i.e. action-perception optimisation of free energy).  This simulation calls on spm ADEM to simulate visual sampling of the world and demonstrate retinal stabilisation or visual tracking. We simulate a simple 2-D plaid stimulus and subject it to an exogenous perturbations. By employing tight and broad priors on the location of the stimulus, we can show that action does and does not explain away the visual consequences of the perturbation (i.e., the movement is seen or not).  This illustrates how one can reframe stabilisation or tracking in terms of sampling sensory input to ensure conditional expectations are met; and how these expectations are shaped by prior expectations. ## DEM_demo_Gabor (generative model predicts Gabor patch moving in visual field)

State-space demo routine simulating position invariant representations in the visual system.  The generative model predicts a one-dimensional Gabor patch that moves in a (one-dimensional) visual field. The inversion of this dynamic model can be viewed as deconvolving spatial and category attributes from a moving stimulus (or selective re-sampling of the input) to recover the stimulus that can be represented. The prediction shown in the lower panels had position information removed. # Motor

This demo illustrates how action can fulfil prior expectations by explaining away sensory prediction errors prescribed by desired movement trajectory. It is based on the same linear convolution model of the motor plant considered in the visual tracking example. Here, we induce prediction errors; not through exogenous perturbation to sensory input but through tight priors encoding a desired or expected trajectory. We then show how the movement is robust to changes in the true motor dynamics and other exogenous perturbations, late in movement execution This demo illustrates how action can fulfil prior expectations by explaining away sensory prediction errors prescribed by desired movement trajectories. In this example a two-joint arm is trained to touch a target so that spontaneous reaching occurs after training. ## ADEM_writing (prediction errors to movement trajectories)

This demo illustrates how action can fulfill prior expectations by explaining away sensory prediction errors prescribed by desired movement trajectories. In this example a two-joint arm follows a stable heteroclinic channel, prescribed by a set of fixed-point attractors. The locations of the successive (autovitiated) attractors are defined by parameters. The ensuing trajectories are illustrated here in terms of synthetic writing. This demo illustrates cued sequential movements. It uses active inference under a hierarchical generative model of sequential cues and consequent movements. The agent has a (second level) model of (two) contingencies or contexts; these correspond to the sequential appearance of targets in a clockwise direction. The other context has no sequential aspect. The first level model is contextually modulated to produce the appropriate sequence of (location – specific) affordances, which predict both visual and proprioceptive consequences. This is sufficient to engender cued reaching movements, which are slightly anticipatory if the agent infers the correct probabilistic context. However, if we reverse the order of the stimuli there is an accuracy and reaction time cost, due to the fact that the sequence is unpredictable.  Furthermore, there is a set switching cost as the hidden states at the second (contextual) level are inferred. This provides a simple but very rich model of cued reaching movements and set switching that is consistent with notions of salience and affordance. Furthermore, we can simulate Parkinsonism by reducing the precision of affordance – based cues. These are the visual attributes that confer saliency on the current target. Reducing this precision (for example, dopamine) delays and can even preclude set switching, with associated costs in pointing accuracy. By completely removing the precision of the salience or affordance cues, we obtain autonomous behaviour that is prescribed by the itinerant expectations of the agent. This can be regarded as perseveration in a pathological setting or the emission of autonomous behaviour in the absence of any precise sensory information  This demonstration routine simulates the Posner paradigm to show that some of the characteristic speed-accuracy trade-offs associated with valid and invalid cueing can be explained easily in terms of optimizing precisions during hierarchical inference. This demonstration uses generalised filtering and state-space model that includes state-dependent noise. Here, this dependency is used to set the attentional gain or bias using a cue, which modulates the prediction errors induced by subsequent targets. The phenomena that emerge from this scheme include a competition for attentional resources; given that only one context can exist at any time and this probabilistic context is encoded by state-dependent precisions on the causes of sensory input. Attended stimuli have greater precision and greater penetration of their prediction errors in the hierarchy. We will also see characteristic differences between perceptual inference, under valid and invalid cues. This is illustrated using simulated psychophysical and electrophysiological responses. Biased competition is simulated by presenting both valid and invalid targets simultaneously. # The Mountain Car Problem

The Mountain Car Problem is a classic test example in reinforcement learning. It must be learnt that the car must go backwards uphill to build up potential energy before accelerating forwards in order to get up the steep slope.

## ADEM_learning (Value learning using the mountain car problem)

Value learning demo using the mountain car problem. This demo questions the need for reinforcement learning and related paradigms from machine-learning, when trying to optimise the behaviour of an agent.  We show that it is fairly simple to teach an agent complicated and adaptive behaviours under the free energy principle.  This principle suggests that agents adjust their internal states and sampling of the environment to minimize their free energy.  In this context, free energy represents a bound on the probability of being in a particular state, given the nature of the agent, or more specifically the model of the environment an agent entails.  We show that such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. The result is a behavioural policy that reproduces exactly the policies that are optimised by reinforcement learning and dynamic programming. Critically, at no point do we need to invoke the notion of reward, value or utility. This demo re-visits the mountain car problem to show that adaptive (desired) behaviour can be prescribed in terms of loss-functions (i.e reward functions of state-space). It exploits the fact that under the free energy formulation, loss is divergence. This means that priors can be used to make certain parts of state-space costly (i.e. with high divergence) and others rewarding (low divergence). Active inference under these priors will lead to sampling of low-cost states and (apparent) attractiveness of those states. This demo re-visits the mountain car problem to show that adaptive (desired) behaviour can be prescribed in terms of loss-functions (i.e. reward functions of state-space). It exploits the fact that under the free energy formulation, loss is divergence. This means that priors can be used to make certain parts of state-space costly (i.e. with high divergence) and others rewarding (low divergence). Active inference under these priors will lead to sampling of low-cost states and (apparent) attractiveness of those states. # Bird Song

## DEM_demo_song_inference (Perceptual categorisation of bird songs)

Perceptual categorisation of bird songs: The generative model of birdsong used in this simulation comprises a Lorenz attractor with two control parameters (or hidden causes), which, in turn, delivers two control parameters to a synthetic syrinx to produce ‘chirps’ that are modulated in amplitude and frequency.  The chirps were then presented as a stimulus to a synthetic bird to see if it could infer the underlying causal states and thereby categorise the song. This entails minimising free energy by changing the internal representation of the control parameters. Each simulated song comprises a series of chirps whose frequency and number fall progressively from song a to song c, as a causal state (known as the Raleigh number) is decreased.  The simulations show that the causes are identified after about 600 milliseconds with high conditional precision (90% confidence intervals are shown in grey). These simulations illustrate the nature of perceptual categorisation under generalised predictive coding: Here, recognition corresponds to mapping from a continuously changing and chaotic sensory input to a fixed point in perceptual space.

The various bird songs can be played by right clicking on the sonogram images, after the routine has completed. ## DEM_demo_song_omission (deconvolving dynamic changes in its control parameters)

Demo for bird song: In this example, we show that DEM can not only estimate the hidden states of an autonomous system but can also deconvolve dynamic changes in its control parameters.  We illustrate this using a slow Lorenz attractor to drive a fast one; both showing deterministic chaos.  We endow the simulations with a little ethological validity by using the states of the fast Lorenz attractor as control variables in the syrinx of a song bird (usually these would control a van der Pol oscillator model). We will look at the true and inferred songs with and without the last chirps missing.  The sonograms displayed can be played by a mouse click on the image.  Subsequent plots show simulated event-related potential to show that there is a marked responses (prediction error) of the system when an expected ‘syllable’ is omitted. This demonstrates the implicit sequence-decoding of input streams, using generative models based upon attractors. Having simulated normal omission-related responses, we then reduce the precision at the second level (on both hidden causes and states) and repeat the simulation. The result is an attenuation of the omission- related response or mismatch negativity. If we try to compensate by reducing the sensory precision, then the autonomous dynamics predicting the sequence of chirps supervenes, producing false inference. This can be thought of as a – crude – model of hallucinosis. ## DEM_demo_song_priors (simulating local field potential)

Demo for bird song: In this example, we simulate local field potential using the prediction error from the song-bird example below. We look at these responses under natural stimuli and after removing the second level of the hierarchy to show it is necessary for veridical perception. We then repeat but omitting dynamical priors by forsaking generalised coordinates ## DEM_demo_SOC (self organised criticality)

Demo for bird song: this routine illustrates self organised criticality in terms of stimulus induced bifurcations and weak synchronisation of recognition (neuronal) dynamics. It uses the birdsong example, where stimuli are generated using a Lorenz attractor and modelled with the same attractor, with state dependent parameters. These control parameters are categorised in terms of a softmax function of point attractors in a (two-dimensional) perceptual space. We examine the self organised criticality in terms of Lyapunov exponents and the free energy – as a function of precision of the motion of hidden states. ## DEM_demo_MMN (simulate chirps and mismatch negativity)

This Demo uses the linear convolution model of previous examples to simulate chirps (c.f., the bird song demos). By presenting a train of chirps and changing the stimulus after a couple of presentations, we can simulate a roving oddball paradigm used in ERP research. Critically, we hope to see a more exuberant response to the first presentation of a novel chirp (oddball) relative to the same stimulus after learning (standard).  The simulation shows that although veridical percepts obtain from variational de-convolution, the prediction error continues to fall with repetition (as the parameters are optimised). This repetition suppression subtends a mismatch response that has many of the characteristics of the well-known mismatch negativity (MMN).

# The Free Energy Principle

## FEP_fluctuations (the Bayesian perspective)

This demonstration uses an ensemble of particles with intrinsic (Lorenz attractor) dynamics and (Newtonian) short-range coupling.  The focus of this routine is to unpack the Bayesian perspective. We first simulate dynamics to nonequilibrium steady-state, identify the Markov blanket and then examine the encoding of external states by internal states; in terms of their expected values.

The crucial aspect of this implicit inference (and the basis of the free energy principle) is the existence of a conditional synchronisation manifold, when conditioning internal and external states on the Markov blanket. This provides the basis for a mapping between internal and external states that can be interpreted in terms of a probabilistic representation or inference.

This Bayesian perspective is illustrated in terms of a mapping between the canonical modes of internal and external states (as approximated with a polynomial expansion). The canonical modes here are evaluated using an estimate of the conditional expectations based upon the Euclidean proximity of Markov blanket states. The ensuing posterior over external states is then illustrated, in relation to the actual external states. We also simulate event related potentials by identifying several points in time when the Markov blankets revisit the same neighbourhood. Finally, to illustrate the underlying dynamics, the Jacobians or coupling among internal and external states are presented; using different orders of coupling (i.e., degrees of separation) ## FEP_Manifold (emergence of life in terms of active inference)

This demonstration routine simulates the emergence of life – as defined in terms of active inference – using a synthetic primordial soup. The key aspect of this dynamics is that there is a separation between dynamical states and structural states; where the dynamical states of the microsystem are equipped with a Lorenz attractor and the structural states correspond to position and velocity. The flow of structural states conforms to classical Newtonian mechanics. Crucially, the physical motion of each microsystem is coupled to its internal dynamics and vice versa; where the coupling among dynamics rests upon short range (electrochemical) forces. This means that the dependencies among the dynamics of each microsystem dependent on their positions. This induces a dependency of the systems structural integrity on its internal dynamics – which leads to biological self-organisation. This biological self-organisation is illustrated in terms of the following:

1. i) the existence of a Markov blanket that separates internal and external states, where the internal states are associated with a system that engages in active or embodied inference.
2. ii) emergent inference is demonstrated by showing that the internal states can predict the external states, despite their separation by the Markov blanket.

iii) this inference (encoded by the internal dynamics) is necessary to maintain structural integrity, as illustrated by simulated lesion experiments, in which the influence of various states are quenched. ## FEP_MB_demo (hierarchical decomposition of Markov blankets)

This routine illustrates a hierarchical decomposition of Markov blankets (of Markov blankets). It rests upon the dual operators of finding a partition (a Markov partition) and then using an adiabatic dimensional reduction (using the eigensolution of the Markov blanket). In brief, this means the states of particles at the next level become mixtures of the Markov blanket of particles at the level below.

The ensuing hierarchical decomposition is illustrated in terms of Jacobians and locations in a scaling space (evaluated using the graph Laplacian). This demonstration uses a fictive Jacobian that is created by hand – or the equivalent Jacobian of a synthetic soup (i.e., active matter) ## FEP_physics (different perspectives: quantum, statistical and classical mechanics)

This demonstration uses an ensemble of particles with intrinsic (Lorenz attractor) dynamics and (Newtonian) short-range coupling. the setup is used to solve for dynamics among an ensemble of particles; from which a Markov blanket emerges (which forms a particle at the next hierarchical scale. These ensemble dynamics are then used to illustrate different perspectives; namely, those afforded by quantum, statistical and classical mechanics. A detailed description of each of these three treatments precedes each of the sections in the script. these descriptions are in the form of a figure legend, where each section is summarised with a figure.  ## FEP_self_entropy (self organisation in terms of the entropy of blanket states)

This demonstration uses an ensemble of particles with intrinsic (Lorenz attractor) dynamics and (Newtonian) short-range coupling.  This routine illustrates self-organisation in terms of the entropy of blanket states (and concomitant changes in terms of mutual information (i.e., complexity cost or risk). Here, the ensemble average of these entropy measures is taken over all (128) particles of macromolecules; where the Markov blanket of each particle comprises all but the third (electrochemical) hidden state. The graphics produced by this routine simply plot the decrease in blanket entropy (and complexity cost) as the system approaches its random dynamical attractor. Illustrative trajectories of the particles are provided at three points during the (stochastic) chaotic transient.

## DEM_morphogenesis (self-assembly under active inference)

This routine illustrates self-assembly or morphogenesis under active inference (free energy minimisation).  It exploits the fact that one can express a systems (marginal) Lyapunov function in terms of a variational free energy.  This means that one can prescribe an attracting set in terms of the generative model that defines variational free energy.  In this example, the attracting set is a point attractor in the phase space of a multi-celled organism: where the states correspond to the location and (chemotactic) signal expression of 16 cells.  The generative model and process are remarkably simple; however, the ensuing migration and differentiation of the 16 cells illustrates self-assembly – in the sense that each cell starts of in the same location and releasing the same signals.  In essence, the systems dynamics rest upon each cell inferring its unique identity (in relation to all others) and behaving in accord with those inferences; in other words, inferring its place in the assembly and behaving accordingly.  Note that in this example there are no hidden states and everything is expressed in terms of hidden causes (because the attracting set is a point attractor) Graphics are produced illustrating the morphogenesis using colour codes to indicate the cell type – that is interpreted in terms of genetic and epigenetic processing. ## DEMO_Lindley_paradox (Bayes factors and classical p-values)

This demonstration routine uses a simple linear model to examine the relationship between free energy differences of log Bayes factors and classical F statistics.

[error in script]

# Markov Decision Processes

## DEM_MDP_decision (agent moves from undecided state to definitive choice)

This routine illustrates the treatment of signal detection paradigms in the context of active inference and Markov decision processes. This is probably one of the simplest paradigms to model; in which there are just two hidden states generating ambiguous stimuli – and the agent moves from an undecided (hidden) state to a definitive choice. The A tensor in this instance encodes ambiguity (perceptual noise), while the B matrix encodes the behaviour-dependent transitions among decision states. Finally, the C matrix encodes prior costs or preferences. In this instance, the agent does not want to be wrong – and prefers to be right.

In what follows, we simulate a single trial to illustrate the underlying Bayesian belief updating and associated behavioural and physiological responses. We then consider multiple trials under different levels of ambiguity and cost. The dependent measures in this instance include the behavioural accuracy, reaction times (assuming 250 ms time bins) and the uncertainty about the cause of sensory cues and control – as measured by the entropy of posterior beliefs prior to making a choice. ## DEMO_MDP_questions (2-level simulation of reading with epistemic foraging)

This routine provide simulations of reading to demonstrate deep temporal generative models. It builds upon the scene construction simulations to equip the generative model with a second hierarchical level. In effect, this creates an agent that can accumulate evidence at the second level based upon epistemic foraging at the first. In brief, the agent has to categorise a sentence or narrative into one of two categories (happy or sad), where it entertains six possible sentences. Each sentence comprises four words, which are themselves constituted by two pictures or graphemes These are the same visual outcomes used in previous illustrations of scene construction and saccadic searches.

Here, the agent has policies at two levels. The second level policy (with just one step into the future) allows it to either look at the next word or stay on the current page and make a decision. Concurrently, a first level policy entails one of four saccadic eye movements to each quadrant of the current page, where it will sample a particular grapheme.

This provides a rough simulation of reading – that can be made more realistic by terminating first level active inference, when there can be no further increase in expected free energy (i.e., all uncertainty about the current word has been resolved). The subsequent inferred hidden states then become the outcome for the level above.

To illustrate the schemes biological plausibility, one can change the agent’s prior beliefs and repeat the reading sequence under violations of either local (whether the graphemes are flipped vertically) or globally (whether the sentence is surprising) expectations. This produces a mismatch negativity (MMN) under local violations and a MMN with a P300 with global violations.  ## DEMO_MDP_voice (simulating reading for deep temporal generative models)

This routine provides simulations of reading to demonstrate deep temporal generative models. It builds upon the scene construction simulations to equip the generative model with a second hierarchical level. In effect, this creates an agent that can accumulate evidence at the second level based upon epistemic foraging at the first. In brief, the agent has to categorise a sentence or narrative into one of two categories (happy or sad), where it entertains six possible sentences. Each sentence comprises four words, which are themselves constituted by two pictures or graphemes These are the same visual outcomes used in previous illustrations of scene construction and saccadic searches.

Here, the agent has policies at two levels. The second level policy (with just one step into the future) allows it to either look at the next word or stay on the current page and make a decision. Concurrently, a first level policy entails one of four saccadic eye movements to each quadrant of the current page, where it will sample a particular grapheme.

This provides a rough simulation of reading – that can be made more realistic by terminating first level active inference, when there can be no further increase in expected free energy (i.e., all uncertainty about the current word has been resolved). The subsequent inferred hidden states then become the outcome for the level above.

To illustrate the schemes biological plausibility, one can change the agent’s prior beliefs and repeat the reading sequence under violations of either local (whether the graphemes are flipped vertically) or globally (whether the sentence is surprising) expectations. This produces a mismatch negativity (MMN) under local violations) and a MMN with a P300 with global violations.

[requires the Matlab Signal Processing Toolbox]

## DEMO_MDP_maze (continuous and discrete state space modelling)

This demonstration of active inference focuses on navigation and planning in a fairly complicated maze. The idea is to demonstrate how epistemic foraging and goal (target) directed behaviour are integrated in the minimisation of expected free energy. In this illustration, and 8 x 8 maze is learned through novelty driven evidence accumulation – to learn the likelihood mapping between hidden states (locations in the maze) and outcomes (whether the current location is open or closed). This accumulated experience is then used to plan a path from a start to an end (target location) under a task set specified by prior preferences over locations. These priors are based upon a simple diffusion (CF backwards induction) heuristic that specifies subgoals. The subgoals (i.e., locations) contain the most paths from the target within the horizon of the current policy. We will first illustrate the novelty driven epistemic foraging that efficiently scans the maze to learn its structure. We then simulate planning of (shortest path) trajectory to the target under the assumption the maze has been previously learned. Finally, we consider exploration under prior preferences to simulate behaviour when both epistemic and goal directed imperatives are in play. The focus on this demo is on behavioural and electrophysiological responses over moves. A key aspect of this formulation is the hierarchical decomposition of goal directed behaviour into subgoals that are within the horizon of a limited policy – here, to moves that correspond to a trial. The prior preferences then contextualise each policy or trial to ensure that the ultimate goal is achieved. Empirically, this sort of construction suggests the existence of Path cells; namely, cells who report the initial location of any subsequence and continue firing until the end of the local path. This is illustrated by plotting simulated activity as a function of trajectories during exploration.   ## MDP_DEM_Mixed_Models_Movement (examination of the biceps reflex)

This demo illustrates a series of computational pathologies as elicited during a synthetic neurological examination. This focuses upon an examination of the biceps reflex, and a simple coordination task. Each of these are simulated for an arm that may move in three dimensions through internal and external rotation of the shoulder, flexion and extension of the shoulder, and flexion and extension of the elbow. These dynamics play out through an active Bayesian filtering scheme, where a series of attracting points draw the hand to different locations in 3D space. The selection of these attracting points involves a hierarchical Markov Decision Process, which identifies these sequences based upon the prior belief that (1) the sequence will minimise expected free energy and (2) the sequence is consistent with the trajectory predicted by the highest (contextual) level of the model.  # Dynamic Causal Modelling

## DEMO_DCM_MB (Markov blankets)

This demo uses the notion of Markov blankets and the renormalisation group to evaluate the coupling among neuronal systems at increasing spatial scales. The underlying generative model is based upon the renormalisation group: a working definition of renormalization involves three elements: vectors of random variables, a course-graining operation and a requirement that the operation does not change the functional form of the Lagrangian. In our case, the random variables are neuronal states; the course graining operation corresponds to the grouping (G) into a particular partition and adiabatic reduction (R) – that leaves the functional form of the dynamics unchanged.

Here, the grouping operator (G) is based upon coupling among states as measured by the Jacobian. In brief, the sparsity structure of the Jacobian is used to recursively identify Markov blankets around internal states to create a partition of states – at any level – into particles; where each particle comprises external and blanket states. The ensuing reduction operator (R) eliminates the internal states and retains the slow eigenmodes of the blanket states. These then constitute the (vector) states at the next level and the process begins again.

This routine starts using a simple form of dynamic causal modelling applied to the principal eigenvariate of local parcels (i.e., particles) of voxels with compact support. The Jacobian is estimated using a linearised dynamic causal (state space) model, where observations are generated by applying a (e.g., haemodynamic) convolution operator to hidden (e.g., neuronal) states. This estimation uses parametric empirical Bayes (PEB: spm_PEB). The ensuing estimates of the Jacobian (i.e., effective connectivity) are reduced using Bayesian model reduction (BMR: spm_dcm_BMR_all) within a bespoke routine (spm_dcm_J).

The Jacobian is then partitioned using the course graining operator into particles or parcels (using spm_markov blanket). The resulting partition is then reduced by eliminating internal states and retaining slow eigenmodes with the largest (real) eigenvalues (spm_A_reduce). The Jacobian of the reduced states is then used to repeat the process – recording the locations of recursively coarse-grained particles – until there is a single particle.

The result of this recursive decomposition (i.e., renormalisation) affords a characterisation of directed coupling, as characterised by a complex Jacobian; namely, a multivariate coupling matrix, describing the coupling between eigenmodes of Markov blankets at successive scales. This can be regarded as a recursive parcellation scheme based upon effective connectivity and a generative (dynamic causal) model of multivariate (neuronal) timeseries.

## DEMO_DCM_PEB (group DCM for electrophysiology)

This routine illustrates the use of Bayesian model reduction when inverting hierarchical (dynamical) models; for example, multisubject DCM models. In this context, we have hierarchical models that are formally similar to parametric empirical Bayesian models – with the exception that the model of the first level can be nonlinear and dynamic. In brief, this routine shows how to finesse the brittleness of Bayesian model comparison at the single subject level by equipping the model with an extra (between subject) level. It illustrates the recovery of group effects on modulatory changes in effective connectivity (in the mismatch negativity paradigm) – based upon real data.

First, an EEG DCM (using empirical grand mean data) is inverted to find plausible group mean parameters. Single subject data are then generated using typical within and between subject variance (here, group differences in the modulation of intrinsic connectivity. We then illustrate a variety of Bayesian model averaging and reduction procedures to recover the underlying group effects.

## DEMO_BMR_PEB (Empirical Bayes and Bayesian model reduction)

This routine illustrates the use of Bayesian model reduction when inverting hierarchical (linear) models – it is essentially a software validation demo and proof of concept. It uses a parametric empirical Bayesian model (i.e., nested linear models) to eschew local minima issues and to assure the Laplace assumption is correct. In brief, the data are generated for multiple subjects, under a linear model with subject specific parameters at the first level and group specific parameters at the second. These model a group effect common to all subjects in a subset of parameters and differences in a further subset. The objective of empirical Bayesian inversion is to recover the group effects in terms of posteriors and perform Bayesian model comparison at the second (between subject) level.

This provides empirical shrinkage priors at the first level, which can be used to compute the predictive posterior for any subject. In turn, the predictive posterior can be used for leave-one-out cross validation.

The key aspect of this approach to empirical Bayesian modelling is that we use Bayesian model reduction throughout. In other words, after the subject-specific models have been inverted the data are discarded and we deal only with the free energies and posteriors for subsequent hierarchical analysis. This can be computationally very efficient when dealing with large first-level or complicated models: as in DCM.

The parameterisation of the models uses the format of DCM. This means parameters are specified as a structure with key parameters being in the fields A, B and C.

# Prognosis and Prediction

## DEM_demo_ontology (generative model for computational nosology)

This demonstration routine illustrates how a generative model can be used to furnish a computational nosology. In brief, it generates symptoms and diagnostic profiles from hidden or latent exogenous causes (e.g., therapeutic interventions) that are mediated by latent (pathophysiological and psychopathological) states.  Pathophysiological trajectories are modelled with a Lorenz attractor that (with a linear mapping) produces (two-dimensional) psychopathology. In turn, the psychopathological states generate symptoms (with a non-linear function of linear mixtures) and diagnostic outcomes (with a softmax function of diagnostic potential). The psychopathological state of a subject is associated with a diagnostic potential in terms of its Euclidean distance from disease categories (locations in the associated state space).

We start by simulating a relapsing-remitting disease process and then infer the latent states and parameters of a particular subject. This is then repeated in the setting of a therapeutic intervention. The demonstration then briefly considers model identification and selection by focusing on the mapping between pathophysiology and psychopathology. Finally, we consider, prognosis and prediction by estimating subject-specific parameters prior to therapy and then predicting putative response in the future, based upon a posterior predictive density.  # Miscellaneous / Basic

## KLDemo (Illustration of information gains with Bayesian fusion)

This routine illustrates the benefit of multimodal or Bayesian fusion in terms of conditional dependencies among parameters. In other words, it shows that even if one data modality contains no information about a particular set of parameters, it can help resolve uncertainty about another set and thereby disclose information contained in the other modality. This is illustrated here using a simple linear model with neuronal and haemodynamic parameters to show that EEG can provide some information gain, in relation to haemodynamic parameters. ## ADEM_lorenz (learning the parameters of a Lorenz attractor)

Action-DEM demo specifying an attractor (in terms of the parameters of desired equations of motion) This demo first exposes an agent to a Lorenz attractor world such that it can learn the three parameters of the Lorenz system. It is then placed in a test world with a fixed-point attractor to see if it has remembered the chaotic dynamics it learnt in the training environment ## DEMO_BAYES_FACTORS (Bayes factors and classical p-values)

This demonstration routine uses a simple linear model to examine the relationship between free energy differences or log Bayes factors and classical F statistics. Using re-randomisation of a design matrix, it computes the null distribution over both statistics and plots them against each other.  There is a linear relationship, which allows one to evaluate the false-positive rate for any threshold on the Bayes factor. Ideally, one would like to see a positive log Bayes factor map to a classical threshold of p=0.05. The offset and slope of the linear relationship between the two statistics depends upon prior beliefs about the covariance of the parameters and the log precision. These can be changed by editing the code below (or supplying input arguments).

[error in script]

## DEM_FEP_Least_Action

This routine uses a Lorenz system to show that the most likely autonomous path (or deterministic path) is uniquely identified by its initial and final states. In other words, if we knew the end state of an autonomous trajectory, then we would implicitly know the path taken from the initial particular state, even if we did not know the external states. This point is demonstrated using numerical analyses of the Lorenz system; treating the first state and an active state and the third as an external state.

## DEM_self_entropy (self organisation in terms of entropy of blanket states)

Routine to produce graphics illustrating self organisation in terms of the entropy of blanket states and associated trajectories. A low blanket entropy induces anomalous diffusion and itinerancy with power law scaling (i.e., self similar dynamics). This example uses a fixed form (quadratic) likelihood and optimises the density over hidden states to minimise blanket (i.e., self) entropy explicitly. In this example, there is just one active and sensory state and one hidden state to illustrate noise-phase symmetry breaking as the probability density over action reduces the entropy of sensory states under a fixed density of hidden or external states. ## DEM_path_integrals (approximations to path integrals – Fokker Planck)

Illustration of approximations to path integrals. This routine generates a path from dynamics whose Fokker Planck solution corresponds to a Gaussian with a given (diagonal) precision. It then samples random segments (after scaling and smoothing) and evaluates their action. This evaluation is in terms of the sum of squares residuals between realised and predicted flow and path dependent and path-independent terms based upon the surprisal associated with the solution of the Fokker Planck equation. The point being made here is that the terms based upon the surprisal (cyan dots) upper bound the action (blue dots).

## DEM_demo_GF_and_KF (generalised and Kalman filtering)

A demonstration of generalised and Kalman filtering where the number of hidden states exceeds the number of variables observed. The metrics of performance are the mean sum of squared error and the SSE normalized by the posterior precision (NESS). The results of a single time series analysis are shown first and then the simulations are repeated under linear and nonlinear observation models to compare the relative performance of DEM and EKF. The superiority of DEM (generalised filtering) over Kalman filtering rests on the optimisation of K – the rate of generalised descent on free energy.

## DEM_demo_OU (linear deconvolution – not actually Ornstein-Uhlenbeck process!)

DEM demo for linear deconvolution:  This demo considers the deconvolution of one of the simplest dynamical process; a random walk or Ornstein-Uhlenbeck process.  It shows how DEM can infer on the causes as stochastic innovations (c.f., Bayesian filtering) by exploiting temporal correlations.  Strictly speaking this is not an Ornstein-Uhlenbeck process because the innovations are themselves correlated and would normally be a Wiener process

## DEM_demo_GLM (comparing DEM and ReML under a general linear model)

Demo comparing DEM and ReML (restricted maximum likelihood) under a simple general linear model (GLM).  Slight differences in the hyperpriors of both schemes make this an interesting exercise.  Note that ReML uses a covariance hyper-parameterisation; whereas DEM uses precision hyperparameters.  This demo uses a non-hierarchical GLM and switches the roles of parameters and causes to illustrate their equivalence under a DEM inversion.

## DFP_demo_double_well (comparing Variational filtering with particle filtering)

Demo comparing Variational filtering with particle filtering in the context of a bimodal conditional density.  This demonstrates that the variational filter can not only represent free-form densities on the states but also the causes of responses. ## Worth investigating further

To investigate in detail:

• ADEM_salience and DEM_demo_SOC: both directly provide graphs of free energy.
• ADEM_learning: alternative to reinforcement learning

And also these:

• ADEM_motor and ADEM_reaching: motion as a result of explaining away prediction errors
• ADEM_cued_response: example of pathological condition
• FEP_physics: Markov blankets
• DEM_morphogenesis: graph of free energy
• DEM_MDP_maze: complex exploratory learning

## Forgetful Gaussian Mixture Models

[Although this post concerns Artificial Intelligence, it is in the Neuroscience: Predictive Mind strand of this blogsite (see drop-down tabs) because of its context]

My recent post ‘Bayesian Origins of Neural Nets’ post concerned ‘A tutorial on the free-energy framework for modelling perception and learning’ by Rafal Bogacz (Prof of Computational Neuroscience, Oxford) that went from a single neuron performing Bayesian inference on a (Gaussian) noisy input source to multiple layers of multiple neurons. It characterized a collection of data samples with just a very small number of parameters (2 in the case of a single Gaussian input: a mean and a variance). But this is basically what Gaussian Mixture Models (GMMs) are all about. I commented:

“An exercise for the future: It would be nice to compare what we have here with online implementations of Gaussian Mixture Models, where data is fitted to one of a number of Gaussian curves, with the means and variances of those Gaussians being adjusted on-the-fly.”

Here is that ‘exercise for the future’. A stepping stone to get to here was the previous ‘Sequential K-Means’ post, which represents each of K Gaussians with a just single value: a mean (hence the name ‘K-means’). But it laid out the basic foundations for now:

1. The Python code for generating inputs and plotting outputs.
2. The concept of ‘sequential’ (or ‘online’) versus ‘batch’ learning.
3. The concept of hyperparameters.
4. The concept of the iterative expectation-maximization algorithm

And this ‘Forgetful Gaussian Mixture Models’ post here is a stepping stone towards a simulatable understanding of Karl Friston’s information-theoretic ‘Variational Free Energy’ theory of the brain, with the necessary vocabulary being introduced on the journey. Here:

1. models are simulated that infer values from noisy Gaussian inputs, providing an indication of the precision of that inference.
2. And the Kullback-Leibler Divergence make its appearance.

## The Gaussian Mixture Model

A great visualisation of Gaussian Mixture Models can be found at ‘How to code Gaussian Mixture Models from scratch in Python’ by Thalles Silva, also on morioh.com which links to the Gaussian_Mixture_Models.ipynb Jupyter notebook environment within Google Colab. (Google Colab is an easy online way of running Python code generally and can be used for the Python code below.)

Like K-Means, previously introduced, The Gaussian Mixture Model is an Expectation-Maximization algorithm. But when choosing a cluster for a particular data sample, instead of just selecting the cluster based the sample’s distance to the cluster’s centroid (mean value), it is based on the probability of belonging to any cluster, by using the Normal (Gaussian) function with both a mean value and a variance value associated with each cluster.

Thus in the example below, a sample at x=2 is more strongly associated with cluster 1, a distance of 3 away, than with cluster 2, a distance of only 2 away, because cluster 1 is more spread out (variance σ2=4; standard deviation σ=2) than cluster 2 (variance σ2=1; standard deviation σ=1).

```from numpy import sqrt, square, pi, exp
def normal_pdf(data, mean, variance):
return 1/(sqrt(2*pi*variance)) * exp(-(square(data - mean)/(2*variance)))

Pcluster1 = normal_pdf(2, 5, square(2))
Pcluster2 = normal_pdf(2, 0, square(1))
print("# cluster 1: P(x=2 | μ=5, σ=2) = %6.3f" % (Pcluster1))
print("# cluster 2: P(x=2 | μ=0, σ=1) = %6.3f" % (Pcluster2))
print("# P(cluster=1) = %6.3f" % (Pcluster1 / (Pcluster1+Pcluster2)))
print("# P(cluster=2) = %6.3f" % (Pcluster2 / (Pcluster1+Pcluster2)))
# Generates:
# cluster 1: P(x=2 | μ=5, σ=2) =  0.065
# cluster 2: P(x=2 | μ=0, σ=1) =  0.054
# P(cluster=1) =  0.545
# P(cluster=2) =  0.455
```

The probabilities have been normalized to produce posterior probabilities: the sample has to belong to one of the clusters i.e. P(cluster=1) + P(cluster=2)=1.

With GMM, there is also a weighting value associated with each cluster and, in the maximization phase of the EM algorithm, weight, mean and variance are all updated on the basis of the normalized probabilities. This process is iterated until the weight, mean and variance values for each cluster converge on settled values.

```# Derived from Thalles Silva code
eps=1e-8 # Small value to prevent division by zero
def batch_gmm(num_iterations):
for t in range(1, num_iterations+1):
# Expectation step: calculate the likelihood of each observation xi
likelihood = []
for k in range(K):
likelihood.append(normal_pdf(sample, mean[k], variance[k]))
likelihood = np.array(likelihood)

# Maximization step
b = []
for k in range(K):
# Use the current values for the parameters to evaluate the posterior
# probabilities of the data to have been generated by each Gaussian
b.append((likelihood[k] * weight[k]) / (np.sum([likelihood[i] * weight[i] for i in range(K)], axis=0)+eps))

# update mean, variance and weight
mean[k]     = np.sum(b[k] * sample) / (np.sum(b[k]+eps))
variance[k] = np.sum(b[k] * square(sample - mean[k])) / (np.sum(b[k]+eps))
weight[k]   = np.mean(b[k])
```

Every cluster is updated, based on the degree of membership of the data sample to that clusters (in K-means, only the closest cluster was affected).

If we set the Python code at the bottom of this post to select…

```K = 3 # Select 2 or 3
if K==3: # Main 3-cluster example
dist_set1 = [ # Gaussians clearly separable (relatively small variances)
{'mean': -8, 'variance': 1, 'weight': 0.2},
{'mean':  0, 'variance': 1, 'weight': 0.2},
{'mean':  8, 'variance': 1, 'weight': 0.6}]
```

Running the code produces the following graph…

… in which:

• Green dots are inputs from ‘sample=1’ to ‘sample=400’ in accordance with the input distribution, dist_set1.
• For each green dot plotted, there are blue ‘Y’ brackets which indicate μ+2σ and μ-2σ: the region that is 2 standard deviations either side of the mean, within which 95% of all the samples will fall, within the particular input Gaussian.
• Note that there are many more green dots in the top (average=8) Gaussian – because of the 60:20:20 weighting.
• The red ‘-‘ marks indicate the location of the centroids (means) of the 3 clusters, from initialization (11, 3 and -11) and iteration=1 through to iteration=400.
• The orange ‘-‘ marks indicate the location of the μ+2σ and μ-2σ limits from the mean and variance values learnt.

The means and variances quickly converge to stable values – in 10 iterations – with the orange marks converging to match the ‘Y’ brackets quite well (it would be better for a larger sample set). The converged values below match those defined in dist_set1 quite well.

```# Iteration  1 mvw:(-8.179, 0.873, 0.185)( 2.098,10.964, 0.324)( 8.287, 0.547, 0.491)
# Iteration  2 mvw:(-8.236, 0.726, 0.180)( 1.635,11.645, 0.301)( 8.157, 0.623, 0.518)
# Iteration  3 mvw:(-8.251, 0.700, 0.178)( 1.149,10.562, 0.281)( 8.100, 0.702, 0.541)
# Iteration  4 mvw:(-8.255, 0.695, 0.178)( 0.678, 8.305, 0.261)( 8.066, 0.774, 0.561)
# Iteration  5 mvw:(-8.250, 0.701, 0.179)( 0.247, 5.293, 0.242)( 8.037, 0.838, 0.579)
# Iteration  6 mvw:(-8.235, 0.722, 0.181)(-0.056, 2.638, 0.227)( 8.006, 0.888, 0.592)
# Iteration  7 mvw:(-8.220, 0.748, 0.183)(-0.167, 1.562, 0.220)( 7.985, 0.930, 0.597)
# Iteration  8 mvw:(-8.197, 0.817, 0.184)(-0.183, 1.220, 0.216)( 7.973, 0.966, 0.600)
# Iteration  9 mvw:(-8.181, 0.866, 0.185)(-0.171, 1.091, 0.215)( 7.971, 0.973, 0.600)
# Iteration 10 mvw:(-8.179, 0.871, 0.185)(-0.169, 1.077, 0.215)( 7.971, 0.974, 0.600)
…
```

## Online Gaussian Mixture Models

The batch algorithm requires all input samples to be stored. In contrast, we are interested in online algorithms as previously. Things get more complicated. “A view of the EM Algorithm that Justifies Incremental Sparse and other Variants”, (Radford Neal and Geoffrey Hinton, University of Toronto, 1998) covers Online Expectation-Maximization algorithms but not specifically GMMs. “Online Bayesian Learning in Probabilistic Graphical Models using Moment Matching with Applications” (Farheen Omar’s DPhil thesis, University of Waterloo, 2016) includes moment matching. The concepts are merged in “Online and Distributed learning of Gaussian Mixture Models by Bayesian Moment Matching” (Priyank Jaini and Pascal Poupart, University of Waterloo, 2016) which:

• Controls the way that weights, means and variances are updated by using moment matching. (Recall school physics: moments are to do with balancing weights, F = ∫M(x) dx. In statistics, the mean is the first moment E[μ] = ∫ x.P(x) dx and variance is the second moment, E2] = ∫ x2.P(x) dx).
• Means and variances are updated using the assumption that the parameters fit with a Normal-Wishart distribution. This reduces to a Normal-Gamma distribution when only considering one input dimension; The Normal-Gamma distribution NG(x, τ | μ, λ, α, β) uses the Gamma function, Γ(x), which is related to the factorial function: n! = Γ(n+1). The distribution reduces to the Normal distribution N(x | μ, λ) when α=β=τ=1.
• Weights are updated using the assumption that the parameters fit with a Dirichlet distribution. (which concerns a discrete number of items; the number of clusters is always an integer). The Dirichlet distribution uses the Beta function B(x) which in turn uses the Gamma function.

The main problem is in calculating expectation values E[.] of the various moments. I will hopefully be addressing that in a future post.

## Precision

Just above, I referred to the Normal distribution as  N(x | μ, λ) using λ (lambda) rather than σ2 or σ. Lambda is the reciprocal of variance (λ = 1/σ2) and is known as ‘precision’. A large variance implies that the precise value (estimated by the mean) is either not known (the variance has not converged yet) or not knowable (the input is very noisy). A small variance implies that we know the value within small bounds. Whether this it correct or, conceivably, incorrect, we think we have high precision).

In Variational Free Energy, precision plays an important role: beliefs are precision-weighted. An argument for the theory’s practical value is the proposition that most psychopathologies are a result of errors in precision, rather than errors of value. To put this into a Gaussian model context, this is about errors associated with variance rather than the mean.

## A Forgetful Gaussian Mixture Model

Instead of getting into this complexity, I will do what was done in the previous ‘Sequential K-Means’ post which introduced a ‘Forgetful’ algorithm that was able to adapt to a changing input distribution. A parameter, a, set the learning rate. For example a=0.1 meant that the new mean value was based 10% on the input sample x[t] and 90% on the old mean value. We can conceptually imagine this as always having 10 samples and when updating:

1. We remove one of the samples, to leave 9 samples. The sample removed had a value equal to the old mean value (what else could we set it to?).
2. We add the new sample, x[t], to bring the number of samples up to 10 again. The mean value gets updated.

Applying the same thinking to Gaussian Mixture Models leads to update rules as follows…

```def online_update_mean_and_variance(new_data, weight, mean, variance, hyperparameter_a):
# 1. Convert from mean and variance to sums, and
# 2. Effectively remove 1 sample from the existing distribution (keeping the mean and variance the same)…
rolling_samples = 1/hyperparameter_a
num_items      = rolling_samples - weight
sum_x          = mean * num_items
sum_x_squared  = num_items * (square(mean) + variance)
# 3. Add the new item (taking the number of samples back to the original
num_items      = rolling_samples
sum_x         += weight * new_data
sum_x_squared += weight * square(new_data)
# 4. Convert back to mean and variance...
new_mean       = sum_x/num_items
new_variance   = sum_x_squared/num_items - square(new_mean)
# 'rolling_samples' controls how quickly old values are forgotten.
# e.g rolling_samples=10 implies new mean is the weighted sum of 90% old + 10% new.
return new_mean, new_variance

def online_update_weights(previous_weights, winner, hyperparameter_a):
new_weights = []
# e.g. a=0.1 (rolling_samples=10)
# 1. Reduce the membership from rolling_samples to rolling_samples-1 members
#    All the weights are multiplied by 0.9=1-a
for k in range(K):
new_weights.append( (1-hyperparameter_a) * previous_weights[k] )
# 2. Add the winner to the membership
new_weights[winner] += hyperparameter_a
# The weights still sum to 1
return new_weights
```

In this, only the most-likely cluster is updated (unlike batch GMM above which does a weighted update of all clusters). The find_closest_cluster() function from previously becomes:

```def find_closest_cluster(x, t, metric='distance'):
if metric == 'distance': # Find MINIMUM distance
closest_cluster, closest_value = 999999, 999999 # initially invalid
for k in range(K): # no. clusters
distance = (x - mean[k])**2 # Euclidean distance
if distance  closest_value:
closest_cluster  = k
closest_value    = likelihood
return closest_cluster
```

Additionally, mean and variance are only updated to the degree that the new sample is a member of that most-likely cluster. Thus, if we are conceptually keeping the last 10 samples of that cluster (a=0.1), updating when the likelihood of the new sample being a member is 0.4, then we only take away 0.4 samples of the previous ‘10’ and add 0.4 new samples.

Running the Forgetful GMM with a drifting input distribution …

```K=3
if K==3: # Main 3-cluster example
dist_set1 = [ # Gaussians clearly separable (relatively small variances)
{'mean': -8, 'variance': 1, 'weight': 0.2},
{'mean':  0, 'variance': 1, 'weight': 0.2},
{'mean':  8, 'variance': 1, 'weight': 0.6}]
dist_set2 = [ # Gaussians spread and merging
{'mean': -7, 'variance': 3, 'weight': 0.2},
{'mean': -2, 'variance': 2, 'weight': 0.6},
{'mean':  6, 'variance': 1, 'weight': 0.2}]
test_runs = [ "Forgetful GMM (a=0.1): drifting distribution" ]
```

produces the input distribution graph showing the drift…

and the scatter plot output of the simulation:

Scatter plot for ‘Forgetful’ online Gaussian Mixture Model

This time, sample points are colored according to which cluster it gets allocated to: ‘lime green (k=0), cyan (k=1) or ‘rebeccapurple’ (k=2). With k=0, convergence is slow because there are relatively few samples (compared with cluster 2); the good convergence of cluster 1 is fortunate.

The output (inferred) distributions are shown graphically below. The heights of the distributions are changing in time because the variances are changing and the weighting between clusters is shifting. Sum of 3 weighted Gaussian distributions output from online Gaussian Mixture Model, as it evolves over time

The transcript reports the inferred Gaussians for each iteration:

```Python Forgetful iterations [[[
# Iteration  1 pmvw:( 0.000,-11.000, 2.000, 0.306)( 0.000, 3.000, 2.000, 0.297)( 0.079*,10.925, 2.097, 0.397)KL=0.0020 x= 8.742
# Iteration  2 pmvw:( 0.000,-11.000, 2.000, 0.275)( 0.000, 3.000, 2.000, 0.267)( 0.175*,10.871, 2.086, 0.457)KL=0.0007 x= 9.548
# Iteration  3 pmvw:( 0.000,-11.000, 2.000, 0.248)( 0.004, 3.000, 2.000, 0.241)( 0.010*,10.701, 2.596, 0.512)KL=0.0198 x= 7.147
# Iteration  4 pmvw:( 0.000,-11.000, 2.000, 0.223)( 0.000, 3.000, 2.000, 0.217)( 0.222*,10.662, 2.490, 0.560)KL=0.0007 x= 9.955
# Iteration  5 pmvw:( 0.000,-11.000, 2.000, 0.201)( 0.000, 3.000, 2.000, 0.195)( 0.169*,10.583, 2.456, 0.604)KL=0.0013 x= 9.246
# Iteration  6 pmvw:( 0.000,-11.000, 2.000, 0.181)( 0.002, 3.000, 2.000, 0.175)( 0.035*,10.395, 2.858, 0.644)KL=0.0133 x= 7.469
…
# Iteration 122 pmvw:( 0.028*,-10.064, 3.414, 0.154)( 0.000, 0.813, 3.960, 0.470)( 0.000, 7.807, 1.166, 0.376)KL=0.0002 x=-6.369
# Iteration 123 pmvw:( 0.000,-10.064, 3.414, 0.139)( 0.127*, 0.723, 3.937, 0.523)( 0.000, 7.807, 1.166, 0.338)KL=0.0010 x=-1.096
# Iteration 124 pmvw:( 0.000,-10.064, 3.414, 0.125)( 0.057*, 0.888, 4.221, 0.571)( 0.000, 7.807, 1.166, 0.304)KL=0.0047 x= 3.869
# Iteration 125 pmvw:( 0.000,-10.064, 3.414, 0.112)( 0.161*, 0.817, 4.064, 0.614)( 0.000, 7.807, 1.166, 0.274)KL=0.0010 x=-0.360
# Iteration 126 pmvw:( 0.117*,-10.041, 3.422, 0.201)( 0.000, 0.817, 4.064, 0.552)( 0.000, 7.807, 1.166, 0.247)KL=0.0001 x=-8.015
…
# Iteration 149 pmvw:( 0.000,-9.896, 3.645, 0.164)( 0.000, 0.045, 3.683, 0.483)( 0.366*, 7.694, 1.046, 0.353)KL=0.0002 x= 7.374
# Iteration 150 pmvw:( 0.074*,-9.851, 3.707, 0.248)( 0.000, 0.045, 3.683, 0.435)( 0.000, 7.694, 1.046, 0.317)KL=0.0004 x=-7.143
# Iteration 151 pmvw:( 0.000,-9.851, 3.707, 0.223)( 0.069*, 0.168, 3.859, 0.492)( 0.000, 7.694, 1.046, 0.286)KL=0.0026 x= 2.886
# Iteration 152 pmvw:( 0.000,-9.851, 3.707, 0.201)( 0.000, 0.168, 3.859, 0.442)( 0.360*, 7.706, 1.021, 0.357)KL=0.0002 x= 8.102
…
# Iteration 396 pmvw:( 0.005,-8.773, 3.265, 0.080)( 0.136*,-2.196, 2.363, 0.771)( 0.000, 6.641, 0.886, 0.149)KL=0.0037 x=-3.807
# Iteration 397 pmvw:( 0.000,-8.773, 3.265, 0.072)( 0.251*,-2.166, 2.191, 0.794)( 0.000, 6.641, 0.886, 0.134)KL=0.0016 x=-1.810
# Iteration 398 pmvw:( 0.000,-8.773, 3.265, 0.064)( 0.000,-2.166, 2.191, 0.715)( 0.406*, 6.645, 0.875, 0.221)KL=0.0000 x= 6.920
# Iteration 399 pmvw:( 0.000,-8.773, 3.265, 0.058)( 0.257*,-2.134, 2.048, 0.743)( 0.000, 6.645, 0.875, 0.199)KL=0.0014 x=-1.713
# Iteration 400 pmvw:( 0.154*,-8.764, 3.259, 0.152)( 0.000,-2.134, 2.048, 0.669)( 0.000, 6.645, 0.875, 0.179)KL=0.0000 x=-7.242
]]]
```

Each line reports the PDF N(μ, σ2), mean, variance and weight (hence ‘pmvw’) of each cluster in turn k=0, k=1 and k=2. The asterisk identifies the winning cluster. Additionally, the ‘Kullback-Leibler divergence’ and the input sample value are given.

For example, at iteration 124 there is a sample (values 3.869) that can be easily be picked out on the scatter plot, lying between clusters 1 (cyan) and 2 (purple). It has been allocated to cluster 1 (colored cyan). It can be seen:

• from the mean and variance traces (red and orange) that it is closest to cluster 1 (middle), yet
• from the input distributions (blue ‘Y’s) that it probably originated from cluster 2 (top, purple).

## The Kullback-Leibler Divergence

Note that the KL value for sample 124 is high compared with other iterations. Iteration 151 also has a high value and it also stands out on the scatter plot (x=2.886). This Kullback-Leibler Divergence value is a measure of the relative entropy between the prior and posterior (before and after) values of cluster means and variances (it should really include the effect of weight but doesn’t). That is, it represents the change in distributions which indicates the amount of information that has been learnt as a result of the additional sample. When a new sample falls close to the mean, very little is learnt. Samples 124 and 151 were relatively surprising and caused a big shift in the mean and variance parameters.

The KL-divergence between two univariate Gaussians, as we are interested in here, is calculated as follows:

```def kullback_leibler_divergence(m1, v1, m2, v2): # KL(P||Q)
# m1, v1: mean and variance of the posterior P
# m2, v2: mean and variance of the prior Q
sd1, sd2 = np.sqrt(v1), np.sqrt(v2) # standard deviations
return (v1-v2+(m1-m2)**2)/v2/2 + np.log(sd2/sd1)
```

The Kullback-Leibler divergence is always positive because it is a difference. Also note: KL(P || Q) KL(Q || P).

## Crossing Clusters

If we set the Python code to use only 2 clusters in which a small variance distribution crosses over a much wider one…

```K=2
if K==2: # 2-cluster cross-over example
dist_set1 = [
{'mean': 0,  'variance': 0.1,  'weight': 0.5},
{'mean': -1, 'variance': 4,    'weight': 0.5}]
dist_set2 = [
{'mean': -1, 'variance': 0.1,  'weight': 0.5},
{'mean': 0,  'variance': 4,    'weight': 0.5}]
num_samples = 1000
test_runs = [ "Forgetful GMM (a=0.1): drifting distribution" ]
```

… the algorithm tracks small-variance Cluster zero (green samples) as it rides straight through the wider variance cluster 1 (cyan), as shown in the scatter plot below… Scatter plot for an online Gaussian Mixture Model in which 2 input distributions cross one another

The output distribution shows this… Sum of 2 weighted Gaussian distributions output from online Gaussian Mixture Model, as it evolves over time. The input distribution comprises 2 distributions which cross one another

## The Full Python Code

```
import matplotlib.pyplot as plt
import numpy as np
from numpy import sqrt, square, pi, exp, zeros, linspace
from copy  import deepcopy

##############################################################
# Input: samples from a sum of Gaussian distributions
##############################################################

K = 3 # Select 2 or 3

if K==3: # Main 3-cluster example
dist_set1 = [ # Gaussians clearly separable (relatively small variances)
{'mean': -8, 'variance': 1, 'weight': 0.2},
{'mean':  0, 'variance': 1, 'weight': 0.2},
{'mean':  8, 'variance': 1, 'weight': 0.6}]
dist_set2 = [ # Gaussians spread and merging
{'mean': -7, 'variance': 3, 'weight': 0.2},
{'mean': -2, 'variance': 2, 'weight': 0.6},
{'mean':  6, 'variance': 1, 'weight': 0.2}]
# Compared with dist_set1, dist_set2 has:
# * the means closer together,
# * larger variances and
# * weighting drifting
num_samples = 400

if K==2: # 2-cluster cross-over example
# A very small variance distribution crosses over a much wider one
dist_set1 = [
{'mean': 2,  'variance': 0.1,  'weight': 0.5},
{'mean': -3, 'variance': 4,    'weight': 0.5}]
dist_set2 = [
{'mean': -3, 'variance': 0.1,  'weight': 0.5},
{'mean': 2,  'variance': 4,    'weight': 0.5}]
num_samples = 1000

# Note: The weights must add up to 1.0
K = len(dist_set1) # Number of input and K-means clusters
num_iterations = num_samples

# Input means, variances and weights will shift over time. Use these vectors for this...
m_vec, v_vec, w_vec = zeros((K, num_samples)), zeros((K, num_samples)), zeros((K, num_samples))
# Samples will be drawn from a distribution that gradually shifts from dist1 to dist2.
def generate_distribution_vectors(num_samples, dist1, dist2):
# Parallel interpolation using linspace...
for i in range(K):
m_vec[i]  = linspace(dist1[i]['mean'],     dist2[i]['mean'],     num_samples)
v_vec[i]  = linspace(dist1[i]['variance'], dist2[i]['variance'], num_samples)
w_vec[i]  = linspace(dist1[i]['weight'],   dist2[i]['weight'],   num_samples)

def normal_pdf(data, mean, variance):
return 1/(sqrt(2*pi*variance)) * exp(-(square(data - mean)/(2*variance)))

def online_update_mean_and_variance(new_data, weight, mean, variance, hyperparameter_a):
# 1. Convert from mean and variance to sums, and
# 2. Effectively remove 1 sample from the existing distribution...
#    (keeping the mean and variance the same)
rolling_samples = 1/hyperparameter_a
num_items      = rolling_samples - weight
sum_x          = mean * num_items
sum_x_squared  = num_items * (square(mean) + variance)
# 3. Add the new item (taking the number of samples back to the original
num_items      = rolling_samples
sum_x         += weight * new_data
sum_x_squared += weight * square(new_data)
# 4. Convert back to mean and variance...
new_mean       = sum_x/num_items
new_variance   = sum_x_squared/num_items - square(new_mean)
# 'rolling_samples' controls how quickly old values are forgotten.
# e.g rolling_samples=10 implies new mean is the weighted sum of 90% old + 10% new.
return new_mean, new_variance

def online_update_weights(previous_weights, winner, hyperparameter_a):
new_weights = []
# e.g. a=0.1 (rolling_samples=10)
# 1. Reduce the membership from rolling_samples to rolling_samples-1 members
#    All the weights are multiplied by 0.9=1-a
for k in range(K):
new_weights.append( (1-hyperparameter_a) * previous_weights[k] )
# 2. Add the winner to the membership
new_weights[winner] += hyperparameter_a
# The weights still sum to 1
return new_weights

##############################################################
# 3-D plotting of sum of Gaussian distributions
##############################################################

# To produce a sum of Gaussians distribution for every time step
def weighted_sum_of_gaussians(x, t):
result = 0
for k in range(K):
result += w_vec[k][t] * normal_pdf(x, m_vec[k][t], v_vec[k][t])
return result

if True:
print("# Plotting surface plot of input distribution")
x_max = 12 # x sample input is over range -xmax to +xmax
steps_per_unit_x = 10 # enough to get good surface plot
total_x_steps = (2 * x_max * steps_per_unit_x)

# Produce Z[t, x] matrix (over all x input values and all iterations)
# of sum-of-Gaussians
generate_distribution_vectors(num_samples, dist_set1, dist_set2)
x_range = np.arange(-x_max, x_max, 1/steps_per_unit_x)
t_range = np.arange(1, num_samples+1)
Z = zeros((num_samples, total_x_steps))
for x in range(total_x_steps):
for t in range(num_samples):
Z[t, x] = weighted_sum_of_gaussians(x_range[x], t)
X, T = np.meshgrid(x_range, t_range)

# Plot the surface.
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

fig = plt.figure(figsize=(10,6))
surf = ax.plot_surface(X, T, Z, rstride=1, cstride=1, cmap=cm.coolwarm, linewidth=0.25)

# Customize the z axis.
ax.set_zlim(0.0, 0.4)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)

ax.set_xlabel('x')
ax.set_ylabel('iterations')
ax.set_zlabel('PDF')

plt.show()

###############################
# Options
###############################

# devs: the variance bars are plotted this no. of std deviations away from mean
devs = 2 # (1 =&amp;amp;amp;amp;gt; 68%, 2 =&amp;amp;amp;amp;gt; 95%, 3 =&amp;amp;amp;amp;gt; 99.7%)

np.random.seed(0) # for repeatable pseudo-random results

# Uses m_vec, v_vec and w_vec vectors of mean, variance and weight.
def generate_samples(num_samples):
for t in range(num_samples):
source = np.random.choice(a=list(range(K)), size=1, p = list(w_vec[:, t]))
mean   =      m_vec[source, t]
stddev = sqrt(v_vec[source, t])
sample_history.append( np.random.normal(mean, stddev))
# Also generate vectors for plotting results
centr_history.append( mean )
lower_history.append( mean - stddev*devs )
upper_history.append( mean + stddev*devs )

# Output: Graphical plot
def scatter_plot(title, iterations):
plt.figure(figsize=(10,6))
axes = plt.gca()
plt.xlabel("samples / iterations")
plt.ylabel("x")
plt.title(title)
steps = list(range(len(centr_history)))
plt.scatter(steps, upper_history,   color='navy',  s=30, marker="1", label=("μ+%dσ" % (devs)))
plt.scatter(steps, lower_history,   color='navy',  s=30, marker="2", label=("μ-%dσ" % (devs)))
for k in range(K):
plt.scatter(sorted_sample_t[k], sorted_sample_x[k],  color=palette[6*k+7], s=30, marker=".", label=("k=%d % (k))"))
# '6*k+7': Plot the (max K=3) clusters in 'lime green', 'cyan' and 'rebeccapurple'
steps = list(range(iterations+1))
for k in range(K):
# Convert variance to 2 standard deviations...
plus2sd_history = mean_history + 2*sqrt(variance_history)
less2sd_history = mean_history - 2*sqrt(variance_history)
plt.scatter(steps,     mean_history[0:iterations+1,k], color='red',     s=30, marker="_", label=("μ[%d][t]"     %       (k)))
plt.scatter(steps,  plus2sd_history[0:iterations+1,k], color='orange',  s=30, marker="_", label=("μ+%dσ[%d][t]" % (devs, k)))
plt.scatter(steps,  less2sd_history[0:iterations+1,k], color='orange',  s=30, marker="_", label=("μ+%dσ[%d][t]" % (devs, k)))
plt.show()

# Plot multiple Gaussians using a spectrum of colors...
palette = ['red', 'orangered', 'darkorange', 'orange', 'gold', 'yellow', 'greenyellow', 'limegreen', 'green',
'mediumseagreen', 'mediumaquamarine', 'mediumturquoise', 'paleturquoise', 'cyan', 'deepskyblue', 'dodgerblue',
'royalblue', 'navy', 'slateblue', 'rebeccapurple', 'darkviolet', 'violet', 'magenta']

def plot_gaussians(iterations, plot_end_distribution=False):
bins = np.linspace(-15,15,300)

plt.figure(figsize=(7,5)) # (10,7) may be too large
plt.xlabel("\$x\$")
plt.ylabel("pdf")
plt.scatter(sample_history, [-0.005] * len(sample_history), color='navy', s=30, marker=2, label="Sample data")

for t in range(iterations+1): # early timesteps
for k in range(K):
plt.plot(bins, weight_history[t,k]*normal_pdf(bins, mean_history[t,k], variance_history[t,k]), color=palette[(t%8)+9])
for k in range(K): # Initial (or constant) input distribution
plt.plot(bins, dist_set1[k]['weight']*normal_pdf(bins, dist_set1[k]['mean'], dist_set1[k]['variance']), color='red')
if plot_end_distribution:
for k in range(K): # Final input distribution
plt.plot(bins, dist_set2[k]['weight']*normal_pdf(bins, dist_set2[k]['mean'], dist_set1[k]['variance']), color='magenta')

plt.legend()
plt.plot()
plt.show()

# In order to plot samples allocated to different clusters in different colors...
sorted_sample_x = [ [], [], [] ]
sorted_sample_t = [ [], [], [] ]

def find_closest_cluster(x, t, metric='distance'):
if metric == 'distance': # Find MINIMUM distance
closest_cluster, closest_value = 999999, 999999 # initially invalid
for k in range(K): # no. clusters
distance = (x - mean_history[t-1,k])**2 # Euclidean distance
if distance  closest_value:
closest_cluster  = k
closest_value    = likelihood
# Record result for later plotting
sorted_sample_x[closest_cluster].append(x)
sorted_sample_t[closest_cluster].append(t)
return closest_cluster

def kullback_leibler_divergence(m1, v1, m2, v2): # KL(P||Q)
# m1, v1: mean and variance of the posterior P
# m2, v2: mean and variance of the prior Q
sd1, sd2 = np.sqrt(v1), np.sqrt(v2) # standard deviations
return (v1-v2+(m1-m2)**2)/v2/2 + np.log(sd2/sd1)

##################################################################
# Algorithms: K-means
##################################################################

def naive_k_means(num_iterations):
for t in range(1, num_iterations+1):
# Assignment phase: Assign each x_ix i to nearest cluster by calculating its distance to each centroid.
sums     = zeros((K)) # sum of x's for each cluster
counts   = zeros((K)) # count of number of samples in each cluster
for sample in sample_history:
allocated_cluster = find_closest_cluster(sample, mean_history[t-1])
counts[allocated_cluster] += 1
sums[allocated_cluster]   += sample
# Update phase: Find new cluster center by taking the average of the assigned points.
print("# Iteration %2d centroids:" % (t), end='')
for k in range(K):
mean_history[t,k] = sums[k]/counts[k]
variance_history[t,k] = 2 # !!!!!!!!!!!!! Temp !!!!!!!!
print(" %6.3f" % (mean_history[t,k]), end='')
print("")

def sequential_k_means(num_samples, forgetful=False, a=0):
sample_count = zeros((K))
for t in range(1, num_samples+1):
closest_k = find_closest_cluster(sample_history[t-1], t)
sample_count[closest_k] += 1
mean_history[t] = deepcopy(mean_history[t-1]) # carry over from previous ...
# ... but then adjust the *closest* centroid ...
if forgetful:
# c[t]=c[t-1]+a(x[t]-c[t-1]) = (1-a).c[t-1]+a.x[t]
mean_history[t,closest_k] += (sample_history[t-1] - mean_history[t-1,closest_k])*a
else: # normal sequential K-means...
mean_history[t,closest_k] += (sample_history[t-1] - mean_history[t-1,closest_k])/sample_count[closest_k]
print("# Iteration %2d centroids:" % (t), end='')
for k in range(K):
print(" %6.3f" % (mean_history[t,k]), end='')
print("")

##################################################################
# Algorithms: Gaussian Mixture Models
##################################################################

eps=1e-8 # Small value to prevent division by zero
def batch_gmm(num_iterations):
for t in range(1, num_iterations+1):
# Expectation step: calculate the maximum likelihood of each observation xi
likelihood = []
for k in range(K):
likelihood.append(normal_pdf(sample_history, mean_history[t-1,k], variance_history[t-1,k]))
likelihood = np.array(likelihood)

# Maximization step
b = []
for k in range(K):
# Use the current values for the parameters to evaluate the posterior
# probabilities of the data to have been generated by each Gaussian
b.append((likelihood[k] * weight_history[t-1,k]) / (np.sum([likelihood[i] * weight_history[t-1,i] for i in range(K)], axis=0)+eps))

# update mean, variance and weight
mean_history[t,k]     = np.sum(b[k] * sample_history) / (np.sum(b[k]+eps))
variance_history[t,k] = np.sum(b[k] * square(sample_history - mean_history[t,k])) / (np.sum(b[k]+eps))
weight_history[t,k]   = np.mean(b[k])

# Reporting...
print("# Iteration %2d mvw:" % (t), end='') # mvw = mean, variance, weight in that order
for k in range(K):
print("(%6.3f,%6.3f,%6.3f)" % (mean_history[t,k],variance_history[t,k],weight_history[t,k]), end='')
print("")
for x in range(len(sample_history)): # Not color-coding Batch samples
sorted_sample_x.append(sample_history[x])
sorted_sample_t.append(x)

def forgetful_online_gmm(num_samples, hyperparameter_a):
for t in range(1, num_samples+1):
# Expectation step: calculate the maximum likelihood of each observation xi
winner = find_closest_cluster(sample_history[t-1], t, metric='probability')

#likelihood = []
#for k in range(K):
#  likelihood.append(normal_pdf(sample_history[t-1], mean_history[t-1,k], variance_history[t-1,k]))
#likelihood = np.array(likelihood)

# Maximization step
#membership = [] # represents the degree of membership to each cluster
#for k in range(K):
#  # Use the current values for the parameters to evaluate the posterior
#  # probabilities of the data to have been generated by each Gaussian
#  membership.append((likelihood[k] * weight_history[t-1,k]) / (np.sum([likelihood[i] * weight_history[t-1,i] for i in range(K)], axis=0)+eps))

# copy from previously...
mean_history[t]     = deepcopy(mean_history[t-1])
variance_history[t] = deepcopy(variance_history[t-1])
weight_history[t]   = deepcopy(  weight_history[t-1])
# update mean, variance and weight of the winning cluster
mean_history[t, winner], variance_history[t, winner] = online_update_mean_and_variance(sample_history[t-1], weight_history[t-1, winner], mean_history[t-1, winner], variance_history[t-1, winner], hyperparameter_a)
weight_history[t] = online_update_weights(weight_history[t-1], winner, hyperparameter_a)

# Reporting normal PDF, mean, variance and weight of each cluster at each iteration...
print("# Iteration %2d pmvw:" % (t), end='') # pmvw = PDF, mean, variance, weight in that order
for k in range(K):
print("(%6.3f" % (normal_pdf(sample_history[t-1], mean_history[t-1,k], variance_history[t-1,k])), end='')
if k == winner:
print("*", end='')
print(",%6.3f,%6.3f,%6.3f)" % (mean_history[t,k],variance_history[t,k],weight_history[t,k]), end='')
# Reporting the Kullback-Leibler divergence and the input sample value...
print("KL=%.4f" % (kullback_leibler_divergence(mean_history[t,winner], variance_history[t,winner], mean_history[t-1,winner], variance_history[t-1,winner])), end='')
print(" x=%6.3f" % (sample_history[t-1]), end='')
print("")

##################################################################
# Run numerous combinations of distribution and algorithm
##################################################################

# Uncomment to include...
test_runs = [ #"Naive K-means: constant distribution",
#"Naive K-means: drifting distribution",
#"Sequential K-means: constant distribution",
#"Sequential K-means: drifting distribution",
#"Forgetful K-means (a=0.01): constant distribution",
#"Forgetful K-means (a=0.1): drifting distribution" ]
#"Batch GMM: constant distribution",
#"Batch GMM: drifting distribution" ,
#"Forgetful GMM (a=0.1): constant distribution",
#"Forgetful GMM (a=0.1): drifting distribution",
#"Forgetful GMM (a=0.05): drifting distribution",
"Forgetful GMM (a=0.1): drifting distribution" ]

for test_name in test_runs:
print("# Running %s #######################" % (test_name))
if "a=0.01" in test_name: hyperparameter_a = 0.01 # hyper-parameter for Forgetful Sequential K-means
if "a=0.05" in test_name: hyperparameter_a = 0.05
if "a=0.1"  in test_name: hyperparameter_a = 0.1
if "a=0.2"  in test_name: hyperparameter_a = 0.2
print("# Generate input sample values #######################")
centr_history, upper_history, lower_history, sample_history = [ ], [ ], [ ], [ ]
if "constant" in test_name: # input sequence from distribution dist_set1
generate_distribution_vectors(num_samples, dist_set1, dist_set1)
elif "drifting" in test_name: # input sequence drifting from distribution dist_set1 to dist_set2
generate_distribution_vectors(num_samples, dist_set1, dist_set2)
generate_samples(num_samples)

# This is the container for the results, to be able to plot the position of the centroids over time
mean_history     = zeros((1+num_iterations, K)) # for cluster k at iteration t
variance_history = zeros((1+num_iterations, K)) # for cluster k at iteration t
weight_history   = zeros((1+num_iterations, K)) # for cluster k at iteration t
# Size is 'num_iterations+1' rather than 'num_iterations' because centroids[0,] is the *initial* guess...
print("# Setting initial values of parameters #######################")
if K==2: # Start estimate very similar to actual start
mean_history, variance_history, weight_history = [2,-3], [0.1,2], [0.5,0.5] # Adds up to 1
elif K==3:
mean_history, variance_history, weight_history = [-11, 3, 11], [2,2,2], [0.34,0.33,0.33]

print("# Run algorithm #######################")
if "Naive K-means" in test_name: # standard (batch) K-means
naive_k_means(num_iterations) # Will actually converge in about 2 iterations
elif "Forgetful K-means" in test_name:
sequential_k_means(num_samples, forgetful=True, a=hyperparameter_a)
elif "Sequential K-means" in test_name:
sequential_k_means(num_samples)
elif "Batch GMM" in test_name:
batch_gmm(num_iterations) # Will actually converge in a few iterations
elif "Forgetful GMM" in test_name:
forgetful_online_gmm(num_samples, hyperparameter_a)

if True:
print("# Plot output scatter plot #######################")
if ("Naive" in test_name) or ("Batch" in test_name):
scatter_plot(test_name, num_iterations)
else:
scatter_plot(test_name, num_samples)

##############################################################
# 3-D plotting of output Gaussian distributions
##############################################################

def weighted_sum_of_output_gaussians(x, t):
result = 0
for k in range(K):
result += weight_history[t,k] * normal_pdf(x, mean_history[t,k], variance_history[t,k])
return result

if True:
print("# 3-D Plot of output Gaussians #######################")
x_max = 15 # x sample input is over range -xmax to +xmax
steps_per_unit_x = 10 # enough to get good surface plot
total_x_steps = (2 * x_max * steps_per_unit_x)

# Produce Z[t, x] matrix (over all x input values and all iterations)
# of sum-of-Gaussians
x_range = np.arange(-x_max, x_max, 1/steps_per_unit_x)
t_range = np.arange(1, num_samples+1)
Z = zeros((num_samples, total_x_steps))
for x in range(total_x_steps):
for t in range(num_samples):
Z[t, x] = weighted_sum_of_output_gaussians(x_range[x], t)
X, T = np.meshgrid(x_range, t_range)

# Plot the surface.
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

fig = plt.figure(figsize=(10,6))
surf = ax.plot_surface(X, T, Z, rstride=1, cstride=1, cmap=cm.coolwarm, linewidth=0.25)

# Customize the z axis.
ax.set_zlim(0.0, 1)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)

ax.set_xlabel('x')
ax.set_ylabel('iterations')
ax.set_zlabel('PDF')

plt.show()

```

## Sequential K-Means

[This post is on the ‘Artificial Intelligence’ strand of this blogsite]

## Introduction: Context and Purpose

My recent post ‘Bayesian Origins of Neural Nets’ post concerned ‘A tutorial on the free-energy framework for modelling perception and learning’ by Rafal Bogacz (Prof of Computational Neuroscience, Oxford) that went from a single neuron performing Bayesian inference on a (Gaussian) noisy input source to multiple layers of multiple neurons.

At the early stages, we are trying to characterize a collection of data samples with just two parameters (mean and variance). But this is basically what Gaussian Mixture Models (GMMs) are all about. I commented:

“An exercise for the future: It would be nice to compare what we have here with online implementations of Gaussian Mixture Models, where data is fitted to one of a number of Gaussian curves, with the means and variances of those Gaussians being adjusted on-the-fly.”

This ‘Sequential K-Means’ blog post here is a stepping stone towards that blog post on Online GMMs (coming up next). K-Means is a simpler form of GMM where the variances are all effectively fixed. The stepping stones are:

1. K-Means
2. Sequential K-means
3. GMM
4. Online GMM.

Here, I cover (1) and (2). The post lays the foundations for (3) and (4) in terms of:

1. Input source (Python code)
2. The concept of ‘sequential’ (or ‘online’) versus ‘batch’ learning.
3. The concept of hyperparameters.
4. The concept of the iterative expectation-maximization algorithm
5. Output presentation (plotting of results) (Python code)

In a previous talk, I referred to definition of intelligence being:

‘exhibiting adaptive behaviour, appropriately responsive to complex and changing circumstances’

The aim here is to provide some understanding of how an (unsupervised) system can be intelligent: to show that it can learn to adapt itself within a changing environment. This is part of the wider ‘Big Spiky Audacious Goal’.

## Unsupervised Learning

The K-means algorithms used here are a form of Unsupervised Learning. We will provide many inputs (input samples) and the algorithm will end up being able to associate any sample with one of K clusters. In the examples here, there are 3 clusters i.e. K=3. This is not Supervised Learning in which we train it by saying something effectively like ‘here are some input values x = 1.234, 2.345, 3.456 and 4.567 and they belong to classes ‘A’, ‘C’, ‘A’ and ‘B’ respectively’. It has to work this out for itself; ‘A’, ‘B’ and ‘C’ are arbitrary labels with no particular significance. There may be significance such as ‘A’=’sheep’, ‘B’=’goat’ and ‘C’=cow but it knows nothing about this. It just segregates the inputs into clusters as best it can, where items in a cluster have more in common with other items in the same cluster than items in other clusters.

## One Dimension

To best see how K-means works, it is normally presented with 2-dimensional examples, such as the animated gif on the Wikipedia entry. Victor Lavrenko’s Youtube video explanation is very good:

Here, I am only going to provide a one-dimensional example:

1. to show it in its simplest form,
2. to be able to show how thing change over time, and (most importantly)
3. to use this as a stepping stone to more complex algorithms where having 2 dimensions is considerably more complex.

## Input Sample Distribution

Input samples will originate from one of three sources ‘A’, ‘B’ and ‘C’ with starting values  x=-8, x=0 and x=+8 respectively, except that these values are noisy (with a ‘Normal’, or ‘Gaussian’ distribution) and they may change over time (we will consider two cases: one where the values drift across to x=-7, x=-2 and x=6 respectively and the other case where there is no change). ‘C’ is 3 times more likely to occur than the other 2. But in the case where the inputs drift, there is a gradual shift over to ‘B’ being 3 times more likely than the others. Also in this case, the ‘A’ and ‘B’ inputs become noisier so that it becomes even harder to distinguish between the two (their average separation distance has also decreased, from 8 to 5).

We define each cluster with 3 numbers: a mean (average) and variance and a weight. The start and (possible) end distributions are described as follows:

```dist_set1 = [ # Gaussians clearly separable (relatively small variances)
{'mean': -8, 'variance': 1, 'weight': 0.2},
{'mean':  0, 'variance': 1, 'weight': 0.2},
{'mean':  8, 'variance': 1, 'weight': 0.6}]
dist_set2 = [ # Gaussians spread and merging
{'mean': -7, 'variance': 3, 'weight': 0.2},
{'mean': -2, 'variance': 2, 'weight': 0.6},
{'mean':  6, 'variance': 1, 'weight': 0.2}] ```

Note in both distributions, the weights add up to 1: the total area under all the Gaussians must always equal 1. This can be shown graphically – the plots below show how the input sample distribution drifts from dist1_set1 at iteration=1 to dist_set2 at iteration=400. Gaussian input distribution over time/iterations (view from iteration=0)

The plots above and below are the same but just from different angles.

In the plot above: iteration 0 (start time) is at the front where the starting distribution can be clearly seen. In the plot below: iteration 400 (end time) is at the front – the final distribution can be clearly seen, with the significant merging of ‘A’ and ‘B’ Gaussians. In both plots, the shift from ‘C’ being most likely to ‘B’ being the most likely is also clearly visible. Gaussian input distribution over time/iterations (view from iteration=400)

The code to generate these plots is given at the bottom.

## Iterations: Batch versus Online

Above, I present 400 input sample distributions from ‘iteration=1’ to ‘iteration=400’. What do we mean by iteration?

With ordinary unsupervised learning algorithms, there is a ‘batch’ of data to be thrown at the algorithm. There is no time significance to the samples: sample 1 does not occur before sample 2 (although the algorithm may generally actually ‘look’ at sample 1 before sample 2 in its processing).

Machine Learning algorithms are frequently ‘iterative’: they start with a guess and do something repeatedly. Over a number of steps (‘iterations’) , the ‘current guess’ improves to the point where it a good result. At each iteration, the ‘do’ thing will look at the input data (sometimes all of the data). There could not be any time significance to the samples: sample 1 and sample 2 are looked at on iteration 1, iteration 2 and so on.

Sometimes, these multiple looks at the data is not practical. There may be a huge amount of data to be processed by a computer with only a small memory. We want to look at sample 1, learn something (as much as possible) before moving on the sample 2. This sequential algorithm might not perform as well as a batch algorithm – but it would at least be practically possible.

This is the situation that we humans find ourselves in (indeed, any animal does). We as biological beings have some memory but we do not record all our sense inputs from the day we were born and subsequently go back to that mass of data to determine new actions. Batch learning is not biologically plausible and so not really of interest as part of the ‘Big Spiky Audacious Goal’. (We might not do this as biological beings but we are likely to do this to some extent in the narrower context of us being scientific creatures – see here.)

In this post I consider all the combination modes:

1. Batch learning, constant input distribution,
2. Batch learning, drifting input distribution,
3. Sequential learning, constant input distribution, and
4. Sequential learning, drifting input distribution.

Mode 1 is the normal starting point and mode 4 is where we want to go – to a biologically plausible learning that can adapt itself to a changing environment.

Mode 2 should be the same as mode 1 since there is no time significance to the samples, except that mode 1 will only use dist_set1 distribution and mode 2 will effectively have a smeared-out average of dist_set1 and dist_set2 distributions.

I will provide 400 input samples a batch learning will be done over 400 algorithm iterations even though this is far more than needed. But it allows the simple Python code to handle all the modes.

Sequential learning is more generally known as ‘Online’ learning. Having looked at ‘Sequential K-Means’, a future post will look at ‘Online Gaussian Mixture Models’. Perhaps there is a distinction to be made between the two:

• Online: adaptive learning through use
• Sequential: learning only sees the data once. There is no change during use: it wouldn’t do well in a changing environment.

but this is not normally made.

Historical note on terminology: Nowadays we get onto the internet by ‘going online’. When online, we can get information back immediately and respond to it immediately. Many years ago, the only way most people would have to access a computer was in batch mode: punched cards would be physically sent to the computer and be put in a queue. Answers would be physically sent back. It wouldn’t matter if the order of batch jobs in the queue were changed around – your results would be the same. The source of the terminology and the principles of online and batch learning are the same.

## Naïve K Means

The basic K-means algorithm comprises:

2. Assign: for each input ‘sample’, find out which cluster it is closest to and allocate it to that cluster.
3. Update: for each cluster, set its new location to the average of all the input samples allocated to it.
4. Iterate: repeat steps 2 and 3 until the locations of the clusters do not change.

It really can be that simple. This is very similar to Expectation-Maximization algorithms such as Gaussian Mixture Models that we will be looking at in the future – its ‘Expectation’ and ‘Maximization’ phases correspond to the ‘Assign’ and ‘Update’ phases above.

‘Naïve K-means’ is where the above description is implemented literally as described above, with no attempt to reduce redundant computations. For example, for i iterations with n input samples and K clusters, the number of times we have to calculate the distance between an input and the centre-point (‘centroid’) of a cluster is i x n x K. But for more samples, they will not move from one cluster to another from one iteration to the following one, so the number of distance calculations should be more like i x n after the first iteration. But computational efficiency is not of interest here.

```def find_closest_cluster(x, centroids):
closest_cluster, closest_distance = 999999, 999999 # initially invalid
for k in range(K): # no. clusters
distance = (x - centroids[k])**2 # Euclidean distance
if distance < closest_distance:
closest_cluster  = k
closest_distance = distance
return closest_cluster
```

## Initialization

The actual location of clusters at iteration #0 can have a big influence on how the samples get split into clusters. There are various techniques to make good choices and these are often random choices, but that is not of interest here either. We will just initialize centroids A, B and C to constant (deterministic) values A=-11, B=3 and C=11 because our input values will be within the range -11 and 11 and the third centroid is roughly in the middle.

If there were clusters at x=5, x=7 and x=9, the C centroid would shift left towards x=9, the B centroid would shift left towards x=5 and A would have nowhere to go. B will not be able to exclude x=5 and A will not be able to jump over A to get to x=7. Initializing as I have done reduces the chance of this problem occurring. This is less of a problem with a higher number of input dimensions. (You can try playing around with different initialization values in the Python code).

Results will be built up in an array ‘centroid_history’ to be able to plot changes over time:

```  centroid_history = zeros((1+num_iterations, K)) # for cluster k at iteration t
# 'num_iterations+1' because centroids[0,] is the *initial* guess...
centroid_history = [ -11, 3, 11 ]
```

## Assignment, Update and Iteration

The Naïve K-means algorithm is then just:

```def naive_k_means(num_iterations):
for t in range(1, num_iterations+1):
# Assignment phase: Assign each sample to its nearest cluster by calculating its distance to each centroid.
sums     = zeros((K)) # sum of x's for each cluster
counts   = zeros((K)) # count of number of samples in each cluster
for sample in sample_history:
allocated_cluster = find_closest_cluster(sample, centroid_history[t-1])
counts[allocated_cluster] += 1
sums[allocated_cluster]   += sample
# Update phase: Find new cluster center by taking the average of the assigned points.
print("Iteration %2d centroids:" % (t), end='')
for k in range(K):
centroid_history[t,k] = sums[k]/counts[k]
print(" %6.3f" % (centroid_history[t,k]), end='')
print("")
```

Here, the number of iterations computed is given as an input parameter rather than checking for negligible changes in centroids between iterations. The number of iterations needed for the centroid positions to settle is very small – just 3…

## Running Python code

Previously I have said to go to the repl.it site to run Python code online e.g. https://repl.it/@enaard/Python-3. But there is now Google’s ‘Colaboratory’. Colab allows you to create your own Jupyter notebooks which would be stored in a Google Drive account. Text and Python code can be mixed within this.  (Arguably, the post should have been done in Colab.

Create a new notebook with:

## Naïve K-Means with a Constant Input Distribution

Running the Python code at the bottom first produces output for K-means with an unchanging input distribution. What does this graph show?:

• Green dots are inputs from ‘sample=1’ to ‘sample=400’ in accordance with the input distribution, dist_set1.
• For each green dot plotted, there are blue ‘Y’ brackets which indicate μ+2σ and μ-2σ: the region that is 2 standard deviations either side of the mean, within which 95% of all the samples will fall, within the particular input Gaussian.
• Note that there are many more green dots in the top (average=8) Gaussian – because of the 60:20:20 weighting.
• The red ‘-‘ marks indicate the location of the centroids of the 3 clusters, from initialization (11, 3 and -11) and iteration=1 through to iteration=400. Results for Naive batch K-Means algorithm with a constant input probability distribution

The centroids actually converged after only 3 iterations. By actually running to 400 iterations we get a red line plotted where the centroid finally ends up. We can see it has settled in the center of each of the input Gaussians.

```# Running Naive K-means: constant distribution
# Iteration  1 centroids: -8.179  1.828  8.276
# Iteration  2 centroids: -8.179 -0.116  7.985
# Iteration  3 centroids: -8.179 -0.169  7.971
# Iteration  4 centroids: -8.179 -0.169  7.971
# Iteration  5 centroids: -8.179 -0.169  7.971
…
# Iteration 398 centroids: -8.179 -0.169  7.971
# Iteration 399 centroids: -8.179 -0.169  7.971
# Iteration 400 centroids: -8.179 -0.169  7.971
```

Why aren’t the centroids at  -8, 0 and +8? Perhaps more than 400 samples would allow it to converge closer. Try setting

```num_samples = 800
```

and you will see the centroids converging closer to the actual input means:

```# Iteration 800 centroids: -8.075 -0.065  7.943
```

## Sequential K-Means with a Constant Input Distribution

The Sequential K-means algorithm processes the samples as if they were in time order – which they may in fact be! For each cluster K, it maintains a centroid (mean, μ) position but also maintains a count, n, of the number of samples that have contributed to that centroid value. This allows it to update the mean value. Given a new sample value, x, to add to cluster k:

• the sum of all x so far is μk.nk.
• updated sum is (μk.nk + x).
• the new mean (centroid value) is (μk.nk + x)/(nk + 1), and
• the new count is nk + 1.

A different way of expressing the new centroid value is

Since: μk.nk can be expressed in a roundabout way as μk.(nk + 1) – μk, the new centroid value can be expressed as:

(μk.nk + x)/(nk + 1) = (μk.(nk + 1) + x – μk)/(nk + 1)

= μk.(nk + 1)/(nk + 1) + (x – μk)/(nk + 1)

= μk + (x – μk)/(nk + 1).

If we increment nk before the new centroid value calculation it means that the centroid value is just adjusted by (x – μk)/nk.

We end up with the code:

```def sequential_k_means(num_samples):
sample_count = zeros((K))
for t in range(1, num_samples+1):
closest_k = find_closest_cluster(sample_history[t-1], centroid_history[t-1])
sample_count[closest_k] += 1
centroid_history[t] = deepcopy(centroid_history[t-1]) # carry over from previous ...
# ... but then adjust the *closest* centroid ...
centroid_history[t,closest_k] += (sample_history[t-1] - centroid_history[t-1,closest_k])/sample_count[closest_k]
print("# Iteration %2d centroids:" % (t), end='')
for k in range(K):
print(" %6.3f" % (centroid_history[t,k]), end='')
print("")
```

And we end up with the results: Results for Sequential K-Means algorithm with a constant input probability distribution

The centroid values start off at 8, 3 and -8 but they rapidly home in on the average over all the samples. The end result looks the same but could be slightly different. The set of decisions of which cluster was closest to each sample point may have been slightly different for Sequential K-Means compared with Naïve K-Means. In an extreme case, consider if the samples were sorted into top-cluster first, then middle cluster, then bottom cluster. For Naïve K-Means this would make no difference – ordering makes no difference. But for Sequential K-Means, some top-cluster samples will be closer to the middle centroid (starting at x=3) than the top centroid (starting at x=8).

## Sequential K-Means with a Drifting Input Distribution

Despite its imperfections, Sequential K-Means tries to produce centroid values based equally on every sample that has been before. As shown below, the centroids take the long-term average when the input distribution is shifting from dist_set1 to dist_set2. Results for Sequential K-Means algorithm with a drifting input probability distribution

But that long-term averaging is not what we want from a biological point of view. We want the centroid positions to be representative of the current environment we are in.

## Forgetful Sequential K-Means

What we want is a short-term moving average: at each step we add in a factor for the new value but take something away of what was there before – it will tend to forget what happened a long time ago.

Previously, the centroid value was just adjusted by (x – μk)/nk. As the number of samples nk becomes large, hardly anything of the new is added and hardly anything of the old is taken away.

Instead of adjusting using nk, how about adjusting by a.(x – μk)  – as suggested in a page from Richard Duda’s Princeton class? The parameter (‘mathematical control knob’) a controls the actual balance between old and new. To demonstrate: the new value μk[t+1] = μk[t] + a.(x[t]μk) = (1-a).μk[t] + a.x[t]. That is: it is a weighted sum of the previous centroid and the new sample. Duda calls it ‘Forgetful Sequential K-Means’.

The code for this is:

```def forgetful_sequential_k_means(num_samples, a):
sample_count = zeros((K))
for t in range(1, num_samples+1):
closest_k = find_closest_cluster(sample_history[t-1], centroid_history[t-1])
sample_count[closest_k] += 1
centroid_history[t] = deepcopy(centroid_history[t-1]) # carry over from previous ...
centroid_history[t,closest_k] += (sample_history[t-1] - centroid_history[t-1,closest_k])*a
print("# Iteration %2d centroids:" % (t), end='')
for k in range(K):
print(" %6.3f" % (centroid_history[t,k]), end='')
print("")
```

The issue then is what value of a to use? If we set it too small, too much of the past still remains. Here with a=0.01: Results for Forgetful Sequential K-Means algorithm with a constant input probability distribution (slow learning hyper-parameter a=0.01)

As discussed previously, the middle centroid is being influenced (pulled upwards) by the samples from the top cluster.

A higher value of a produces the desired result for our particular rate of change of input environment: Results for Forgetful Sequential K-Means algorithm with a drifting input probability distribution and faster learning hyper-parameter a=0.1

As Duda says, this is recognizable as a Digital Signal Processing filter. It is a first-order Infinite Impulse Response (IIR) low-pass filter. What happened in the past is counted less the further back in time we go (it has an exponentially-decaying impulse response).

(Cheeky aside: the ‘Consciousness and Zombies’ post explains how such an algorithm might be conscious – outrageous!)

## Self-Organizing Feature Maps

Duda also mentions the similarity between Sequential K-Means and Kohonen’s Self-Organizing Feature Maps.

The Self-Organizing Feature Map algorithm is essentially:

• Having found the closest cluster, there is a search for what other clusters are within a particular distance, R (‘radius’ in Euclidean terms), of that ‘Best Matched Unit’ (BMU). All the clusters within that radius are updated, not just the BMU.
• The update rule is the same as for Forgetful Sequential K-Means: new value wk[t+1] = wk[t] + a.(x[t]wk) = (1-a).wk[t] + a.x[t]. Note I have used wk instead of μk here – which I will cover below.
• The process is iterated (like normal, non-sequential, K-Means), with the inputs provided randomly at each iteration.
• Over iterations, the ‘learning rate’, a, and the radius, R, are reduced.

The extra complications don’t make much sense in the context here where we have only consider 1 input i.e. it is one-dimensional. The Euclidean distance is square-root((x μk)2 ) which is just | x μk | – the absolute (positive) value of the difference. And in one dimension, this is also the same as the Manhattan distance, if that is desired.

My point here is only that, in the terminology of Self-Organizing Feature Maps, a cluster is called a ‘neuron’ and a centroid is called a ‘weight’. It uses the language of neural nets. And this maps back to my earlier ‘Bayesian Origins of Neural Nets’ post in which it was said neuronal weights encoded mean values and variances – a mean being called a centroid in K-means terms.

## The Full Python Code

```import matplotlib.pyplot as plt
import numpy as np
from numpy import sqrt, square, pi, exp, zeros, linspace
from copy  import deepcopy

num_samples = 400
# For sequential learning, an iteration is performed on every new sample.
# For batch learning, K-means will converge in a much smaller number of iterations.
# Here, we just use the same number regardless.
num_iterations = num_samples

##############################################################
# Input: samples from a sum of Gaussian distributions
##############################################################

dist_set1 = [ # Gaussians clearly separable (relatively small variances)
{'mean': -8, 'variance': 1, 'weight': 0.2},
{'mean':  0, 'variance': 1, 'weight': 0.2},
{'mean':  8, 'variance': 1, 'weight': 0.6}]
dist_set2 = [ # Gaussians spread and merging
{'mean': -7, 'variance': 3, 'weight': 0.2},
{'mean': -2, 'variance': 2, 'weight': 0.6},
{'mean':  6, 'variance': 1, 'weight': 0.2}]
# Compared with dist_set1, dist_set2 has:
# * the means closer together,
# * larger variances and
# * weighting drifting
# Note: The weights must add up to 1.0
K = len(dist_set1) # Number of input and K-means clusters

# Mean, variances and weights will shift over time. Use these vectors for this...
m_vec, v_vec, w_vec = zeros((3, num_samples)), zeros((3, num_samples)), zeros((3, num_samples))
# Samples will be drawn from a distribution that gradually shifts from dist1 to dist2.
def generate_distribution_vectors(num_samples, dist1, dist2):
# Parallel interpolation using linspace...
# means, variances and weights
for i in range(K):
m_vec[i]  = linspace(dist1[i]['mean'],     dist2[i]['mean'],     num_samples)
v_vec[i]  = linspace(dist1[i]['variance'], dist2[i]['variance'], num_samples)
w_vec[i]  = linspace(dist1[i]['weight'],   dist2[i]['weight'],   num_samples)

##############################################################
# 3-D plotting of sum of Gaussian distributions
##############################################################

if False:
# 'Normal' distribution function, a.k.a 'Gaussian'
def normal_pdf(data, mean, variance):
return 1/(sqrt(2*pi*variance)) * exp(-(square(data - mean)/(2*variance)))

# To produce a sum of Gaussians distribution for every time step
def sum_of_gaussians(x, t):
result = 0
for k in range(K):
result += normal_pdf(x, m_vec[k][t], v_vec[k][t]) * w_vec[k][t]
return result

x_max = 12 # x sample input is over range -xmax to +xmax
steps_per_unit_x = 10 # enough to get good surface plot
total_x_steps = (2 * x_max * steps_per_unit_x)

# Produce Z[t, x] matrix (over all x input values and all iterations)
# of sum-of-Gaussians
generate_distribution_vectors(num_samples, dist_set1, dist_set2)
x_range = np.arange(-x_max, x_max, 1/steps_per_unit_x)
t_range = np.arange(1, num_samples+1)
Z = zeros((num_samples, total_x_steps))
for x in range(total_x_steps):
for t in range(num_samples):
Z[t, x] = sum_of_gaussians(x_range[x], t)
X, T = np.meshgrid(x_range, t_range)

# Plot the surface.
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter

fig = plt.figure(figsize=(10,6))
############surf = ax.plot_surface(X, T, Z, rstride=1, cstride=1, cmap=cm.coolwarm, linewidth=0.25)
surf = ax.plot_surface(X, T, Z, rstride=1, cstride=1, cmap=cm.coolwarm, linewidth=0.25)

# Customize the z axis.
ax.set_zlim(0.0, 0.4)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)

ax.set_xlabel('x')
ax.set_ylabel('iterations')
ax.set_zlabel('PDF')

plt.show()
exit()

###############################
# Options
###############################

# devs: the variance bars are plotted this no. of std deviations away from mean
devs = 2 # (1 =&gt; 68%, 2 =&gt; 95%, 3 =&gt; 99.7%)

np.random.seed(0) # for repeatable results

# Uses m_vec, v_vec and w_vec vectors of mean, variance and weight.
def generate_samples(num_samples):
for t in range(num_samples):
prob = list(w_vec[:, t])
source = np.random.choice(a=list(range(K)), size=1, p = list(w_vec[:, t]))
mean   =      m_vec[source, t]
stddev = sqrt(v_vec[source, t])
means_history.append( mean )
lower_history.append( mean - stddev*devs )
upper_history.append( mean + stddev*devs )
sample_history.append( np.random.normal(mean, stddev))

# Output: Graphical plot
def scatter_plot(title, iterations):
plt.figure(figsize=(10,6))
axes = plt.gca()
plt.xlabel("samples / iterations")
plt.ylabel("x")
plt.title(title)
steps = list(range(len(means_history)))
plt.scatter(steps, upper_history,   color='navy',  s=30, marker="1", label=("μ+%dσ" % (devs)))
plt.scatter(steps, lower_history,   color='navy',  s=30, marker="2", label=("μ-%dσ" % (devs)))
plt.scatter(steps, sample_history,  color='green', s=30, marker=".", label="x[t]")
steps = list(range(iterations+1))
for k in range(K):
plt.scatter(steps, centroid_history[0:iterations+1,k], color='red',  s=30, marker="_", label=("c%d[t]" % (k)))
#plt.legend(loc='upper left')
plt.show()

##################################################################
# Algorithm: K-means
##################################################################

def find_closest_cluster(x, centroids):
closest_cluster, closest_distance = 999999, 999999 # initially invalid
for k in range(K): # no. clusters
distance = (x - centroids[k])**2 # Euclidean distance
if distance <span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span><span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>&lt; closest_distance:
closest_cluster  = k
closest_distance = distance
return closest_cluster

def naive_k_means(num_iterations):
for t in range(1, num_iterations+1):
# Assignment phase: Assign each x_ix i to nearest cluster by calculating its distance to each centroid.
sums     = zeros((K)) # sum of x&#039;s for each cluster
counts   = zeros((K)) # count of number of samples in each cluster
for sample in sample_history:
allocated_cluster = find_closest_cluster(sample, centroid_history[t-1])
counts[allocated_cluster] += 1
sums[allocated_cluster]   += sample
# Update phase: Find new cluster center by taking the average of the assigned points.
print(&quot;# Iteration %2d centroids:&quot; % (t), end=&#039;&#039;)
for k in range(K):
centroid_history[t,k] = sums[k]/counts[k]
print(&quot; %6.3f&quot; % (centroid_history[t,k]), end=&#039;&#039;)
print(&quot;&quot;)

def sequential_k_means(num_samples, a=0, forgetful=False):
sample_count = zeros((K))
for t in range(1, num_samples+1):
closest_k = find_closest_cluster(sample_history[t-1], centroid_history[t-1])
sample_count[closest_k] += 1
centroid_history[t] = deepcopy(centroid_history[t-1]) # carry over from previous ...
# ... but then adjust the *closest* centroid ...
if forgetful:
# c[t]=c[t-1]+a(x[t]-c[t-1]) = (1-a).c[t-1]+a.x[t]
centroid_history[t,closest_k] += (sample_history[t-1] - centroid_history[t-1,closest_k])*a
else: # normal sequential K-means...
centroid_history[t,closest_k] += (sample_history[t-1] - centroid_history[t-1,closest_k])/sample_count[closest_k]
print(&quot;# Iteration %2d centroids:&quot; % (t), end=&#039;&#039;)
for k in range(K):
print(&quot; %6.3f&quot; % (centroid_history[t,k]), end=&#039;&#039;)
print(&quot;&quot;)

##################################################################
# Run numerous combinations of distribution and algorithm
##################################################################

test_runs = [ &quot;Naive K-means: constant distribution&quot;,
&quot;Sequential K-means: constant distribution&quot;,
&quot;Sequential K-means: drifting distribution&quot;,
# &quot;Naive K-means: drifting distribution&quot;,
&quot;Forgetful Sequential K-means (a=0.01): constant distribution&quot;,
&quot;Forgetful Sequential K-means (a=0.1): drifting distribution&quot; ]

for test_name in test_runs:
print(&quot;# Running %s&quot; % (test_name))
if &quot;a=0.01&quot; in test_name: hyperparameter_a = 0.01 # hyper-parameter for Forgetful Sequential K-means
if &quot;a=0.1&quot;  in test_name: hyperparameter_a = 0.1
# Initialize...
means_history, upper_history, lower_history, sample_history = [ ], [ ], [ ], [ ]
if &quot;constant&quot; in test_name: # input sequence from distribution dist_set1
generate_distribution_vectors(num_samples, dist_set1, dist_set1)
elif &quot;drifting&quot; in test_name: # input sequence drifting from distribution dist_set1 to dist_set2
generate_distribution_vectors(num_samples, dist_set1, dist_set2)
generate_samples(num_samples)

# This is the container for the results, to be able to plot the position of the centroids over time
centroid_history = zeros((1+num_iterations, K)) # for cluster k at iteration t
# &#039;num_iterations+1&#039; because centroids[0,] is the *initial* guess...
centroid_history = [ -11, 3, 11 ]

if &quot;Naive&quot; in test_name: # standard (batch) K-means
naive_k_means(num_iterations) # Will actually converge in about 2 iterations
elif &quot;Forgetful&quot; in test_name:
sequential_k_means(num_samples, hyperparameter_a, forgetful=True)
elif &quot;Sequential&quot; in test_name:
sequential_k_means(num_samples)

if &quot;Naive&quot; in test_name: # standard (batch) K-means
scatter_plot(test_name, num_iterations)
else: # sequential
scatter_plot(test_name, num_samples)
```
Posted in Uncategorized | | 3 Comments

## Approximating Normal

[[[ POST-EDIT: The formatting has been corrupted within the WordPress editor. See https://github.com/headbirths/neuro/blob/master/approximating_normal.py for code]]]

[Normally, I would say ‘This post is on the ‘xyz’ strand of this blogsite’. In this case, I say: For this post I only say it is parenthetically related to the ‘Artificial Intelligence’ strand of this blogsite]

A future blog post will be about Gaussian Mixture Models, where a signal is represented as the sum of a number of Normal distributions (also known and Gaussian distributions and more colloquially known as ‘bell curves’), their probability density function (PDF) shape defined by μ and σ: a mean (average) and standard deviation (how spread out) respectively.

Here, I provide a Python program that approximates a Normal distribution with quadratic sections. This could form the basis of a very efficient implementation in tiny/deeply-embedded systems. It provides an example of using an optimizer to determine the coefficients of the quadratics. In this case, it uses the ‘SLSQP’ Sequential_quadratic_programming (SQP) algorithm within scipy.optimize.minimize), inspired by AlphaOpt’s Youtube example:

(The term ‘quadratic’ within SQP is not related to the fact that the optimization being done happens to involve quadratics; that is purely coincidental. Other scipy.minimize algorithms could be used. This just happens to be the one inherited from AlphaOpt’s example.)

The method could easily be modified to produce efficient embedded code implementations of other functions such as square root or reciprocal.

I’ll start by providing the final result straight away:

```def quadratic(x, a, b, c):
return a*x**2 + b*x + c

a_coeffs = [ -5890, -2512,   2841,  1309,   216,  -18, -22, -11,  -8,   -5,   -9]
b_coeffs = [  -166, -3521, -14516, -9532, -3347, -750, -85,  33,  59,   57,   89]
c_coeffs = [ 13080, 13928,  19666, 15533,  7516, 2512, 582,  40, -98, -164, -210]

import math
def normal_16bit(x, mean, reciprocal_of_std_dev):
norm = abs(x-mean) * reciprocal_of_std_dev # normalized
segment = math.floor(norm * 2) # which 1/2-std-dev segment does the item belong to?
if (segment &amp;gt;= 11):
return 0 # Result is zero beyond the last segment
else:
return reciprocal_of_std_dev * quadratic(norm, a_coeffs[segment],  b_coeffs[segment],  c_coeffs[segment])

This produces a signed 16-bit result (output value is scaled by 32768) but has floating-point inputs – some more work would be needed to truly get tiny embedded code (not least, re-coding in C). The difficult part here of course has been in working out what the a, b and c coefficients should be.
First, define the correct, reference function:
from numpy import pi, exp, sqrt, square, linspace
def normal_pdf(x, mean: float, std_dev: float):
return 1/(sqrt(2*pi)*std_dev) * exp(-(square((x - mean)/std_dev)/2))

for x in linspace(-3, 6, 19):
print("# f(%4.1f | μ=2, σ=1) = %.5f" % (x, normal_pdf(x, 2, 1)))

Running this (reminder: you can run code online at, for example https://repl.it/@enaard/Python-3) would produce:
# f(-3.0 | μ=2, σ=1) = 0.00000
# f(-2.5 | μ=2, σ=1) = 0.00002
# f(-2.0 | μ=2, σ=1) = 0.00013
# f(-1.5 | μ=2, σ=1) = 0.00087
# f(-1.0 | μ=2, σ=1) = 0.00443
# f(-0.5 | μ=2, σ=1) = 0.01753
# f( 0.0 | μ=2, σ=1) = 0.05399
# f( 0.5 | μ=2, σ=1) = 0.12952
# f( 1.0 | μ=2, σ=1) = 0.24197
# f( 1.5 | μ=2, σ=1) = 0.35207
# f( 2.0 | μ=2, σ=1) = 0.39894
# f( 2.5 | μ=2, σ=1) = 0.35207
# f( 3.0 | μ=2, σ=1) = 0.24197
# f( 3.5 | μ=2, σ=1) = 0.12952
# f( 4.0 | μ=2, σ=1) = 0.05399
# f( 4.5 | μ=2, σ=1) = 0.01753
# f( 5.0 | μ=2, σ=1) = 0.00443
# f( 5.5 | μ=2, σ=1) = 0.00087
# f( 6.0 | μ=2, σ=1) = 0.00013

Calculating with floating-point exponents, square roots and divisions, as above, is ‘expensive’ in terms of processor time and memory compared with fixed-point addition, subtraction and multiplication operations. Can we implement an approximation of this Normal PDF function using only ‘cheap’ operations?
Polynomial Approximation
You could try to approximate the distribution with a polynomial. The following paper attempts this:
“An Efficient Polynomial Approximation to the Normal Distribution Function and Its Inverse Function”
with a 9th-order polynomial which would be implemented in Python with:
def normal_polynomial_f5(x, mean, std_dev):
n = (x-mean)/std_dev # normalize
result = 0.5
result += 1.196826841   * n
result -= 0.00723328282 * n**2
result -= 1.196826841   * n**3
result -= 0.5893704135  * n**4
result += 0.264946944   * n**5
result -= 3.026777282   * n**6
result -= 0.7621876586  * n**7
result += 1.616586552   * n**8
result -= 0.4984396913  * n**9
return result

But that doesn’t work for me! (What am I missing?) Besides, very high powers lead to numerical accuracy issues.
Here, I provide an alternative approximation by cutting the bell curve into a number of segments (at ½ standard deviation intervals) and fitting quadratic equations to them. The algorithm is:

Normalize to a N(μ=0, σ=1) distribution,
Flip negative inputs to positive,
Work out how many complete half-standard deviations that x is away from 0.
Use this number to look up a table of quadratic coefficients a, b and c
Calculate the result using the quadratic function y=ax2+bx+c.

One other trick is to express ‘spread’ in terms of 1/σ rather than standard deviation σ, or variance σ2 or ‘precision’ p= 1/σ2 to avoid some division.
from numpy import pi, exp, sqrt, square, linspace
from math import floor
def normal_pdf(x, mean: float, std_dev: float):
return 1/(sqrt(2*pi)*std_dev) * exp(-(square((x - mean)/std_dev)/2))

# Quadratic coefficients (floats to 6 decimal places)
a = [ -0.179459,  -0.066487,   0.046581,   0.103770,   0.039032,   0.008262,   0.000899,  -0.000266,  -0.000436,  -0.000493,  -0.000528 ]
b = [ -0.005188,  -0.122716,  -0.342654,  -0.514332,  -0.247844,  -0.071063,  -0.012733,   0.000591,   0.003447,   0.004595,   0.005505 ]
c = [  0.399174,   0.430580,   0.538260,   0.668233,   0.392328,   0.142497,   0.034105,   0.001928,  -0.006716,  -0.010684,  -0.014344 ]

# Quadratic coefficients (table as integers, quantized to 16-bit]
a16 = [float(i)/32768.0 for i in [-5881, -2179,   1526,   3400,   1279,    271,     29,     -9,    -14,    -16,    -17 ]]
b16 = [float(i)/32768.0 for i in [-170,  -4021, -11228, -16854,  -8121,  -2329,   -417,     19,    113,    151,    180 ]]
c16 = [float(i)/32768.0 for i in [13080, 14109,  17638,  21897,  12856,   4669,   1118,     63,   -220,   -350,   -470 ]]

num_segments = len(a)

return a*x**2 + b*x + c

def normal_approximation(unnormalized_x, a, b, c, mean, reciprocal_of_std_dev):
x = abs(unnormalized_x - mean) * reciprocal_of_std_dev # normalized: shift and stretch width-wise
segment = floor(x * 2) # which 1/2-std-dev segment does the item belong to?
if (segment &amp;gt;= num_segments):
return 0 # Result is zero beyond the last segment
else:
# Includes normalizing by factoring in std. dev.: shrink height-wise
return (a[segment]*x**2 + b[segment]*x + c[segment]) * reciprocal_of_std_dev

print('# Combined table:           Reference    Approx      Quantized')
for x in linspace(-3, 6, 19):
print("# f(%4.1f | μ=2, σ=1)" % (x), end='')
print("  = %10.4f" % (normal_pdf          (x,                2,   1)), end='')
print(" -&amp;gt; %10.4f" % (normal_approximation(x, a,   b,   c,   2, 1/1)), end='')
print(" -&amp;gt; %10.4f" % (normal_approximation(x, a16, b16, c16, 2, 1/1)))

This produces the results:
# Combined table:           Reference    Approx      Quantized
# f(-3.0 | μ=2, σ=1)  =     0.0000 -&amp;gt;    -0.0000 -&amp;gt;     0.0002
# f(-2.5 | μ=2, σ=1)  =     0.0000 -&amp;gt;     0.0000 -&amp;gt;     0.0002
# f(-2.0 | μ=2, σ=1)  =     0.0001 -&amp;gt;     0.0001 -&amp;gt;     0.0002
# f(-1.5 | μ=2, σ=1)  =     0.0009 -&amp;gt;     0.0007 -&amp;gt;     0.0006
# f(-1.0 | μ=2, σ=1)  =     0.0044 -&amp;gt;     0.0040 -&amp;gt;     0.0039
# f(-0.5 | μ=2, σ=1)  =     0.0175 -&amp;gt;     0.0165 -&amp;gt;     0.0165
# f( 0.0 | μ=2, σ=1)  =     0.0540 -&amp;gt;     0.0528 -&amp;gt;     0.0528
# f( 0.5 | μ=2, σ=1)  =     0.1295 -&amp;gt;     0.1302 -&amp;gt;     0.1302
# f( 1.0 | μ=2, σ=1)  =     0.2420 -&amp;gt;     0.2422 -&amp;gt;     0.2422
# f( 1.5 | μ=2, σ=1)  =     0.3521 -&amp;gt;     0.3526 -&amp;gt;     0.3526
# f( 2.0 | μ=2, σ=1)  =     0.3989 -&amp;gt;     0.3992 -&amp;gt;     0.3992
# f( 2.5 | μ=2, σ=1)  =     0.3521 -&amp;gt;     0.3526 -&amp;gt;     0.3526
# f( 3.0 | μ=2, σ=1)  =     0.2420 -&amp;gt;     0.2422 -&amp;gt;     0.2422
# f( 3.5 | μ=2, σ=1)  =     0.1295 -&amp;gt;     0.1302 -&amp;gt;     0.1302
# f( 4.0 | μ=2, σ=1)  =     0.0540 -&amp;gt;     0.0528 -&amp;gt;     0.0528
# f( 4.5 | μ=2, σ=1)  =     0.0175 -&amp;gt;     0.0165 -&amp;gt;     0.0165
# f( 5.0 | μ=2, σ=1)  =     0.0044 -&amp;gt;     0.0040 -&amp;gt;     0.0039
# f( 5.5 | μ=2, σ=1)  =     0.0009 -&amp;gt;     0.0007 -&amp;gt;     0.0006
# f( 6.0 | μ=2, σ=1)  =     0.0001 -&amp;gt;     0.0001 -&amp;gt;     0.0002

which shows very good agreement (very good for engineering purposes) between the ‘approx’ approximation function and the ‘reference’ values. Furthermore, when the quadratic coefficients are quantized to fit into 16-bit words, the results are only slightly worse.
To compute the approximation requires only:

3 table look-ups
one condition
one shift-left
4 multiplications
1 subtraction

The computation could be done with less than 11 segments – of varying widths – with no loss of accuracy. This would consume less memory but would require extra operations (to work out which table lookup value to choose).
Calculating Coefficients
The cleverness of course is in the coefficient values. How were these determined?
Answer: by using some other piece of code to find the best values, this using an optimizer. In this case, that optimizer is scipy.optimize.minimize and is based on the example optimization here.
Running the code (given at the bottom of this post) first produces:
# Segment partitioning (11 segments):
# Segment #0 from 0.000 to 0.500
# Segment #1 from 0.500 to 1.000
# Segment #2 from 1.000 to 1.500
# Segment #3 from 1.500 to 2.000
# Segment #4 from 2.000 to 2.500
# Segment #5 from 2.500 to 3.000
# Segment #6 from 3.000 to 3.500
# Segment #7 from 3.500 to 4.000
# Segment #8 from 4.000 to 4.500
# Segment #9 from 4.500 to 5.000
# Segment #10 from 5.000 to 5.500
##### BEFORE #####
# Segment #0 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #1 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #2 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #3 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #4 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #5 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #6 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #7 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #8 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #9 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Segment #10 y=(0.00000000)x^2 + (0.00000000)x + (0.00000000)
# Cost function: 18.13364415
# Maximum positive error -0.00000000 at -10.000 is 0.00000000 instead of 0.00000000.
# Maximum negative error -0.39894228 at 0.000 is 0.00000000 instead of 0.39894228.

This just shows the boundaries (on the x axis) and quadratic coefficients of each segment. Here, we don’t need to give the optimizer any help to find a solution: all the coefficients are just set to zero. If we use the approximation function with these coefficients, the answer will be zero regardless of the x value. This is borne out in the error plot: Initial error function when all quadratic coefficients are zero
The error e=a-r: ‘a’ is the approximation function output and ‘r’ is the reference function output that we are trying to fit the approximation to. Since ‘a’ flatlines, ‘e’ is the reference’s Normal distribution flipped upside down.
Optimization
We are told that the very bad approximation above has a cost function result of 18.13364415. This is determined by computing a sum-of-squares across all values of x from -5 to 5. The optimizer tries to modify the coefficients in order to find a set that has as low a cost as possible.
Optimizing takes a short while but then produces:
##### OPTIMIZING #####
Optimization terminated successfully.    (Exit mode 0)
Current function value: 0.00018880714614915984
Iterations: 36
Function evaluations: 1313

And we can then see what it has ‘found’ (produced):
##### AFTER #####
# Maximum positive error 0.00189853 at -1.000 is 0.24386926 instead of 0.24197072.
# Maximum negative error -0.00252195 at -2.000 is 0.05146902 instead of 0.05399097.
# Segment #0 y=(-0.17973425)x^2 + (-0.00505443)x + (0.39916185)
# Segment #1 y=(-0.07666328)x^2 + (-0.10744328)x + (0.42505809)
# Segment #2 y=(0.08668947)x^2 + (-0.44298249)x + (0.60016227)
# Segment #3 y=(0.03994489)x^2 + (-0.29089544)x + (0.47401580)
# Segment #4 y=(0.00659829)x^2 + (-0.10214379)x + (0.22936344)
# Segment #5 y=(-0.00053429)x^2 + (-0.02287305)x + (0.07667299)
# Segment #6 y=(-0.00066900)x^2 + (-0.00259672)x + (0.01775457)
# Segment #7 y=(-0.00032765)x^2 + (0.00101441)x + (0.00121242)
# Segment #8 y=(-0.00025578)x^2 + (0.00180205)x + (-0.00297611)
# Segment #9 y=(-0.00014274)x^2 + (0.00173502)x + (-0.00501367)
# Segment #10 y=(-0.00028692)x^2 + (0.00272845)x + (-0.00640868)
# Cost function: 0.00018881
# a = [ -0.179734,  -0.076663,   0.086689,   0.039945,   0.006598,  -0.000534,  -0.000669,  -0.000328,  -0.000256,  -0.000143,  -0.000287, ]
# b = [ -0.005054,  -0.107443,  -0.442982,  -0.290895,  -0.102144,  -0.022873,  -0.002597,   0.001014,   0.001802,   0.001735,   0.002728, ]
# c = [  0.399162,   0.425058,   0.600162,   0.474016,   0.229363,   0.076673,   0.017755,   0.001212,  -0.002976,  -0.005014,  -0.006409, ]

We can see that:

The cost function is far, far less.
The worst output is 0. 002522 away from the correct result. That should be good enough for a large number of applications.
The ‘a’, ‘b’ and ‘c’ coefficient vectors are handily printed out in a form that they can be readily pasted into the usage code, at the top.

Quantizing
That is using floating-point numbers. To be fast and efficient, we want to use fixed-point numbers. The effect of quantizing to 8and 16-bit coefficients is shown below:

The 8-bit result has a worst error of 1.56%.
The 16-bit result is only a tiny bit worse than unquantized.

##### Quantized to 8-bit signed fixed point #####
# [-23,-1,51,-10,-14,54,11,-57,77,5,-37,61,1,-13,29,0,-3,10,0,0,2,0,0,0,0,0,0,0,0,-1,0,0,-1,]
# Maximum positive error 0.01439278 at 3.400 is 0.01562500 instead of 0.00123222.
# Maximum negative error -0.00782848 at -4.500 is -0.00781250 instead of 0.00001598.
##### Quantized to 16-bit signed fixed point #####
# [-5890,-166,13080,-2512,-3521,13928,2841,-14516,19666,1309,-9532,15533,216,-3347,7516,-18,-750,2512,-22,-85,582,-11,33,40,-8,59,-98,-5,57,-164,-9,89,-210,]
# Maximum positive error 0.00189524 at -1.000 is 0.24386597 instead of 0.24197072.
# Maximum negative error -0.00253833 at -2.000 is 0.05145264 instead of 0.05399097.
# a = [-5890, -2512, 2841, 1309,  216,  -18,  -22,  -11,   -8,   -5,   -9, ]
# b = [-166, -3521, -14516, -9532, -3347, -750,  -85,   33,   59,   57,   89, ]
# c = [13080, 13928, 19666, 15533, 7516, 2512,  582,   40,  -98, -164, -210, ]

Again, the a, b and c coefficients are printed out for pasting into the standalone Python function.
Using the Normal Approximation Function
The Python program goes on to show the function’s usage (with 16-bit quantized coefficients), with a mean of 3 and standard deviation of 2. Using the Normal approximation
You can just about see the red reference underneath the blue approximation. The error plot is shown below. The optimization has produced good results around the center of the distribution (around x=3) but the offsets in the segment around x=-7 (about 5 standard deviations from the mean) suggest there is more than can be done here. Error (mismatch between approximation and the reference)
##### USAGE #####
# Maximum positive error 0.00094762 at 1.000 is 0.12193298 instead of 0.12098536.
# Maximum negative error -0.00126916 at -1.000 is 0.02572632 instead of 0.02699548.

Next Steps

The doesn’t take into account the quantization on the numbers other than the coefficients (e.g. into a 16-bit format).
The approach could be applied to other functions such as square root and reciprocal.

The Full Python Code
[[[ POST-EDIT: The code below has been corrupted by Python formatting within the WordPress editor. Instead, see https://github.com/headbirths/neuro/blob/master/approximating_normal.py ]]]
import numpy as np
import math
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Calculation of fixed point coefficients for
# of the Normal (Gaussian) Distribution function
# using the Python 'scipy.optimize.minimize' optimization function
# for fast, simple calculation on e.g. microcontrollers

# Optimization (fitting) conditions
# Global: normalized when doing fitting.
the_mean = 0
the_reciprocal_of_std_dev = 1
# Using the reciprocal means there aren't any division operations

# Naming of threshold and coefficients: segment n covers n-1.5 to n-0.5
segment_thresholds = np.linspace(0,5.5,12)
num_segments = len(segment_thresholds)-1
print('# Segment partitioning (%d segments):' % (num_segments))
for i in range(num_segments):
print('# Segment #%d from %.3f to %.3f' % (i, segment_thresholds[i], segment_thresholds[i+1]))

# Constant:
one_over_sqrt_2pi = 0.3989422804014327 # 1/(np.sqrt(2*np.pi))

def pdf_vector(x_vector, mean: float, reciprocal_of_std_dev: float):
s1 = one_over_sqrt_2pi * reciprocal_of_std_dev
y = []
for x in x_vector:
s2 = np.exp(-(np.square((x - mean)*reciprocal_of_std_dev)/2))
y.append(s1 * s2)
return y

# For a single segment, find the sum of squares of errors.
def cost_function(xrange, a, b, c, mean, reciprocal_of_std_dev):
# Wanting to *minimize* cost
step_size = 1/128 # x resolution:step size (100 steps per half standard deviation)
cost    = 0.0
x = xrange
while x = num_segments):
y_vector.append(0) # Result is zero beyond the last segment
else:
y_vector.append(quadratic(norm, coeffs[3*segment],  coeffs[3*segment+1],  coeffs[3*segment+2])  * reciprocal_of_std_dev)
return y_vector

def find_largest_errors(coeffs):
bins = np.linspace(-10.0,10.0,201)
reference = pdf_vector(bins, the_mean, the_reciprocal_of_std_dev)
approx    = approx_normal_vector(bins, coeffs, the_mean, the_reciprocal_of_std_dev)
# https://www.geeksforgeeks.org/python-maximum-minimum-elements-position-list/
err = [m - n for m,n in zip(approx, reference)] # Subtract *list* elements
if sign=='positive' : yerr = max(err)
if sign=='negative' : yerr = min(err)
xval = bins[err.index(yerr)]
yref = pdf_vector([xval], the_mean, the_reciprocal_of_std_dev)
yact = approx_normal_vector([xval], coeffs, the_mean, the_reciprocal_of_std_dev)
print("# Maximum %s error %.8f at %.3f is %.8f instead of %.8f." % (sign, yerr, xval, yact, yref))

def report_list(x_array, name, precision=6):
print('# %s = [' % (name), end='')
for x in x_array:
if precision==0:
print('%4.0f, ' % (x), end='')
else:
print('%10.6f, ' % (x), end='')
print(']')

def split_up(coeffs):
a, b, c = [], [], []
for i in range(len(coeffs)):
if i % 3 == 0: a.append(coeffs[i])
if i % 3 == 1: b.append(coeffs[i])
if i % 3 == 2: c.append(coeffs[i])
return a, b, c

def plot_results(coeffs, show = 'pdf'):
bins = np.linspace(-10.0,10.0,201)
plt.figure(figsize=(10,7))
axes = plt.gca()
plt.xlabel("\$x\$")
reference = pdf_vector(bins, the_mean, the_reciprocal_of_std_dev)
approx    = approx_normal_vector(bins, coeffs, the_mean, the_reciprocal_of_std_dev)
if show == 'pdf':
plt.ylabel("pdf")
plt.plot(bins, reference, color='red',  label="True pdf")
plt.plot(bins, approx,    color='navy', label="Approx pdf")
elif show == 'absolute error':
err = [m - n for m,n in zip(approx, reference)] # Subtract *list* elements
plt.ylabel("abs error")
plt.plot(bins, err,       color='navy', label="Error")
plt.legend()
plt.show()

def quantize_coefficients(coeffs, bits):
y = []
print("# [", end='')
for x in coeffs:
integer = round(x*(2**bits))
y.append( integer/(2**bits) )
print("%d," % (integer), end='')
print("]")
return y

print('##### BEFORE #####')
# Initial Guess, pre-optimization
# Lazy here: just set every coefficient set to zero
x0 = np.zeros([3*num_segments]) # 3 coefficients (a, b and c) per segment
report_coefficients(x0)
print('# Cost function: %.8f' % (calcCost(x0)))
plot_results(x0, show='absolute error')

print('##### OPTIMIZING #####')
# Non-linear: use Sequential Least SQuares Programming (SLSQP)
sol = minimize(objective, x0, method='SLSQP', options={'disp':True})
xOpt = sol.x
errOpt = sol.fun

print('##### AFTER #####')
find_largest_errors(xOpt)
report_coefficients(xOpt)
print('# Cost function: %.8f' % (calcCost(xOpt)))
a, b, c = split_up(xOpt)
report_list(a, 'a')
report_list(b, 'b')
report_list(c, 'c')

print('##### Quantized to 8-bit signed fixed point #####')
quantized_xOpt = quantize_coefficients(xOpt, 8-1)
find_largest_errors(quantized_xOpt)

print('##### Quantized to 16-bit signed fixed point #####')
quantized_xOpt = quantize_coefficients(xOpt, 16-1)
find_largest_errors(quantized_xOpt)
a, b, c = split_up(quantized_xOpt)
report_list([n*32768 for n in a], 'a', precision=0)
report_list([n*32768 for n in b], 'b', precision=0)
report_list([n*32768 for n in c], 'c', precision=0)

print('##### USAGE #####')
# Example usage:
the_mean = 3
the_reciprocal_of_std_dev = 1/2 # i.e. variance=4
plot_results(quantized_xOpt)
plot_results(quantized_xOpt, show='absolute error')
find_largest_errors(quantized_xOpt)

```

## Economics for Hard Times, part 2

[This post is on the ‘Politics/Economics’ strand of this blogsite]

Following on from Part One, this Part 2 concludes the talk by looking at:

• welfare for those in the bottom half of society (how it is best provided), and
• the polarization of society (and particularly how those of us in the top half of society should help de-polarize) ## Welfare

Welfare can be provided in a number of ways. Cash is the most common. This is generally means-tested and commonly associated with work, such as Earned Income Tax Credits. non-means-tested Universal Basic Income and work-related Negative Income Tax are increasingly being talked about. Goods may be provided directly. This could be in the form of coupons. More commonly services are provided directly, frequently as ‘Universal Basic Services’. Less commonly, there is state intervention to provide jobs which provide peoples’ income such as Workfare programs. Also increasingly being talked about is the government providing jobs directly, through the Green New Deal or as ‘Employer of Last Resort’. From this, it can be seen there is the close association between getting benefits and having to work.

## UBI

Universal Basic Income (UBI) is not related to work in any way.

“Many economists, going back at least to Milton Friedman, approve of UBI’s hands-off attitude. As we discussed, many of them are acculturated to assume people know best what is good for themselves, and see no reason to believe a government  bureaucrat would know better. For them, handing over cash to welfare recipients is obviously the right thing to do.”

There have been many trials of UBI. The most famous:

“was tried out I Finland in 2017 and 2018, where 2000 unemployed workers were randomized to get UBI replacing all traditional assistance programs (housing, employment assistance, etc.) The remaining 173, 222 formed the control group. Early results suggest UBI recipients are happier.”

The Finnish trial is widely reported as being a failure, generally because

“There is no difference in earnings between the two groups”

but is not the increase in happiness that is relevant here?

The trial was stopped largely because it was just a trial. Banerjee and Duflo have much experience in such trials – their Nobel Prizes were awarded for their development of Randomized Controlled Trials in poor countries. But a general observation from them is that:

1. Trials are all too frequently not conducted widely enough
2. Trials are all too frequently not conducted for long enough, and
3. Conclusions were drawn too soon after completion of a trial.

The issue with UBI of course is how ‘basic’ is ‘basic’?. It is not true that it can be funded by just taxing the top 20%. Universal Basic Income is popular with the public, until working people are told how expensive it will be (what tax rate it will incur) – even though they may not be any worse off. If the level is set too low to be politically affordable, it will not perform its function of providing universal welfare and will need to be supplemented. Yet the couple still advocate an ‘Universal Ultra-Basic Income’ (UUBI).

To make it more affordable (less universal!) they suggest that:

“it is fairly easy to exclude 25% of people from the program. It may indeed be possible to introduce a mild form of self-selection. Requiring each beneficiary to visit an ATM every week and put their biometric ID into the system, whether or not they take out the money, would have the dual advantage of eliminating ghost claimants and making it too much of a hassle for the wealthy to want to claim it.”

“one can imagine taxes going up from 26% to 31.2% of GDP.”

(from US to European tax levels)

“This would allow every American to get \$3000 per year. For a family of four, this would be \$12,000 per year, half the poverty line.”

I am not convinced by this. See comments that they make (further down) on how it is the people at the bottom of the pile who are ‘self-selected’ out of benefits for various reasons. And fundamentally, people have different absolute needs. Someone in their early twenties living with their parents have less financial needs that other in middle age looking after elderly or special needs dependents. Some UUBI doesn’t fully replace means-testing. In which case, why have it? I have previously said LINK that Universal Basic Services are better. And see my proposed alternative (means-tested, smartphone-based) technology solution further down.

The major advantage of UBI is its simplicity of administering. This was the motivation for the universal credit, a worthy concept to combine various separate welfare schemes, although it was poorly executed in the UK since it was first introduced in 2013. It is universal in the sense of its form of delivery rather than its actual welfare value.

## Cash – The Undeserving Poor

“Roosevelt’s New Deal marked the beginning of an era where poverty was seen as something could fight, and beat, with government intervention. This continued until the 1960s, culminating in Lyndon B. Johnson’s ‘war on poverty.’ But when growth slowed and resources were tight, the war on poverty turned into a war on the poor. Ronald Reagan would return time and time again to the image of the so-called welfare queen, who was black, lazy, female, and fraudulent.”

(!)

That was many years ago. But even now there is the idea of blaming the poor – and not just from expected quarters.

In 2018, Macron was caught talking about:

“the need to make the poor more responsible”

(This is one of a number of times Banerjee and Duflo criticize Macron in the book.)

But Banerjee and Duflo assert that there is no evidence that the

“poor just there is no support in the data for the view that the poor blow the money on desires rather than needs”

Men do not just spend the money on tobacco and alcohol. (Even so, when providing welfare, they are:

“still in favor of giving the money to the woman, because it restores a little of the balance of power within the family”

)

But there is still a problem here in that public values of what welfare should be spent on are different from what the recipients of the welfare want to spend it on. They recall an example of a Moroccan man for whom:

“television is more important than food”

“What the television delivered was relief from the endemic problem of boredom. … The Moroccan did very much insist his preference made sense. Now that he had the television, any more money, he told us several times, would go on buying more food. This is entirely consistent with his view that televisions serve a greater need than food. But it flies in the face of most people’s instincts and many of the standard formulations in economics.”

## Non-Cash Benefits

So it may be better to provide welfare in the form of goods rather than cash – despite what the recipients want – for political reasons regarding the ability of the government to redistribute wealth.

And in some cases, it works the other way around. In India, they heard cases where people preferred being handed food (grain) rather than cash because:

“getting foodstuff protects them against the temptation to misuse. … One respondent said, ‘Food is much safer. Money gets spent easily.”

and so

“getting the gift of the asset freed them up from worries of survival, giving the the bandwidth and the energy to focus on their work.”

There is then the matter of how effective the goods can be distributed. In India, this was costly, complicated and corrupted and, in the end, they found that it was better to just do direct bank transfers, such as with the SNAP program (‘Supplemental Nutrition Assistance Program’) in the USA.

“These days, ‘stamp recipients’ receive government-issued electronic benefit transfer (EBT) cards that are swiped like a debit card in the checkout line, which avoids the stigma of handing over the stamps. But not everyone eligible knows that.”

We need to

“protect as far as possible the dignity of those being helped”

and the

“desire to be respected”

(As we will see, the words ‘dignity’ and ‘respect’ crop up time and time again.)

But often, the self-selecting don’t claim, through a lack of awareness, sign-up barriers or as with someone:

“After a lifetime of being ‘thrown out,’ he had given up on trying to get selected”

The couple suggest a solution may be through marketing:

“those likely to be eligible for SNAP received a pamphlet designed by a PR firm describing the local EBT card as the ‘Golden State Advantage Card.’ It was described as the way to ‘Get more at the grocery shop’ and the emphasis was on the fact that working families might be eligible.”

These last two statements don’t seem to be consistent to me. One general solution I can imagine might be possible is enabled by the move to a cashless society:

• ‘Advantage App: Everyone eligible for a 28-day trial period”
• “Immediate sign-up: Use it to pay for your groceries today”
• Much of the means testing (for whether the benefit continued after a month) could be carried out unobtrusively – monitoring patterns of where the phone is, through GPS.
• This would be like inserting a Universal Basic Income into the welfare system – staggered and just for one month.

(If you query that not everyone would have a smartphone: there are enough spare smartphones lying in peoples cupboards in any developed country to ensure that ownership was universal. Then, the marginal cost of providing a

Then there is the matter of providing everyone with a Universal Basic Service of a phone contract. The marginal cost of doing this for a genuinely basic service (such as a miserly 100 texts, 100 minutes and 100MB per month) must be negligible.

## Work-Related Benefits

The politically-preferred way of providing welfare is by linking it to work – indirectly by linking benefits to pay.

For example:

“Clinton … expanded the earned income tax credit (EITC), which supplements earnings for poor workers (making government assistance conditional on already having some work).”

But

“for people near the tipping point between being takers from and payers into the system, there is potentially a strong disincentive to work.”

Negative Income Tax was something favored by Johnson,  Friedman and Nixon among others.

“A negative income tax (NIT) implements the idea that the system of income taxation should be designed so everyone is guaranteed to receive at least a minimum income. The poor should pay negative taxes so they get paid more than they earn, but as they get richer they will get less in transfers until at some point they start paying into the system.”

There is concern about the impact of welfare on productivity yet, using the Alaska Fund and casino dividends on Cherokee lands as support. They say:

“There is no evidence that people work less. Economists find this surprising.”

Somewhat inconsistently, elsewhere they say it does:

“The results suggest the NIT program did reduce labor supply a bit, but not nearly as much as feared. On average, the reduction in time worked was only between two and four weeks of full-time employment over a year.”

## Welfare through Work

Benefit can be tied to work even more closely if the government is the ‘employer of last resort’.

in scheme like Workfare, an idea becoming topical again:

“In 2018, a very different approach … is gaining ground … presidential candidates Cory Booker, Kamala Harris, Bernie Sanders and Elizabeth Warren have all proposed some kind of federal guarantee, whereby any American who wanted work would be entitled to a good job (\$15 an hour with …) in community service, home care, park maintenance, etc. The Green New Deal proposed by Democratic members of Congress includes a federal job guarantee. The idea is of course not new”

But

“ Such a program is not easy to run well if the experience in India is any guide. Creating and organizing enough jobs would probably be even harder in the United States … Also the jobs would need to be useful. If they were transparently some form of make-work, they would not boost the employees’ self-esteem.”

## More than just Work

There’s more benefit from providing welfare through work than just providing the welfare benefit! Work provides more than just income:

“All the evidence scattered through this book suggests most people actually want to work, not just because they need the money; work brings with it a sense of purpose, belonging, and dignity.”

It gives them a

“‘satisfaction of work well done,’ ‘sense of personal accomplishment,’ ‘opportunity to make a positive impact on community/society,’ ‘opportunities to fully use talents, ‘and ‘Goals to aspire to’.”

“It seems that it is very hard for people to individually figure out how to build meaning into their lives. Most of us need the discipline provided by a structured work environment, to which we then add significance or meaning. This is something that comes up when individuals worry about automation.”

“Since the start of … the 1960s, time spent on leisure activities has increased quite a bit, both for men and women. For young men, a sizable part of this time, since 2004, has been devoted to video games. For all other groups, the bulk of this time has been absorbed by watching television.”

“But apparently watching television and sleeping do not necessarily make us happy.”

“People who have more time on their hands ..  are if anything less likely to volunteer than those employed full time. Volunteering is something we do on top of our regular activities not instead of them.”

## Redundancy

Damningly, they say:

“As economists, we worry about the income loss and the time and effort in finding a new job, but the cost of losing the job itself appears nowhere in our models”

When dealing with people made redundant, we need to make it clear that

“While they may have problems, they are not the problem

“By shifting the narrative … from ‘you are being bailed out’ to ‘sorry this happened to you…’ we might alter the sense that blue-collar workers have that they are victims of war waged by the rest of us against them.”

“all too many politicians do not try to hide their contempt for the poor and disadvantaged.”

“When … Hillary Clinton icily announced that ‘we are going to put a lot of coal miners and coal companies out of business,’ coal workers perhaps not unreasonably felt this was callously undercutting their way of life without ever feeling the need to apologize or compensate them for the loss.”

“Economists are instinctively critical”

of government intervention.

“However, this has been the excuse for not intervening all these years when trade has been robbing people of their livings while claiming to make everyone better off.”

in the

“large pockets of left-behind people … in the hundreds of towns blighted by anger and substance abuse”

“on average, displaced workers never fully recover in terms of earnings after a mass layoff.”

“A study found that when workers with long tenure get fired during mass layoffs, they are more likely to die in the years immediately afterward. Losing a job seems to literally give people heart attacks.”

“more long term problems like alcoholism, depression, pain, and addiction set in.  Overall, the study found that workers displaced in middle age lost between one and one and a half years of life expectancy.”

## Labor Market Intervention

“Many European countries invest much more in their job transition programs than the United States. For the 2% of GDP Denmark spends on active labor market policies (training, job finding assistance, etc.), it gets high job-to-job mobility … three in four displaced workers find a new job within one year.”

“Germany spends 1.45% of its GDP on active labor market policies … The corresponding measure for the United States is just 0.11%.”

“Economists and many policy makers like the Danish model of ‘flexicurity.’ It allows for full labor flexibility, meaning people can be laid off quite easily whenever they aren’t needed anymore, but the laid off are subsidized so they don’t suffer much of an economic loss.”

(I would have said ‘suffer so much’ rather than ‘suffer much’)

Particularly problematic are re-employing older people is particularly problematic. They

“find it more difficult to switch to another career. Retraining them is costly, given that they have relatively few years of work life left. … This is why … we proposed the somewhat radical idea that some workers should be subsidized to stay in place. … Such a policy should only be triggered when a particular industry in an area is in decline, and reserved for older employees (above 50 to 55)”

But

“Economists are instinctively critical of opening up such a large space for governmental discretion.”

Even so, we need to look at the bigger picture.

“The goal of social policy, therefore, should be to help the distressed places that exist, but perhaps more importantly avoid ending up with many more.”

“In a sense, this is what Europe has done with its Common Agricultural Policy. Economists hate it, because a dwindling number of European farmers have gotten a great deal of subsidies at the expense of everyone else. But they forget that by preventing many of the farms shutting down, it has kept the countryside in many European countries much more verdant and vibrant.”

And protecting the countryside attracts tourists.

Subsidizing may be economically distasteful but it is still more palatable than not subsidizing.

## Respect and Dignity

Banerjee and Duflo talk about a number of development schemes to help the disadvantaged, with the need to shift from patronizing to respectful. Time and time again, we encounter the words ‘respect’ and ‘dignity’.

“The deep disregard for the human dignity of the poor is endemic in the social protection system.”

In one scheme, addressing youth unemployment in a town near Paris,

“The young people who arrive have been told, all their lives, what to do. They have also been told, in school and perhaps at home, that they are not good enough. They arrive bruised, wounded, with extremely low self-esteem” … which often translates into an instinctive suspicion of everything offered.”

“all their lives people have given them things. No one has even asked them to contribute.”

“Strikingly, none of them spoke about money. One after other, they spoke about dignity, self-respect, and autonomy”

“the caseworker took time to understand … without ever obviously judging …”

“the caseworker started to focus on convincing the youths that they were in control of their destinies and had what it took to succeed.”

“Many of these young people had never experienced being taken seriously by anyone in an official position (teachers, bureaucrats, law enforcement officials)”

“Work is not necessarily what follows after all the other problems have been solved … but is part of the recovery process itself.”

“The program is essentially a form of therapy aimed at restoring confidence”

## Polarization

In recent years, an increasing polarization of society has become apparent. It might seem that the traditional left-of-center and right-of-center have diverged into a Neo-liberal Left and a Racist Right. A better interpretation is to be had from seeing politics become realigned onto Tony Blair’s ‘open-closed political spectrum’ (this is best seen within a ‘Conservative-Liberal-Socialist’ framework like Hans Eysenck’s). In the terminology Blair chose, there is an implicit preference: ‘open’ is clearly better than ‘closed’. But if we think of ‘open’ as excessively liberal individualism, we might name those terms ‘selfish’ and ‘altruistic’ and see them in a different light. It is now a moral axis, balancing the wants of oneself against those of the wider community. This perspective might make Liberals like us less dismissive of the great unwashed and less likely to demonize them.

## The Behavior of Herds

“it is worth trying to understand why people have such views before we sink deeper into the morass of racism, it is impossible to think about the policy choices we will confront in this book without getting a handle on what these preferences represent and where they come from.”

“black teenagers mostly associate with blacks, and whites with whites. This is what sociologists call homophily. … This does not have to be evidence of intense prejudice. That students in the biggest group do not reach out to outsiders can easily be explained by the fact it is easy for them to meet others like them, and therefore as long as they have a mild preference for their own group, they have no reason to reach out beyond it. The source of the mild preference does not have to be a negative view of anyone else; it could just be that it is easier to be with people who speak the same language, … share the same gestures … humor, TV shows … music.”

A mild preference can become a strong preference when livelihoods are at stake. The couple refer to an example about Swiss cheese producers needing to cooperate, an example from Elinor Ostrom

“the first (and so far the only) woman to receive a Nobel Prize in economics”

(the parenthetic clause now being factually incorrect!). In such communities,

“Systems of mutual help are vulnerable to collapse if some community members have opportunities outside. … The whole system of mutual support may unravel totally, leaving everyone worse off. The community is therefore very alert to, and protective against, behavior that seems to threaten the communal norms.”

“But the fact that norms can be self-enforcing does not necessarily make them good.”

“racial discrimination … can be sustained by the same logic, even if no one actually cares about race

“what looks like conformity could be the outcome of rational decision making by many individuals with no interest in conforming”

The fact that each individual decision is rational does not make the outcome desirable.

“those who violate the norm will be punished by the rest of the community. And so will those who fail to punish violators”

This ‘herd behavior’ is the opposite of what James Surowiecki calls the ‘Wisdom of Crowds’.

• In a Surowieckian ‘crowd’, imperfect individuals lead to good collective decision-making.
• In this herd, perfectly rational individuals lead to a bad collective culture.

## Echo Chambers

So, group exclusion of others that are not like ourselves can be attributed to unconscious self-organization into herds and do not necessarily need to be explained in terms of hatred and such like.

“One obvious downside of sticking to our own is we don’t get exposed to other points of view.”

This leads to group-think and fragmentation of wider society.

The internet should have helped break down barriers between people:

“Originally, virtual social networks were billed as the new public place, the new way to connect, which should have reduced homophily. In principle, they provided an opportunity to connect with distant people with whom we shared some specific interest … These people might not have been like us in other ways, offering us a more eclectic choice of friends than what would result from mere physical proximity.”

“Nevertheless, virtual social networks have mostly failed to integrate their users on divisive issues.”

but this has been part of a more widespread problem

“While it is true that polarization has increased among online users, it has also increased in other spheres of life.”

“it increased the most among those 65 or older, who are least likely to be on the internet.”

“Polarization has also increased in traditional news media. … Fox News has become increasingly slanted to the right, while MSNBC has moved to the left. The audiences have also diverged.”

(The UK’s beloved BBC’s has long provided a single common national source of impartial quality news. But, following recent political stress, even this role is under threat.)

“A powerful demonstration of just how destructive old media can be comes from the Rwandan genocide. … Before and during the genocide, Radio Television Libre des Mille Collines (RTLM) called for the extermination of the Tutsis, whom they called ‘cockroaches’ … Altogether, RTLM propaganda is estimated to be responsible for 10% of the violence, or about 50,000 deaths.”

The internet is reducing the diversity of good journalism in the ‘old’ media:

“The circulation of news on social media is killing the production of reliable news”

“A study found that 55% of the content diffused by news sites in France is almost entirely cut and pasted. … how does the original source get rewarded for its production?”

Artificial Intelligence is increasingly being used for this. And

“Producing fake news is of course very cheap”

particularly when done online.

“Without access to proper facts, it is easier to indulge in nonsense.”

Cass Sunstein, a Harvard law professor talked about

“‘echo chambers’ where like-minded people work themselves into a frenzy by listening only to each other.”

“In 2001, when Sunstein was writing about echo chambers, he was worried about the opportunity users have to choose the news they consume. Increasingly, there is no need to choose. Sophisticated algorithms use machine learning prediction techniques. … The objective, quite explicitly, is to get people what they like so they spend more time on it.”

“The sharp segmentation of the electorate goes much deeper than just policy disagreements. Americans of different political hues have started to positively hate each other.”

“By 2010, nearly 50% of Republicans and over 30% of Democrats ‘felt somewhat or very unhappy at the prospect of inter-party marriage.’”

“Democrats and Republicans do not even speak the same language anymore.”

“Democrats talk about ‘estate taxes,’ ‘undocumented workers’ and ‘tax breaks for the wealthy,’ while Republicans refer to ‘death taxes,’ ‘illegal aliens,’ and ‘tax reform.’”

This is not just about politics. One does not have to look very far on the Twittersphere to find a Twitterstorm full of Twitterbile (stoked by Twitterbots), helped by the relative anonymity of the internet.

Yet we now have the opportunity to receive news from many sources, and we should do. It should be our duty to look to these many sources rather than just choosing one, it order to understand views from an alternative (different, or wider) perspective. We should occasionally watch Al Jazeera and Russia Today. And we should also occasionally read the Frankfurter Allgemeine Zeitung and Le Monde instead of our normal choices – because we now can (probably courtesy of Google Translate).

It is our moral duty to be critical of our own opinions and to try to see into the minds of others. What must it be like for them?

“when users actively choose what they are reading, they are at least conscious of what they are doing. They may prefer to read articles from familiar sources, but be sophisticated enough to acknowledge their own biases reflected in those sources.”

“two young Koreans created an app offering … curated articles … on topical issues and regularly asked them their opinions on the articles and on the issues themselves. At first, all users received a randomly selected article about each issue. After a number of rounds, some randomly selected users got to choose the news sources from which they received their articles … The experiment yielded 3 important results:

• First, users did respond to what they read: they updated their views in the direction of what was being presented to them.
• Second, as expected, those given the option chose articles generally aligned with their partisan preferences.
• Third, however, at the end of the experiment, those who got to choose their articles had updated their preferences more than those who did not, and they had generally updated toward the center!

This is the opposite of the echo chamber effect.”

Just as online algorithms feed us information on what we may like to purchase, they feed us news on what we may like to read. But it would be just as easy to change that algorithm in the case of news stories to direct alternative (unlike) viewpoints to us.

(For further reading on ‘Cyber-Balkanization’, read these Guardian, BBC and Wired articles.)

“accusing people of racism or calling them the ‘deplorables,’ as Hillary Clinton famously did, is a terrible idea. It assaults people’s moral sense of themselves and puts their backs up. They immediately stop listening. Conversely, one can see why calling egregious racists ‘fine people,’ and emphasizing there are bad people ‘on both sides,’ as President Trump did, is clearly an effective strategy (however morally reprehensible).”

This is just about being a good communicator (whether for good or ill) when trying to communicate with people with worldviews different from ourselves. People cannot just absorb conflicting information – they must assimilate new information in such a way they their worldview is still coherent (albeit coherent in a slightly different way). People who do not recognize that other people can see the world differently in quite fundamental ways (whether due to lacking third order intentionality or just plain not thinking about it) will not be able to communicate effectively.

The couple’s recommendation is that

“We should not stop telling the truth, but it is more useful to express it in a non-judgmental way.”

We need to communicate with respect and to preserve dignity.

## Dignity and Contact

“Only a social policy founded on respect for the dignity of the individual can help make the average citizen more open to ideas of toleration”

“Gordon Allport, a professor of psychology at Harvard, formulated what he called the ‘contact hypothesis’ in 1954. This is the idea that under appropriate conditions, interpersonal contact is one of the most effective ways to reduce prejudice. By spending time with others, we learn to understand and appreciate them, and as a result of this new appreciation and understanding, prejudice should diminish.”

How frequently do you have meaningful conversations with people of different levels of education, different political opinions and different ethnicities? (Being served in a shop is not a meaningful conversation.)

“build public housing targeting low-income residents and disperse that housing throughout the city”

although

“It may not be possible to be as bold as Singapore, where strict quotas ensure some amount of mixing between ethnic groups in every block of residential housing”

“schools and universities are obviously key.”

“shared schools are another way to integrate the population.”

“The sum total of all our proposals might seem modest in the face of what feels like a tsunami of prejudice. But that would be to miss the main point of this chapter, which is that such preferences are as much part of the symptoms of the malaise as its cause”

“The most effective way to combat prejudice may not be to directly engage with people’s views, Instead, it may be to convince citizens it is worth their while to engage with other policy issues”

“We need to re-establish the credibility of the public conversation about policy, and prove that it is not just a way to use big words to justify doing very little. And of course we need to try to do what it will take to assuage the anger and deprivation so many feel, while acknowledging it will be neither easy nor quick.”

If we can make contact with ‘the other’, we may make discoveries about them. It is

“more useful to ponder why a particular seemingly irrational choice might actually make sense, rather than to close our minds to its potential logic and attribute it to some form of collective hysteria.”

“We have some sympathy for the idea that people’s choices are coherent, in the sense of being thought out rather than a collection of random acts of whimsy. It is both patronizing and wrong-headed in our view, to assume people must have screwed up just because we might have behaved differently.”

“the choices of the poor often make more sense than we give them credit for”

Recall again the story of the Moroccan man for whom:

“Television is more important than food”

“We learned something by being willing to suspend our disbelief and trust that people know what they want.”

Having listened and considered the other perspective, are we so confident that we would behave differently if we were in the same situation?

## Putting Everything Together

To provide a very succinct summary:

• Without compensatory redistribution, Globalization does hurt the poor in the West. Stickiness and clustering exacerbate this leading to the hopelessness of the left behind people in left behind places.
• Change hurts and needs managing. Welfare requires intervention in the market. Work is more than just income. We must design benefits for what works.
• Growth has stagnated. There is a misallocation of resources. ‘Crony Capitalism’ has led to super-salaries without talent.
• New growth must improve well-being and must be environmentally-friendly.
• The human propensity for herd-like behavior is exacerbated by social media. Communicating across the divide needs to maintain dignity and requires respect and open-mindedness. The other might actually know better.

In their own words:

“The spirit of this book is to emphasize that there are no iron laws of economics keeping us from building a more humane world, but there are many people whose blind faith, self-interest, or simple lack of understanding of economics makes them claim this is the case.”

“what is at stake is the whole idea of the good life as we have known it. We have the resources. What we lack are the ideas that will help us jump the wall of disagreement and distrust that divides us.”

(Note: this talk excludes two chapters from the book: on migrants and on the growing distrust of government.)

## Economics for Hard Times, part 1

[This post is on the ‘Politics/Economics’ strand of this blogsite]

## Introduction

My last talk ‘Economics as if People Mattered’ was named after the subtitle of E. F. Schumacher’s 1972 book ‘Small Is Beautiful’ and was about that book which is now considered as being about environmentalism and the developing world. Reading the book in 2018, I was struck by how relevant it is to economics in the here and now, in the West.

In October 2019, the married couple of MIT economics professors Abhijit Banerjee and Esther Duflo were co-winners of the 2019 Nobel Memorial Prize in Economics, for their work on economics in the developing world, specifically the advancement of Randomized Controlled Trials (RCTs) The following month they published ‘Good Economics for Hard Times’. In the preface of the book they said:

“Many of the issues plaguing the world right now are particularly salient in the North, whereas we have spent our lives studying poor people in poor countries. …

“We got tired of watching from a distance while the public conversation about core economic issues— immigration, trade, growth, inequality, or the environment— goes more and more off-kilter. But also because, as we thought about it, we realized the problems facing the rich countries in the world were actually often eerily familiar to those we are used to studying in the developing world— people left behind by development, ballooning inequality, lack of faith in government, fractured societies and polity, and so on.”

This thinking was ‘eerily familiar’ to me!

Graphs of inequality generally show an about turn in the US and UK in the 1970s. The likes of Milton Friedman, Reagan and Thatcher frequently get the blame. But looking at Schumacher’s 1972 book shows there is something more fundamentally wrong here.

Whilst Schumacher impressed Keynes, he was not an economist’s economist. And 1972 is a long time ago. In contrast, Banerjee and Duflo are now premier league economists. ‘Good Economics for Hard Times’ should be able to provide an authoritative, bang-up-to-date analysis of essentially the same problem, leapfrogging the countless books written about the subject in the intervening half-century.

So here I look at what Banerjee and Duflo have to say. I provide an overview of the book, quoting extensively with some paraphrasing. Where I am not paraphrasing but adding my own (minimal) comments, they are in italics. Graphs and pictures are almost exclusively not from the book.

## Contents

Here is just Part 1 of the talk.

The order of sections here does not match the order in the book. In this list of the sections here, the number in brackets refers to the particular chapter in the book.

• Growth (5)
• Climate Change (6)
• Well-Being (5)
• The Dismal Science (5)
• The End of Growth (5)
• Misallocation of Resources (5)
• The Digital Economy (5)
• Technological Change (7)
• How Important is International Trade? (3)
• Stickiness and Clustering (3)
• The Left Behind (3)
• Taken for Granted (3)
• Inequality (7)
• Winner Takes All (7)
• Reward without Talent (7)
• High Pay = Misallocation (7)
• Tax Cuts for the Rich? (5)
• Tax Avoidance (7)
• The Elite (7)
• The Pecking Order (7)
• The American Dream (7)

The topic of each book chapter can be summarized as:

1. The problem
2. Migrants
4. Polarization
5. End of Growth
6. Climate Change
7. Inequality
8. Trust in Government
9. Welfare

## Growth

The couple spend a lot of pages on GDP growth (too many really). If GDP per capita rises, people are better off on average.

And yet

“despite the best efforts of generations of economists, the deep mechanisms of persistent economic growth remain elusive”

“growth rates for the same country change drastically from decade to decade without much apparent change in anything”

My problem with GDP growth is two-fold:

• GDP growth is not helping so many people in the developed world anywhere near as much as it is helping those at the top of the developed world. GDP should be a measure of well-being. It is a crude measure but there is a reasonable correlation between GDP and well-being. (It has been suggested that this metric be replaced by an alternative to drive the economy: ‘Gross National Well-Being’.)
• The other is the environmental concern. The planet is suffering from consumption. Increasing consumption is not what we need!

We are not getting the right sort of growth – a long-term sustained (and sustainable) greater flourishing of mankind.

Finally, at the end of the chapter on growth (Chapter 5), they acknowledge the latter – and that is the subject of Chapter 6. Credit: http://www.maxgustafson.com/

## Climate Change

There is a huge disparity between rich and poor when it comes to emissions. the 50:10 rule: 10% of the global population contribute 50% of the CO2 emissions and 50% of the population only contribute 10%. Measured in terms of tons of CO2 per capita per annum, US, Western European, Chinese and Indian emissions with 22, 13, 6 and 2 respectively.

The poor are getting better off though, and this will increase emissions. A 10% rise in income leads to a 9% increase in CO2 emissions. A major source of emissions is air conditioning. Without air con, every day where the temperature exceeds 38oC results in the loss of 1 hour’s productivity. And it impacts education too. The energy consumption of the rich is largely for luxuries; for the poor, it is for necessities.

As global warming takes hold, we end up in a viscous circle. Prices of renewables will come down but increasing demand for energy will mean that fossil fuels will not drop away quickly. The couple grimly note there is:

“probably enough coal and oil under the ground to take us to Armageddon”

How should we deal with this?

The 2006 ‘Stern Review’, by Nicholas Stern, former Chief Economist at the World Bank, concluded that

“it would cost about 1% of world GDP annually … to stave off global warming”

Not a huge price to pay. But it is not that easy.

There is the

“dominance of short-term thinking”

and

“economists have a philosophical objection to manipulating preferences”

“the ideal is environmental policy is one that sets a price for damaging the environment but otherwise lets the market do its job. This is the idea behind the carbon tax. … It was key to the work of William Nordhaus, who won the Nobel Prize in 2018.”

“We understand that a carbon tax is not an easy sell … The government should structure the carbon tax in a revenue-neutral way.”

For example, in the 2019 rioting of the Gilet Jaunes:

“The argument that the Yellow Vest protestors were making was that the increase in the gasoline tax was a way for rich Parisians (who can take the subway to work) to buy themselves a conscience at the cost of people from the suburbs and countryside who had no choice but to drive their cars.”

“The government could ‘nudge’ people gently toward choices … for example, smart meters”

“Economists have begun to recognize the role of “habits” in our preferences: People even seem willing to modify their behavior in order to get ready for some future change. Thus announcing a future tax hike on goods gobbling up energy could be an easier way for people to get used to the idea.”

## Well-Being

It is not common for a book within a particular discipline to criticize the practitioners of that discipline’s profession. It would be strange if that criticism was repeated. Banerjee and Duflo make repeated comments about the disconnect between economists pursuit of GDP and well-being:

“the ultimate goal remains one of raising the quality of life of the average person”

“GDP may be one way but it is only one of the ways, and there is no presumption it is the best one”

“The key ultimately … not lose sight of the fact that GDP is a means and not an end”

“Who benefits from this GDP growth remains an afterthought”

“Nothing in either our theory or the data proves the highest GDP per capita is generally desirable”

“Economists have a tendency to adopt a notion of well-being that is often too narrow, some version of income or material consumption. And yet all of us need much more than that to have a fulfilling life: the respect of the community, the comforts of family and friends, dignity, lightness, pleasure. The focus on income alone is not just a convenient shortcut. It is a distorting lens.”

They talk of

“economists’ ongoing love affair with material consumption as a marker of well-being”.

## The Dismal Science

This is just part of their wider criticism. Elsewhere:

“Economists deserve their fair share of the blame for stoking this rhetoric.”

“something … economists do know but tend to keep closely to themselves …”

“… even though they have long known … they have taken it for granted that …”

“every economics undergraduate learns …” but the reality is “decisively less rosy”

“Part of the reason the public so easily bought into the idea … is … the repetition of this mantra by generations of serious economists”

“Even the IMF, a bastion of growth-first orthodoxy, now recognizes that sacrificing the poor to promote growth was bad policy”

(They are not alone.) Perhaps the profession deserves the sobriquet ‘The Dismal Science’ (and not in the way it was originally called).

## The End of Growth

So, having raised the issues of well-being and the environmental problems regarding growth, we return to what Banerjee and Duflo have to say about growth.

“Growth ended on October 16, 1973, or thereabouts, and is never to return … according to … Robert Gordon”.

This brought to an end ‘les Trente Glorieuses’: 30 years of glorious growth (2.5% per annum) following the Second World War., in which quality of life improved dramatically. People ate better, had better heating and cooling, lived longer and healthier, with a shorter working week and earlier retirement.

According to Gordon, this higher productivity was 14% the result of rising education and 19% the result of capital investment that gave workers more and better machines.

“The rest of the observed productivity improvement cannot be explained by changes in the things economists can measure. To make ourselves feel better, economists have given it its own name total productivity factor, or TFP. (The famous growth economist Robert Solow defined TFP to be “a measure of our ignorance”)”

“But in 1973 (or thereabouts), it all stopped”

Why would we not continue with this growth?

“Robert Gordon reminds us of our longer history. It is the 150 years between 1820 and 1970 that were exceptional, not the period of lower growth that followed.”

## Misallocation of Resources

Economists prefer the invisible hand of the perfect market to active industrial policy but they are not very good at working out what causes growth.

“A central tenet of all growth theories … is that resources are smoothly delivered to their most productive use. This is a natural hypothesis as long as markets work perfectly.”

“But this is often not true. In a given economy, productive and non-productive firms coexist, and resources do not always flow to their best use.”

“the economy does not make the best use of available resources … this is what macroeconomists call misallocation.”

A prime example of misallocation the couple provide is about Kerala fishermen landing their catch on the beaches of Kerala. The fishermen would sell until they ran out of fish or customers. Depending on which, it resulted in wasted fish or disappointed customers. When cellphones became available, they could ring ahead and choose to decide where to land. Wasted fish and disappointed customers reduced dramatically in number.

“despite the best efforts of generations of economists, the deep mechanisms of persistent economic growth remain elusive”

In their search for lost growth, Acemoglu, Johnson and Robinson focus on the institutions left by European colonizers:

• where mortality of the colonizers was high, the colony was exploited for resources.
• where mortality was low, colonizers settled, building institutions.

Even so, following Robert Lucas:

“poor countries are poor in substantial part because they make less good use of the resources they have”

There is the old Soviet joke:

“They pretend to pay us, we pretend to work”

When this changed, resources were supposedly allocated more effectively:

“many of the winners of globalization were ex-communist countries that had invested heavily in the human capital of their populations in the communist years or countries threatened with communism that had pursued similar policies for that reason (Taiwan, South Korea)”

As with Japan in the 1970s, it suggests that China’s

“growth could slow down rapidly, once those wasted resources have all been put to good use”

Other economists

“missed the key point… misallocation tells us we have to step beyond the models and think of how the resources are used”

## The Digital Economy

In 1987, Robert Solow quipped, “You can see the computer age everywhere but in the productivity statistics.” That was the computer age. The digital age – the era of the internet really took off around 1995. Growth was higher in the period 1995-2004 than beforehand (1973-1994) but has been lower than the 1973-1994 period since then. So the question is still valid.

“is the measurement of GDP, at best a bit of an exercise in guesswork, somehow missing all the joy and happiness the new economy is bringing us?”

“are nominally free, cheap to run and wildly popular. When, as is now done in GDP calculations, we judge the value of watching videos … we might grossly underestimate its contribution to well-being.”

“the cost of running Facebook, which is how it is counted in GDP, has very little to do with well-being (or ill-being) it generates.”

“Could it be there was real productivity growth, in the sense that true well-being increased, but our GDP statistics are missing this entire story?

“Robert Gordon is entirely dismissive of this possibility. In fact, he reckons Facebook is probably responsible for part for the slowdown – too many people are wasting time updating their status at work. This seems largely beside the point however. If people are actually much happier now than they were before, who are we to pass judgement … ?”

But does Facebook make people happier? One experiment, which involved more than 2000 participants paid to deactivate Facebook for a month, found that those who stopped using Facebook were happier across a range of self-reported measures of happiness and well-being and, interestingly, no more bored (perhaps less). They seemed to have found other ways to keep themselves amused, including spending more time with friends and family.”

“… All of this seems very consistent with the view that Facebook is addictive”

## Technological Change

We are just at the very beginning of the process of AI-fuelled automation.

“The Luddites were less wrong than we might assume”

“When a worker is fired, the firm is done with him, but society inherits the liability of his continued well-being”

A technological revolution might be good for the economy and for its people in the long run but that is no consolation for those hit badly during the transitioning of the technology.

“Blue-collar wages in Britain were almost halved between 1755 and 1802 … they would recover their 1755 levels only in 1820, sixty-five years later. This period of intense technological change in the United Kingdom was also an era of intense deprivation.”

“Those were Hard Times indeed”

Automation should be regulated. It can be excessive, replacing workers

“even when robots are less productive than people. …

“One reason is the bias in the US tax code, which taxes labor at a higher rate than capital”

“Eighty-five percent of Americans would support limiting automation to ‘dangerous and dirty jobs’”

“The last thirty years of US history should convince us that the evolution of inequality is not the by-product of technological changes we do not control: it is the result of policy decisions”

This all concurs with what I said previously: It is not technology that is the cause of the problem here. It is the failure of our politics to manage that technological change.

“Economists mostly talk about the gains of trade. The idea that free trade is beneficial is one of the oldest propositions in modern economics. As the English stockbroker and member of Parliament David Ricardo explained two centuries ago, since trade allows each country to specialize in what it does best, total income ought to go up everywhere when there is trade, and as a result the gains to winners from trade must exceed the losses to losers. The last two hundred years have given us a chance to refine this theory, but it is a rare economist who fails to be compelled by its essential logic. Indeed, it is so rooted in our culture that we sometimes forget the case for free trade is by no means self-evident.

“For one, the general public is certainly not convinced. They are not blind to the gains of trade, but they also see the pains.”

“one of Paul Samuelson’s most famous papers purports to tell us exactly who the losers are. Ricardo’s entire discussion assumed production required only labor, and all workers are identical, so when the economy becomes richer everyone benefitted. Once there is capital as well as labor, things are not that simple.”

“Between 1991 and 2013, the United States was hit by the “China shock.” China’s share of world manufacturing exports grew from 2.3% in 1991 to 18.8% in 2013.”

“opening trade between the United States and China should hurt US workers’ wages”

“workers in the United States can be made better off if society taxes the winners from free trade and distributes that money to the losers. The problem is this is a big “if,” which leaves workers at the mercy of the political process.”

## How Important is International Trade?

“The idea that more trade is good (on balance) is deeply engrained … in economics … there is something … economists do know but tend to keep closely to themselves: the aggregate gains from trade, for a large economy like the United States, are actually quantitatively quite small. The truth is, if the US were to go back to complete autarky, not trading with anyone, it would be poorer. But not that much poorer.”

There is only indirect reference to Trump’s wall. Quoting others:

“About 8 cents of every dollar spent in the United States is spent on imports. What if, because of a wall or some other extreme policy intervention, these goods were to remain on the other side of the US border?”

“The gains from trade about 2.5% of GDP. This is really not a lot.”

“1 year of decent growth could pay for sending the US economy into complete autarky, in perpetuity!”

“The US import share … is one of the lowest in the world.”

Of course, the United States is a very large economy, representing about 15% of World GDP. In comparison,

“Belgium, a small open economy, has an import share of above 30%.”

and only represents 0.6% of World GDP.

## Stickiness and Clustering

The impact of the exporting of industries to the developing world is exacerbated by two factors: stickiness and clustering.

Stickiness applies to labour as well as capital. Sticky lending means that existing firms that should have been put out of their misery continue to hang on whilst new businesses have a hard time raising capital. Firms are sticky: managers find the transition process costly in cost and effort to re-purpose (re-train) employees. People are sticky too. They don’t just up sticks and travel half-way across the country for work. They have family ties, upon which they rely on. And they are risk-averse. This is true for migrants too. Most will not emigrate unless things have got a lot worse at home.

Clustering: Skills are mobile – up to a point. Stickiness is much reduced if, when a company fails, employees can just move to other jobs within commuting distance. This is how new companies can thrive in the good times. They can benefit from these local skills. And they benefit through geographical association; it is easier to get contracts when you are located where buyers are used to buying from. Hence we get clustering of

• children’s wear factories in Huzhou, China,
• biotech industries around Boston, MA,
• clock manufacturing in Michigan,
• golf equipment in Carlsbad, CA, and
• furniture and textiles in Tennessee.

But this is bad news when the jobs go because there are no similar jobs nearby because they are going too. And there are hardly any jobs in other lines of work in the locality because people can no long afford to spend.

## The Left Behind

The sticky economy became a sticky trap. We end up with

“Left-behind people in left-behind places.”

Towns like Bruceton, Tennessee are unable to entice prospective new employees to the area: the company executives are put off by the depressing Main Street and decide they don’t want to live there.

We end up with a downward spiral:

• Of fewer marriages and fewer children born,
• Students less likely to graduate
• Depression
• and “Deaths of despair”, from drugs, alcohol and suicides.

The communities are left with a

“deep hopelessness once associated with African American communities.”

## Taken for Granted

So whilst

the reality is

“decisively less rosy”.

“Taken together, the exchange of goods, people, ideas, and cultures made the world much richer. Those lucky enough to be in the right place at the right time, with the right skills or the right ideas, grew wealthy, sometimes fabulously so.”

“For the rest, the experience has been mixed. Jobs were lost and not replaced.”

“Economists and policy makers were blindsided by the hostile reaction to free trade, even though they have long known that as a class, workers were likely to suffer from trade in rich countries and … have taken it for granted that workers would be able to move jobs or places”

“Trade has also created a more volatile world where jobs suddenly vanished only to turn up a thousand miles away. The gains and the pains ended up being very unequally distributed and they are, very clearly, starting to bite back at us”

Collectively growing richer is no consolation for the left behind.

Now,

“At a political level, spending large sums of money is an admission that trade adjustment costs are in fact large”

“The right way to pay for this is to use general tax revenue. To the extent we are all benefitting from trade, we should collectively pay for the cost.”

But there are practical difficulties of intervening to redress these issues. A proposal

“may be seen as a form of trade protection and run afoul of WTO rules”

“The overarching takeaway is that we need to address the pain that goes with the need to change, to move, to lose one’s understanding of what is a good life and a good job. Economists and policy makers were blindsided by the hostile reaction to free trade, even though they have long known that as a class workers were likely to suffer from trade in rich countries and benefit from it in poor countries. The reason is they have taken it for granted that workers would be able to move jobs or places, or both, and if they were not able to do this, it was somehow their failing. This belief has colored social policy, and set up the conflict between the ‘losers’ and the rest that we are experiencing today.”

## Inequality

I am astonished to discover that

“In the United States, the top marginal tax rate was above 90% from 1953 to 1963.”

But

“the 1980s ushered a dramatic change in the social contract in the US and the UK. Whatever economic growth happened since 1980 bas been, for all intents and purposed, siphoned off by the rich.”

“For both Margaret Thatcher in the UK and Ronald Reagan in the US, what was to blame for the slump in the late 1970s was clear (though we now know they really had no idea). The countries had drifted too far to the left – unions were too strong, the minimum wage was too high, taxes were too onerous, regulation was overbearing.”

“At the same time, around 1980, wages stopped increasing, at least for the least educated.”

“Average real wage in 2014 was no higher than in 1979. Over the same period (from 1979 to today) ,the real wages for the least educated actually fell.”

“Those making, say, between \$100,000 and \$200,000 a year have seen their pay increase only slightly more rapidly than the average, while those who are making more \$500,000 have seen their incomes explode.”

The fact that this great reversal takes place during the Reagan and Thatcher years is probably not coincidental. … Their election was also a symptom of the politics of the time …  It is not impossible that if they had lost, whoever won would have gone some distance along the same path.

“More importantly, it is not a priori obvious that Reagan-Thatcher policies were the main reason why inequality went up.”

Perhaps the end of growth after the rebound from World War 2 has run its course. Or China. post-Mao. Or some mixture. But this does not explain differences between the Anglo economies as other developed countries. See the ‘World Wealth and Income Database (2018)’ graphs below. ‘World Wealth and Income Database (2018) Credit: OurWorldInData.org

“The diagnosis of what actually happened … has been and continues to be an active area of debate within economics, with some, like Thomas Piketty, squarely blaming changes in policies, while most economists emphasize that the structural transformation of the economy, and particular changes in technologies, also had a lot to do with it.”

## Winner Takes All

Clustering doesn’t just work against us in the bad times. It can work against us in the good times too.

“the highest-paid workers in low-paying firms are moving to those that pay more. More productive workers are increasingly working with other high-productivity worker, in which superstar firms attract both capital and good workers.”

“doubling the number of high-skilled people in the Valley makes all of them more productive”

“a firm in a failing local economy that sell to the national or world economy, can do well, for a while, until the overall drain in talent as the young and the ambitious move out.”

Larger companies are accounting for an increasing proportion of the sectors they operate in.

“We are doomed to end up with monopolies.”

particularly as there is a

“fairly liberal attitude on mergers in the United States.”

“If monopoly is guaranteed forever, … innovation and growth may slow down; a monopolist can sit on their hands and never invent anything new.”

This means that it is

“increasingly difficult for new entrants”

to access the market. This damages growth.

“There is the misallocation we do not see, the great ideas that never see the light of day”

As E. F. Schumacher said: in ‘Small is Beautiful’: “Nothing succeeds like success and, equally, nothing fails like failure.”

Duflo and Banerjee continue:

“The problem of course is that small is not beautiful. A minimum scale is required to allow firms to employ specialized workers or to use high-productivity machines. Small firms are much less productive than larger ones.”

and

“Firms can only be large if the market is large.”

## Reward without Talent

“Once a firm has invested in a galaxy of talents, the CEO of such a firm is in a position to make a big difference” … hence “an obscene salary”

“compensation committees … use the salary of other CEOs as a benchmark. This creates a contagion”

“normalizing the idea that CEO pay was directly linked to shareholder value”

“managers’ pay was no longer linked to the salary scale within the enterprise. When everyone was on the same scale, the CEO had to grow the salaries at the bottom to increase their own.”

When CEO pay is directly linked to shareholder value, it is in the rational interest of the CEO to maximize share price in the short term rather than hold out for a possibly higher / possibly lower value in the future, when he may no longer be CEO of that company. Long-term considerations are ignored.

“The one exception, which in some ways proves the rule, is that CEOs of companies where there is a single large shareholder who sits on the board (and is vigilant because it is his own money on the line) get paid significantly less for luck than for genuinely productive management.”

Following

“the top tax rate was brought down to 70% in 1965”

there is the following curious statement:

“At 70% marginal tax rate … it makes salary less valuable for the CEO, and it becomes cheaper for the board to pay the CEO in other ‘currencies,’ such as allowing him to pursue his dream projects. … For example, the CEO might prioritize growing the firm.”

And regarding the status of CEOs:

“When the government takes 70% of the money … the highest-paid job is still the highest-paid job”

To prove the argument, in Denmark (top tax rate 62%)

“top incomes never went sky high”

What differentiates Denmark (and most other countries) from

“the US and UK dominate the ‘high end’ of finance”

“Financial sector employees are now paid 50-60% more than other workers of comparable skills.”

“But it is hard to see why an ordinary manager in the financial sector is nonetheless paid extraordinary fees, year after year. In fact, most years, actively managed funds do not do any better than ‘passive funds’ that simply replicate the stock market index. In fact, the average US mutual funds underperform the US stock market…”

The high pay justifies the fund managers’ rational strategy of staying with the herd for as long as you can hold on to the job for fear-of-missing-out on subsequent salary, again focussing on short-term performance.

And then the killer lines:

“They seem to have borrowed the language of individual talent but not the talent itself.”

“The superstar narrative does not fit finance very well. Finance is not a team sport.”

In sport

“the one argument no one makes is that players would play harder if only they were paid a little (or a lot) more. Everyone agrees that the drive to be the best is sufficient.”

## High Pay = Misallocation

“The reason to be concerned about this is that if some job pays a premium unrelated to its usefulness, like the fund managers earning a fat fee for doing nothing, or the many talented MIT engineers and scientists hired to write software that allows stock trading at millisecond frequencies, then talented people are lost to firms that might do something more socially useful. Faster trading may be profitable … but … it seems implausible that it improves allocation of resources in the economy in any meaningful way.”

“Maybe in a saner world they would have been … curing pancreatic cancer.”

“A silver lining of the 2008 crisis is that it reduced the appeal of the financial sector for the brightest minds; a study of career choices of MIT graduates found those who graduated in 2009 were 45% less likely to choose finance than those who graduated between 2006 and 2008. This may lead to better allocation of talent”

## Tax Cuts for the Rich?

Banerjee and Duflo devote quite a few pages to Paul Romer (Chief Economist, World Bank and 2018 Nobel laureate):

“for Romer the government needs to get out of the way of stifling incentives to work hard and invent … in other words, cut taxes”

Supposedly

“Low tax rates are necessary at the top because the likes of Bill Gates need to be given the incentive to work hard, be creative, and invent the next Microsoft.”

God forbid anyone invents any more products as bad a Microsoft: see note on guaranteed monopolies earlier. I doubt that tax rates would have made any difference to Bill Gates long before he had earned his first billion. Indeed, he has chosen to add more meaning to his life by giving almost all his wealth away, effectively imposing a 99.96% tax rate on himself.

But

“The low tax rate ushered in by Reagan did not deliver faster growth.”

and astoundingly

“It was not always like that. Top tax rates were above 77% for the period 1936-1964, and above 90% for about half that period, mostly in the 1950s under a solidly right of center Republican administration.”

“tax cuts benefitting the top 10% produce no growth in employment and income, whereas tax cuts for the bottom 90% do”

“Part of the reason the public so easily bought into the idea that tax cuts for the wealthy lead to economic growth is … the repetition of this mantra by generations of serious economists”

## Tax Avoidance

The rich react to a tax rise by finding ways to not pay taxes:

“Lionel Messi was found guilty on 3 counts of defrauding tax authorities of €4.1 million.”

“Cristiano Ronaldo left Spain for Italy to lower his tax bill.”

The problem is:

“It is easy to avoid the taxes … particularly in small European countries where people can move or park their wealth abroad. This gives rise to a race to the bottom on tax rates.

“We should not lose sight of the fact, however, that all of this happens in part because the world tolerates tax evasion.”

“Gabriel Zucman has convincingly argued that there are many relatively straightforward things that would help a lot in limiting tax evasion and tax avoidance. Among his ideas are to create a global financial registry, … reform the corporate tax system such that the global profits of multinational firms are apportioned to where they make their sales and to more strongly regulate banks and law firms that help people evade taxes through tax havens. … The three steps Zucman recommends may be particularly tricky since they involve international cooperation”

## The Elite

“we seem to be in the midst of a vicious cycle of concentration of political and economic power. As the rich become richer, they have more interest and more resources to organize society to stay that way”

The wealthy are good at lobbying and finance politicians’ campaigns.

“It seems unlikely this can continue unfettered without generating a massive backlash. The populist uprising in the United States is the beginning of this backlash. Behind it is a profound sense of disempowerment, a feeling, right or wrong, that the elites always decide”

“The obsession with growth at the root of the Reagan-Thatcher revolution, and that no subsequent president has taken issue with, has caused lasting damage. When the benefits of growth are largely captured by a small elite, growth can be a recipe for a social disaster.”

“Citizens’ confidence in society’s ability to deal with this issue might be permanently undermined.”

## The Pecking Order

Meanwhile, lower down the pecking order…

“people’s sense of self-worth is related to the position in the groups they see themselves as part of”

“increased awareness about one’s place in the distribution of income increases the extent to which a persons happiness depends on income.”

“It is impossible for those who are stuck to not be aware that the rest of the world looks like it is moving ahead. The flip side of this is the impulse to show the world that we too are able to ‘keep up with the Joneses’.”

“The awareness of one’s place on the totem pole does seem to affect well-being.”

“The urge to show off is less strong when people feel good about themselves.”

“This creates a vicious cycle, with people who feel economically vulnerable being particularly eager to demonstrate their worth through useless purchases they can ill-afford and an industry all to ready to provide these services for a handsome fee.”

A pecking order might reduce stress in baboons to maintain order within a troop but we humans should be able to do better than that. We cannot say that our society of inequality and conspicuous consumption is a strong, happy society.

## The American Dream

But still

“Americans tend to believe, in spite of everything, that although their society is unequal, it rewards industry and effort.”

In the past

“Everyone had a shot at striking it rich.”

and

“that the poor must be in part responsible for their own plight”

Tied in with ‘American work ethic’,

“Americans still believe in this American dream, though as a point of fact, heredity plays a greater role in the fortune of today’s Americans than it does in Europe.”

“Within the OECD, the child from the bottom quintile most likely to remain stuck in the bottom quintile is from the US (33.1%), while the least likely is from Sweden (26.7%).”

“The probability of moving to the top quintile is 7.8% in the US, but close to 11% on average in Europe.”

“By all measures, despair is on the rise in today’s America, and it has become deadly. There has been an unprecedented increase in mortality among less- educated white in middle age and decrease in life expectancy. … This grim trend is specific to US whites, and in particular to US whites without college degrees. …

“Other English-speaking countries that have pursued a broadly similar social model to the US, namely the UK, Australia, Ireland and Canada, are also going through a similar change, albeit in slow motion. In all other wealthy countries, on the other hand, mortality is going down, and going down faster for the uneducated … than for the educated.” Source: Miles Corak. Credit: americanprogress.org

[to be continued…]

## The Bayesian Origins of Neural Nets

[This post is on the ‘Artificial Intelligence’ strand of this blogsite]

TL;DR: Just skip down to the big picture and read the ‘The End Result’ section.

### Introduction

As part of my long journey towards understanding Karl Friston’s ‘Free Energy’ theory of the brain, I stumbled upon Oleg Solopchuk’s excellent “Tutorial on Active Inference” medium.com article. At some point I will give it full justice by reading in detail; even as a tutorial, it is not a light read. For example:

“So by minimizing Free Energy, the arbitrary distribution q(s) gets closer to the posterior p(s|o), and if they match, KL term becomes 0 and Free Energy is exactly equal to surprise. So the more we minimize Free Energy by wiggling q(s), the more it becomes closer to surprise (since it’s an upper bound), and by minimizing Free Energy by wiggling parameters of p(o,s) we can minimize surprise even further.”

This is typical of impenetrable Fristonian text. The problem here is the complexity of the underlying ideas, not the Herculean task of trying to explain it. Trying to understand Friston is notoriously difficult. Reading the deluge of complex math strung together with long sequences of ordinary words forming nonsensical sentences, it becomes funny rather than profound (see the parody Twitter account @FarlKriston).

In ‘God Help Us, Let’s Try To Understand Friston On Free Energy’, Scott Alexander indirectly quotes someone’s Neuropsychoanalysis paper (without citing):

“At Columbia’s psychiatry department, I recently led a journal club for 15 PET and fMRI researchers, PhDs and MDs all, with well over \$10 million in NIH grants between us, and we tried to understand Friston’s 2010 Nature Reviews Neuroscience paper – for an hour and a half. There was a lot of mathematical knowledge in the room: three statisticians, two physicists, a physical chemist, a nuclear physicist, and a large group of neuroimagers – but apparently we didn’t have what it took. I met with a Princeton physicist, a Stanford neurophysiologist, a Cold Springs Harbor neurobiologist to discuss the paper. Again blanks, one and all.”

He goes on:

“Normally this is the point at which I give up and say “screw it”. But almost all the most interesting neuroscience of the past decade involves this guy in one way or another. He’s the most-cited living neuroscientist, invented large parts of modern brain imaging, … His Am I Autistic – An Intellectual Autobiography short essay, written in a weirdly lucid style and describing hijinks like deriving the Schrodinger equation for fun in school, is as consistent with genius as anything I’ve ever read.”

Oleg’s article is a big step-up. He strongly recommends Rafal Bogacz’s tutorial and is right to do so – it is another big step up the long ladder up to the lofty heights of a real understanding of the ‘Free Energy’ theory of the brain.

My post here is an overview of Bogacz’s tutorial – a rung or two below that on the ladder. In increasing levels of technicality (and general impenetrability), some of the rungs are:

1. Previous posts by me are near the bottom: ‘Entropy, Intelligence and Life’ and ‘Free Energy: Hierarchical Message Passing’ (both as part of the wider ‘Intelligence and the Brain’ talk).
2. This post.
3. Oleg Solopchuk’s “Intuitions on predictive coding and the free energy principle”  excellent medium.com post (“This non-technical post is on theoretical neuroscience and aims to give simple visual intuitions without entering into the underlying math”)
4. Oleg Solopchuk’s “Tutorial on Active Inference” medium.com article.
5. Rafal Bogacz’s tutorial.
6. “An Approximation of the Error Backpropagation Algorithm in a Predictive Coding Network with Local Hebbian Synaptic Plasticity” by James C. R. Whittington and Rafal Bogacz naturally follows on from the above,
7. The many original papers by Karl Friston himself, such as ‘The free-energy principle: a unified brain theory?’.

(Emphasize: These are only some of the rungs.)

At the top of the ladder, I suspect there are less than 100 people in the world who really understand. At the bottom rung, my aim is to provide an understanding potentially to hundreds of millions of people [hmmm: number of blog visitors might need to increase a bit though]:

• old enough to have completed high-school in order to have been exposed to sufficient math (differentiation, probability, …).
• young enough to be exposed to programming at school and able to code (preferably in Python, the most widely taught language in schools).

Bogacz’s paper includes ‘Try this Yourself’ suggestions for going through the math equations to get a proper understanding. He also has programming exercises, with Matlab answers in the appendix. I have previously said, this is where real understanding comes from – being able to replicate the behaviour in a simulation:

Andy Clark tells a tale of how, years ago, the philosopher Dan Dennett was talking to an eminent professor of paleontology (who he does not name). The professor complained that his students were cheating at stratigraphy by just copying out diagrams from books. Dennett responded that he should get them to play a computer game. If you want to test their understanding of stratigraphy, you don’t ask the question ‘describe how such-and-such a layer is formed’; you provide them with a game in which various knobs cause various events, such as: Deposition of sediment, erosion, intrusion of lava, and controlled fracture.

If the students can play the game by twiddling the knobs in a correct order and with the correct intensity to create the required layers, they have demonstrated that they really do understand the process and can be marked accordingly.

Note: The section numbers in the following sections map across to those in the Bogacz paper.

### 2: Simplest example of perception

Sense inputs are noisy and non-linear: there is a difference between how things actually are in the outside environment and how they have been sensed. (It might be useful to think of this in Kantian terms: the noumenal and the phenomenal respectively). My example: sounds made by another, with background noise as well, are converted to irregular firings of multiple neurons, each neuron tuned to respond to a particular frequency, not necessarily spaced at regular frequency intervals.

Bogacz’s example is of light intensity being indicative of object size. At one instant in time, the light intensity might indicate an object of size s1. But the probability of there actually being an object of size s1 might be very low. It might be more likely that what has been observed is an object of size s2, with a particular level of noise. Bayes theorem can help us in this circumstance to make best guesses about the size of the object.

#### Aside: Probabilities and Bayes Theorem

(See Markov Blankets and the Goat in the Gameshow for previous introductions to Bayes theorem.)

Probabilities may be either ‘marginal’, ‘joint’ or ‘conditional’:

• Marginal probability: unconditional P(A) = P(playing pack card is a heart) is just ¼. P(card is a 4) is just 1/13.
• Joint probability:  P(B) = P(4 of hearts) = P(A ∩ B) = ¼ x 1/13 = 1/52.
• Conditional probability:  P(A|B) is the probability of event A occurring, given that event B occurs. Example: P(king)=1/13 and P(face-card)=3/13 but P(king | face-card)=1/3.

We can relate different types of probability as follows:

• Pconditional(H|E) = Pjoint(H ∩ E)/Pmarginal(E).

And noting that joint probabilities are ‘commutative’:

• Pjoint(H ∩ E) = Pjoint(E ∩ H)

Then:

• Pjoint(H ∩ E) = Pconditional(H|E).Pmarginal(E)
• Pjoint(E ∩ H) = Pconditional(E|H).Pmarginal(H)

hence Bayes Theorem:

• Pconditional(H|E).Pmarginal(E) = Pconditional(E|H).Pmarginal(H)

or rearranging to the more familiar form:

• Pconditional(H|E) = Pconditional(E|H).Pmarginal(H)/Pmarginal(E)

which expresses a ‘posterior’ in terms of the ‘likelihood’, ‘prior’ and ‘evidence’ (respectively):

• P(H|E) = P(E|H).P(H)/P(E)

#### 2.1: Exact solution

In Bayesian terms in this instance:

• the ‘hypothesis’ is that the size is a particular value.
• The ‘evidence’ is a particular light intensity.
• The ‘prior’ is the probability before having being given the evidence.
• The ‘posterior’ is the probability after having being given the evidence.

It is important to remember that the priors and hypotheses are probability distributions. These are represented by multiple probabilities across the full range of all possible values. The probability distribution for a die is across 6 possible values – each with probability 1/6. We may over time learn that the die is unfair and provide a different distribution.

#### 2.2: Finding the most likely feature value

It is impractical to store such distributions in our heads. Distributions will tend to go from low at extremes to a peak somewhere. We can approximate a distribution as a Gaussian function with just 2 numbers: a mean (average) and a variance (how spread out).

The size estimate will change over time, dependent upon the changing environment plus noise. The best guess of the size is at the peak of the probability distribution. We can find the location of that peak using gradient descent: Note: gradient descent (specifically, stochastic gradient descent) is a standard technique in training neural nets. For this, we use differentiation with respect to time to hunt for the peak:

• If the slope at our current guess is positive, it must be lower to the left hence we shift left by subtracting something from our guess value (proportionate to the slope).
• If the slope at our current guess is negative, it must be lower to the right hence we increase our guess value.

Iterating, we home in to the best guess.

#### 2.3: A possible neural implementation

‘Prediction errors’ are an important concept in understanding ‘Predictive Processing’. The gradient descent math is simplified by using two intermediate terms which are prediction errors:

• how much the light intensity differs from that expected at a particular size.
• how much the inferred size differs from prior expectations.

The equations can be converted to cyclic directed graphs: ‘bubble diagrams’ where each bubble represents a neuron, or vector of neurons.

Because of this relationship between the equations and neurons, we can associate the neurons with some original Bayesian purpose – such as representing the mean or the variance of a particular expected probability distribution.

I’m not going to show step-by-step bubble diagrams for each section here (you can refer to the original paper from those) but just show an overall bubble diagram at the end which brings all the ideas together.

#### 2.4. Learning model parameters and 2.5. Learning the relationship between variables

As time goes by, we need to update our expectations based on previous experience. Differentiating with respect to the model parameters (the mean and variance) rather than time shows us how to update those mean and variance values.

The differentials can be expressed solely in terms of the neuronal values and a single parameter. This gives us something that maps very well to real neuron ‘learning’: Hebbian learning where ‘neurons that fire together wire together’ rather than relying on mechanisms like in back-propagation in artificial neural networks that are not biologically plausible.

[An exercise for the future: It would be nice to compare what we have here with online implementations of Gaussian Mixture Models, where data is fitted to one of a number of Gaussian curves, with the means and variances of those Gaussians being adjusted on-the-fly.]

### 3: Free-energy

Here I just skip over the math and discussion around (Friston’s) Free Energy concepts here. That is for another time. Here, I am just interested in getting from Bayes theorem to multi-layer neural nets.

### 4: Scaling up the model of perception

Here, sections 4 and 5 just explain development of the framework into a full network of neurons. You can ignore these sections if you want and jump straight to ‘The End Result’.

#### 4.1. Increasing the dimension of sensory input

Scaling up from a network that tracks the distribution of just one parameter, to one with many parameters just means converting all the operators to vectors.

#### 4.2. Introducing hierarchy

The bubble diagrams have been drawn with inputs coming in from the left and constants coming in from the right. This single-stage can be easily scaled up so that inputs cascade between stages from left to right, and the signals from the right being feedback signals from the stage above. This layering is what is found in the 6-layer neocortex of real brains.

Having already made the transformation to vectors in section 4.1 to handle multiple neurons per layer, there is a fully-connected matrix of connections between the two different types of neurons (ε and φ) within each layer.

### 5. Local plasticity

Scaling up to multiple neurons per layer has an unfortunate consequence of there needing to be a matrix inversion to handle learning. This means that every neuron relies upon every other neuron in order to learn which is biologically implausible.

To work around this, extra neurons are added (e neurons in the diagrams, fully connected to the ε neurons) with the benefit of having each connection’s updating dependent only on the (local) neurons (nodes) at the two ends of the connection i.e. the learning is Hebbian. ### The End Result

The end result of Bogacz’s tutorial journey is a network of neurons as shown above (not shown in so complete a form in the original paper).

The diagram above shows a 4-layer neural network, with 2 inputs per layer. There is an input layer (green), 3 internal layers (blue, red and orange) and a top stub (purple). This is the result of a journey that started with Bayesian estimation of noisy inputs. Those original Bayesian distributions with mean vp and variance , with a non-linear input function u=g(v)=θ.h(v) and producing a peak value φ, have been through many transformations, with simplifications through the introduction of ε error terms, into what looks very much like a multilayer neural network. That network still shows signs of its origins in referencing the , θ, ε, vp and φ terms.

Each layer of the network is fully-interconnected to the layer above (towards the purple) with neural weights θ. A φ neuron has an (upstream) output that is a non-linear function of a weighted sum of lower-level neuron outputs. The θ terms are the weights.

In neural nets:

• that non-linear function is often the sigmoid function and
• the weights are adjusted during back-propagation based on the weights and the derivative of the non-linear function.

Here, the weights are indeed the same in both directions (ε forward to vp and φ backwards to ε) but the function and its derivative are the other way around.

The e neurons (linear nodes) have been added to help compute the prediction error and to learn the variance of the corresponding feature with a local Hebbian plasticity rule. Note that since all the e neurons are full-connected to the ε neurons, the weights indicate co-variances.

The Hebbian rule is that the weight change is based on the pre- (ε) and post- (φ) synaptic neurons, and only them (“Neurons that fire together, wire together”). This is true for both:

• θ weights between ε and φ: The weights are updated based on the gradient ∂F/∂θ = ε.h(φ)
• weights between ε and e: The variance weights are updated according to the rule Δ = α(ε.e − 1) where α is the learning rate parameter.

In the former case, the updates are proportional to the product of the pre- (ε) and post- (φ) synaptic neuron states. In the latter case, this is nearly so.

### Python Code Solutions to the Exercises

The Bogacz paper includes Matlab code solutions to the exercises (numbered 1 thru 5, excepting exercise 4 which is not a programming exercise). As you are unlikely to have Matlab to run the code, you could install Octave instead which should run, perhaps with some minor modifications.

Here, I provide Python alternatives (exercises 1, 2 and 3 have been written from scratch. Exercise 5 has been lazily recoded from the paper’s Matlab solution). As with Python code in previous posts, if you don’t know how to install and run Python code, the simplest thing to do is copy and paste the code into an online Python 3 interpreter, such as https://repl.it/@jef12345/python-361 (and click run).

All exercises have been combined into a single piece of code. Just set the exercise_number variable to the number you want.

```#! /usr/bin/env python

#import os, sys, random, time
import numpy as np
from scipy.stats import norm
import matplotlib
import matplotlib.pyplot as plt
import math

exercise_number = 5

if exercise_number <= 3: # Common between exercises
def function_g(size):
average_light_intensity = size*size
return average_light_intensity

def function_g_tick(size): # derivative of g w.r.t. phi (peak size probability)
derivative = 2*size
return derivative

def function_f(x, mean, variance):
std_deviation = math.sqrt(variance)
return norm.pdf(x, mean, std_deviation)

# Bayes theorem
# P(H|E)= P(E|H).P(H)/P(E)
# H:        hypothesis
# E:        evidence
# P(H):     prior
# P(H | E): posterior
# P(E | H): likelihood
# P(E):     evidence
def bayes_posterior(prior, likelihood, evidence):
return prior * likelihood / evidence

# observed light intensity
u = 2.0
sigma_u = 1.0 # variance of the light intensity
# prior: experienced inputs over the +/-1 std_deviation interval 2...4:
v_p = 3.0
sigma_p = 1.0

if exercise_number == 1:
print('#########################################')
print('Exercise 1')
print('#########################################')

integral = 0 # scalar integral
steps = 500
v           = np.zeros(steps) # actual size
p_v         = np.zeros(steps)
p_u_given_v = np.zeros(steps)
p_v_given_u = np.zeros(steps)
for i in range(steps):
v[i] = (i+1)/100 # size 0.01...5
p_v_given_u[i] = (i+1)/100
p_v[i]         = function_f(v[i], mean=v_p, variance=sigma_p)
p_u_given_v[i] = function_f(u, mean=function_g(v[i]), variance=sigma_u)
integral += p_v[i] * p_u_given_v[i] # integral( p_v * p_u_given_v, dv)

print("integral = %10.5f" % integral)
# integral should be 1 but it is actually larger
p_u = 1/integral # scale results to ensure that the posterior probabilities of all sizes p(v|u) integrate to 1

for i in range(steps):
p_v_given_u[i] = bayes_posterior(prior=p_v[i], likelihood=p_u_given_v[i], evidence=p_u)
#print("%P(v)=10.5f P(u|v)=10.5f P(u)=10.5f " % (p_v[i], p_u_given_v[i], p_u))
temp = p_v[i] * p_u_given_v[i] / p_u
#p_v_given_u[i] = temp
#print('%10.3f   %10.3f' % (x_array[n], y_array[n]))

if exercise_number == 2:
print('#########################################')
print('Exercise 2')
print('#########################################')

def dF_by_dphi(t, phi):
first_term  = (v_p - phi)/sigma_p
second_term = (u - function_g(phi))/sigma_u * function_g_tick(phi)/sigma_u
print("dF_by_dphi: t=%3d phi=%10.5f term1=%10.5f term2=%10.5f" % (t, phi, first_term, second_term))
return first_term + second_term

delta_t = 0.01

tsteps = 500
phi    = np.zeros(tsteps)
for t in range(tsteps-1):
phi[t+1] = phi[t] + delta_t * dF_by_dphi(t, phi[t])
#print("t=%3d phi[t]=%10.5f" % (t+1, phi[t+1]))

if exercise_number == 3:
print('#########################################')
print('Exercise 3')
print('#########################################')

delta_t = 0.01

tsteps = 500
phi    = np.zeros(tsteps)
epsilon_u    = np.zeros(tsteps)
epsilon_p    = np.zeros(tsteps)
epsilon_u = 0
epsilon_p = 0
for t in range(tsteps-1):
# From the equations...
# Errors:
#epsilon_u = (u - function_g(phi))/sigma_u #(11)
#epsilon_p = (phi - v_p)/sigma_p #(10)
#dF_by_dphi = -epsilon_p + epsilon_u * function_g_tick(phi)
#phi = phi + dF_by_dphi * delta_t

# From above, we can derive Figure 2 relationships...
epsilon_u_dot = u   - function_g(phi[t]) - sigma_u * epsilon_u[t]  #(14)
epsilon_u[t+1] = epsilon_u[t] + epsilon_u_dot * delta_t

epsilon_p_dot = phi[t] - v_p             - sigma_p * epsilon_p[t]  #(13)
epsilon_p[t+1] = epsilon_p[t] + epsilon_p_dot * delta_t

dF_by_dphi = -epsilon_p[t] + epsilon_u[t] * function_g_tick(phi[t])
phi[t+1] = phi[t] + dF_by_dphi * delta_t
print("t=%3d phi[t]=%10.5f e_p[t]=%10.5f e_u[t]=%10.5f" % (tsteps-1, phi[tsteps-1], epsilon_p[tsteps-1], epsilon_u[tsteps-1]))

if exercise_number == 5:
print('#########################################')
print('Exercise 5')
print('#########################################')

mean_phi  = 5;    # mean of input from the current level
Sigma_phi = 2;    # variance of input from the current level
phi_above = 5;    # input from the level above
NUM_STEPS = 2000
DT        = 0.01  # integration step
MAXT      = NUM_STEPS * DT    # maximum time considered
print("Simulate from t=0 to t=%d" % MAXT)

error     = np.zeros(NUM_STEPS+1)
e         = np.zeros(NUM_STEPS+1)
TRIALS    = 1000  # number of simulated trials
LRATE     = 0.01  # learning rate
Sigma     = np.zeros(TRIALS+1)
Sigma  = 1     # initializing the value of weight
for trial in range(1, TRIALS+1):
error  = 0 # initializing the prediction error
e      = 0 # initializing the interneuron
phi = mean_phi + math.sqrt(Sigma_phi) * np.random.normal()
for i in range(1, NUM_STEPS+1):
error[i] = error[i-1] + DT * (phi - phi_above - e[i-1])
e[i]      = e[i-1]    + DT * (Sigma[trial-1] * error[i-1] - e[i-1])
Sigma[trial] = Sigma[trial-1] + LRATE * (error[NUM_STEPS]*e[NUM_STEPS] - 1)

print('Plotting...')
matplotlib.style.use('default') # or 'seaborn'
# initialise graph
fig1, ax1 = plt.subplots(figsize=(6, 3))    # size of graph window

if exercise_number == 1:
ax1.set_title('Exercise 1'.format('default'), color='C0')
# Distribution plots
ax1.plot(range(steps), p_v,         'b', label='P(v)',      gid = 'AX1_Pv')
ax1.plot(range(steps), p_u_given_v, 'y', label='P(u | v)',  gid = 'AX1_Pu_given_v')
ax1.plot(range(steps), p_v_given_u, 'g', label='P(v | u)',  gid = 'AX1_Pv_given_u')

# Graph niceties: ranges and legend
ax1.set_xlim(left=0, right=500)
ax1.set_ylim(bottom=0.0, top=1.0)

if exercise_number == 2:
ax1.set_title('Exercise 2'.format('default'), color='C0')
# Distribution plots
ax1.plot(range(tsteps), phi,         'b', label='phi',      gid = 'AX1_phi')

# Graph niceties: ranges and legend
ax1.set_xlim(left=0, right=tsteps)
ax1.set_ylim(bottom=-0.1, top=5.0)

if exercise_number == 3:
ax1.set_title('Exercise 3'.format('default'), color='C0')
# Distribution plots
ax1.plot(range(tsteps), phi,       'b', label='phi',      gid = 'AX1_phi')
ax1.plot(range(tsteps), epsilon_u, 'y', label='epsilon_u',  gid = 'AX1_epsilon_u')
ax1.plot(range(tsteps), epsilon_p, 'g', label='epsilon_p',  gid = 'AX1_epsilon_p')

# Graph niceties: ranges and legend
ax1.set_xlim(left=0, right=tsteps)
ax1.set_ylim(bottom=-2, top=5.0)

if exercise_number == 5:
ax1.set_title('Exercise 5'.format('default'), color='C0')
# Distribution plots
ax1.plot(range(TRIALS+1), Sigma,     'b', label='Sigma',      gid = 'AX1_Sigma')
# Graph niceties: ranges and legend
ax1.set_xlim(left=0, right=TRIALS)
ax1.set_ylim(bottom=0, top=3)

ax1.legend()
plt.show()
```