The Great and the Good

Why Do the Rich Have a Different Moral Calculus?


Albert Loeb


Albert Loeb, father of Richard

The traditional system of justice rests on the assumption that individuals generally all have the same ability to choose courses of action, and hence that they can all be equally blamed when those courses of action are wrong.

But with a modern, Physicalist worldview, we recognise that our behaviour is dictated by circumstances beyond our choosing. To return to a previous example, lawyer Clarence Darrow appealed to the compassion of the judge to spare Richard Loeb the death penalty:

“What had this boy to do with it? He was not his own father; he was not his own mother; he was not his own grandparents. All of this was handed to him. He did not surround himself with governesses and wealth. He did not make himself and yet he is to be compelled to pay.”

Now, if this applies to blame then it applies equally to its opposite, praise.

If transgressors cannot be blamed (in the traditional, direct sense) for their deeds, then the successful cannot be praised for their achievements either.

Consider Richard Loeb’s father, Albert Loeb (1868-1924), as an example of a high achiever. After enjoying a good education, he set up a law practice in Chicago that quickly gained Sears, Roebuck & Company as a client, for whom he went on to work directly, eventually becoming vice president. He had reached the heights of social standing and was able to surround himself with wealth: a mansion in an affluent part of Chicago, and a Model Farm in Michigan with a schoolhouse for the workers’ children. And governesses for his own children.

Hyde Park Herald

The home of Albert Loeb and family, 5017 South Ellis Avenue, Kenwood, Chicago. (Barack Obama’s house is on the adjacent street, South Greenwood Avenue.)

As with other high-achievers, he was presumably proud of his achievements in life and felt that he had earned his rewards through his personal abilities, giving little credit to the fortunate circumstances in which he was born and grew up.

With a Physicalist worldview, it is not just that…

Some people are born on third base and go through life thinking they hit a triple

But it is also that:

‘If you can hit a triple, that automatically puts you on third base to start with!’



How the Rich Behave

Loeb’s Model Farm, Charlevoix MI

There has recently been much general media coverage on research about how the moral behaviour and reasoning of those at the top of the social tree differs from that of the rest of us. For example, from research by Paul Piff:

  • They are more likely to lie and cheat when gambling or negotiating.
  • They are more likely to endorse unethical behaviour in the workplace.
  • They exhibit reduced empathy, favouring ‘rational’ utilitarian choices (rather than more intuitive, emotional responses) such as being more likely to take resources from one person to benefit several others.

That last is from ‘trolleyology’ experiments. Another ‘method’ is to equate high-status cars with high-status drivers and observe behaviour. For example, drivers of high-status cars are more likely to cut other drivers up and not stop for pedestrians at crossings.

Piff et al: 'Higher social class predicts increased unethical behavior'

‘Mean machine’: Another BMW driver fails to stop for a pedestrian.

Elsewhere, I have defined morality as being about balancing the wants of oneself with those of others. Piff frames the behaviour of the rich in terms of such a balance:

‘the rich are way more likely to prioritize their own self-interests above the interests of other people.’

(He calls this ‘the asshole effect’!)

Kathleen Vohs is another high-profile researcher in this area. Experiments of hers concluded that just thinking about money decreases empathy, shifting the balance from others to oneself. But she believes this effect is a result of a lack of interest rather than malice. For ‘money-primed’ individuals:

“It’s not a bad analogy to think of them as a little autistic.”

In the relationship between affluence and selfishness, which is the cause and which is the effect? The cause can be one of:

  • The environment: Being rich makes you less empathetic, or
  • The agent: Being less empathetic makes you rich.

Others have questioned the quality of research like this – for its subjectivity and inadequate sample size. (Far worse is the case of Diederik Stapel, who faked the data for similar research papers.)

But even if the data is frail or faked, we are inclined to go along with the conclusions because either:

  1. they ring true with our own anecdotal experience (e.g. that BMW drivers tend to be inconsiderate of other road users) – the ‘science’ only confirms ‘what we already knew’, or
  2. we want them to be true.


Charitable Giving

Looking at donations to charity is another way of assessing how much people think of others. Crucially, for this there is a vast amount of data available to analyse, from tax returns. One study analysed donation data from 30% of U.S. tax returns, a huge set. This approach is not without its problems but it does overcome the issue of sample size. Ranking the largest 50 U.S. metropolitan areas by the percentage of people’s income given to charity, Salt Lake City was at the top, accompanied by the Bible Belt cities of the South East. The affluent Silicon Valley cities, San Francisco and San Jose, were very nearly at the bottom. Silicon Valley has long had a reputation for low levels of charitable donation. (It has also been associated with a high prevalence of the diagnosis of autism/Asperger’s syndrome.)

The story is similar in the UK: people in Scotland and the Midlands donate more generously (proportionately) than those in more affluent London and the South East.

Charitable giving as a function of income

Major factors that influence charitable generosity are

  • being married, and
  • regular attendance of religious services.

Religion is the factor that transforms the graph of percentage-giving-versus-income from one that declines with increasing income to a ‘U’ curve (see above). But it is only a relatively small proportion of the very wealthy that are doing the giving.

The use of charitable donations as an indicator of generosity is not straightforward – the relationship is obscured by including donations to political / ideological causes as well as traditional charitable ‘good causes’. But even after compensating for this, those who regularly attend religious services still donate more to secular ‘good causes’ than those who don’t. This can simply be attributed to the habit of being regularly reminded of others’ needs at those services. The relative meanness of those who do not attend regular religious services can be attributed to not being made consciously aware of others’ needs so frequently – ‘out of sight; out of mind’.

Other factors affecting charitable giving include:

  • Living in rural rather than urban areas. (Note: those in cities are generally better educated.)
  • Increasing age (ignoring the effect of bequests).
  • Living in mixed rather than ‘gated’ communities.

It would also appear that conservatives are more generous than liberals but there is no statistically significant difference between them per se; the high level of donations of conservatives can be accounted for by their higher religious attendance.


Affluent Societies

Taking what has been said above, an overall picture emerges. Compared with more ‘traditional’ societies, in modern Western societies:

  • People are more likely to be single. Relationships have less commitment.
  • There is less attendance of religious services: less social connectedness to those living in the vicinity. Less regular exposure to those less fortunate.
  • The majority of the population now live in an urban environment: day-to-day interactions with others are more likely to be anonymous rather than with those you know personally.
  • People are better educated: moral deliberation is done with a wider perspective than the local/immediate/emotional.
  • People are more individualistic: Occupations are more specialised and there is more leisure time to define oneself by.
  • People are more affluent: they have more material goods to ‘play’ with and use, with consequent reduced contact with others. Particularly relevant here is car ownership, isolating people when travelling between home and work.
  • People are more isolated from one another: they are likely to be living in ‘good’ or ‘bad’ neighbourhoods where people are more like themselves. Their interaction tends to be more with those of their own age. This is all particularly acute for ‘gated communities’.
  • There is less dependency and there is higher monetization: we are less dependent on other specific people, and their goodwill. If we want something, we can just buy it with the minimum of personal interaction and generally from one of a number of anonymous providers.

All these factors lead to reduced empathy towards people around us. This is an effect of the environment.

However, it must be emphasized that this is a local effect. Modern Western society supports a huge population, becoming a more homogeneous ‘global village’ whereas ‘traditional’ societies tend to be small and much less tolerant to outsiders.

On balance, a reduction in local empathy might not be a problem if society was quite uniformly affluent. But there are huge societal differences. The reduced empathy of the powerful leads to narcissism and insensitivity and works to the detriment of the weak.

As already said, morality is about balancing the wants of the individual against those of others within society.

  • A ‘traditional’ environment is likely to be physically harsh. This balancing must be skewed towards the wider needs of the group. The community needs religion to bind itself together. There must be strongly codified acceptable behaviours.
  • A modern, Western environment is physically benign and can support greater independence and the moral balancing can shift towards the individual.

This shift is most pronounced for the most affluent.


Entitlement and Narcissism

In extreme cases, the balance is completely shifted towards the self. Such people have:

  • An affluence which means that all ‘basic’ worldly needs are easily met: food, shelter, safety, belonging and self-respect.
  • A lack of empathy.
  • A ‘cold’ application of reason that directs action.


This leads to either:

  • A preparedness to sacrifice others (dispassionately) for a greater good, or
  • A complete lack of personal regard for others.


The former case of sacrificing others is one of ‘extreme Utilitarianism’ – a preparedness or a sense of entitlement to act. Moreover, it is an entitlement to act alone (based just on one’s own perceptions of reality). There is a gradual transition from personal morality to political morality here. A government department is entitled to take actions that impersonally sacrifice some people for others (buying drugs for one medical condition at the expense of drugs for another). A political leader, supported by the institution of government, is entitled to take actions that impersonally sacrifice some people for others (waging war). But when a group of insufficient size thinks it is entitled to impersonally sacrifice some people for others, it is terrorism.

(The problem with the classic ethical thought experiments is that these scenarios apply ordinarily to groups, not individuals.)

An example is the case of Anders Behring Breivik, responsible for the 2011 terrorism in Oslo and on Utøya. Before his killing spree, he released a 1500-page account of his worldview concerning the preservation of European culture against Islamisation. Although delusional (and homophobic and misogynistic and …), there is an intellectualized dimension to his cause, and a willingness to enforce significant sacrifices in order to further that cause (incarceration for himself but death for many others). Breivik would probably diagnose his motivations as part of his personal self-actualization. Psychiatrists on the other hand attributed his acts to narcissistic personality disorder (exacerbated by Asperger’s).

The latter case of having no regard for others is one of megalomania, for which there are plenty of examples throughout history. Its juvenile form is one of insufficient competence, such as with the case of Richard Loeb.


(This is the twentieth part of the ‘From Neural Is to Moral Ought’ series.)


My Brain Made Me Do It


Crime and Punishment

I previously looked at shame and guilt and the confusion between the two. One distinction was that guilt focussed on bad acts whereas shame focussed on the bad agents who caused those acts.

New Scientist


Also previously, morality has been defined as the balancing of the wants of the individual against those of others within society. (Note: The moral code can vary between individuals within that society.) It promotes ‘good’ behaviour and discourages ‘bad’ behaviour for mutual benefit. A culture can be nurtured within a society which promotes this through:

  • internal guilt – when private, and
  • external shame – when found out.

The justice system institutionalizes this cultivation. It promotes ‘right’ behaviour and discourages ‘wrong’ behaviour for mutual benefit. For practical reasons, it is necessarily a rule-based system, which only approximates to a society’s morals, but its blunt edges can be smoothed off by the expertise of its judges. (Note: the legal code can vary between individuals within that society.)

If the moral landscape is a surface over which height above sea level is indicative of how ‘good’ or ‘bad’ an action is in a specific location in place and time then the ‘legal landscape’ is like a canyon – there is clear separation between what is ‘right’ and what is ‘wrong’.

The justice system cultivates good behaviour through some combination of:

  • Retribution: transgressors morally deserve to be punished.
  • Institutional retribution: transgressors are punished by a third-party in order to prevent victims or others taking retribution, thus avoiding feuds and vigilantes.
  • Deterrence: transgressors should be punished in order to deter others from offending and them from re-offending.
  • Incapacitation: preventing transgressors from re-offending through incarceration/detention/imprisonment.
  • Exile: preventing transgressors from re-offending by casting them out of the community.
  • Rehabilitation: re-educating / re-habituating / re-integrating offenders back into the community so that they will not re-offend.
  • Restoration: reconciliation between the offender and their victims, to prevent recidivism.

It transforms an internal ‘guilt’ to a very external ‘guilty’. To plead guilty is to acknowledge the wrong-doing. To be found guilty need not involve guilt on the part of the transgressor.

Just as there is the shameful actor / guilty act distinction, the justice system makes the distinction between ‘mens rea’ (a guilty mind) and ‘actus reus’ (a guilty act).

A crime is committed when a guilty mind performs a guilty act, but not when just one of these occurs. In this double-example:

  • Mrs. A is planning to poison her husband at home tomorrow, once she has bought some rat poison. But her husband ends up not being at home the next day so doesn’t get poisoned. Mrs. A is not guilty of killing her husband.
  • On her return from buying the rat poison, she reverses onto the drive, half-thinking about her forthcoming crime. Not hearing her, Mr. A steps back from the garden onto the drive. She drives over him and he dies. Mrs. A is not guilty of killing her husband because she drove over him without intention, recklessness or negligence.

(Obviously, the reason Mr. A was not at home the next day to be poisoned was because he was in the morgue.)

The former is an example of mens rea without actus reus. The latter is an example of actus reus without mens rea. Neither makes Mrs. A guilty even though she had the intention to kill Mr. A and Mr. A was killed by means of Mrs. A.


  • In a moral system, wrongness can be due to a wrong act or a wrong actor (through guilt or shame respectively).
  • In the legal system, criminality requires both a wrong act and a wrong actor (‘actus reus’ and ‘mens rea’ respectively).
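The contrast between the two systems can be put as a small logical sketch. This is only an illustration of the simplification used here: the class and function names are hypothetical, and the two cases are the Mrs. A examples from above.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One episode, reduced to the two ingredients discussed in the text."""
    guilty_mind: bool  # mens rea (the 'wrong actor')
    guilty_act: bool   # actus reus (the 'wrong act')


def morally_wrong(e: Episode) -> bool:
    # Moral system (as simplified in the text): a wrong act OR a wrong actor.
    return e.guilty_mind or e.guilty_act


def legally_criminal(e: Episode) -> bool:
    # Legal system (as simplified in the text): a wrong act AND a wrong actor.
    return e.guilty_mind and e.guilty_act


# Mrs. A plans the poisoning but never carries it out.
poison_plan = Episode(guilty_mind=True, guilty_act=False)
# Mrs. A runs over Mr. A without intention, recklessness or negligence.
driveway_accident = Episode(guilty_mind=False, guilty_act=True)

for name, episode in [("poison plan", poison_plan),
                      ("driveway accident", driveway_accident)]:
    print(name, morally_wrong(episode), legally_criminal(episode))
```

Neither episode on its own counts as a crime under the conjunction, which is the point of the double-example.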

Note: As always here, this is a simplification:

  • Mens rea is sufficient for some crimes in the interest of public safety such as counter-terrorism.
  • Actus reus is sufficient for some crimes in the case of ‘strict liability’ that promotes public safety in areas such as food/employment standards.


Bad Brains Cause Bad Acts

Traditionally we are all held to be equally responsible before the law but it is increasingly apparent that this is not true. Our behaviour is affected by things beyond our control. Our responsibility is diminished by varying degrees. In some cases it is clear there is no mens rea: the person cannot stop themselves from committing an actus reus. That is, their mind cannot stop their body. They have an inability to control themselves rather than having bad intentions, negligence or recklessness. It would be unfair to hold them responsible for the change in behaviour. Their behaviour is determined outside their mind.

Consider these cases:

  • Phineas Gage is the classic, most celebrated case of a change to the brain causing a change in behaviour. While he was working on railroad construction in 1848, an explosion blew a tamping iron (rod) straight through his head, leaving a gaping hole in his brain. He miraculously survived but his personality was changed from that of a ‘responsible’ foreman beforehand to an irreverent, drunken brawler. Friends said he was “no longer Gage”. A generally-observable physical change to the brain had caused a generally-observable change in behaviour.
  • Charles Whitman personally fought his “unusual and irrational thoughts” until in 1966 he killed his wife and mother and went on a killing spree on a university campus. Beforehand he had written “After my death I wish that an autopsy would be performed on me to see if there is any physical disorder.” The autopsy revealed a brain tumour. A physical change to the brain (phenomenologically observed by Whitman but not apparent to others) is presumed to have caused a dramatic change in behaviour.
  • Similarly in 2000, a man suddenly developed inappropriate sexual behaviour and was convicted. The onset of paedophilia normally occurs at a young age but this man was 40. He complained about headaches and balance problems and was given an MRI scan as a result. The scan revealed a brain tumour which was subsequently removed. His sexual urges disappeared but returned in 2001. Another MRI scan revealed a regrowth of the tumour. Again, on removal of the tumour, his behaviour was corrected. Twice-over, a physical change to the brain (phenomenologically observed by himself and very apparent to others) was correlated with a dramatic change in behaviour. This was established through the use of new, non-invasive technology on a live subject.
  • In 2011, Trevor Hayes was convicted of armed robbery. But then a massive brain tumour was found to be the cause of his “aggressive and compulsive behaviour” and affected his ability to exercise self-control. The judge overturned the verdict saying
    • “no court would conclude that there is a significant risk to the public now the tumour has been removed”
    • “There is a direct link between the size of the tumour and his behaviour. The evidence appears to be clear.”
    • “It is such an unusual scenario”

But it is not clear that the Hayes case really is so unusual. Now that we have the scanning technology, perhaps everyone convicted of a serious crime should have their brain scanned?

Hayes didn’t ask to have a brain tumour which then caused his criminal behaviour. Therefore, it seems wrong to punish him. Fortunately for him, we live both in enlightened times and in times of sufficiently advanced scanning technology.

Possibly in the future, there will be even better technology and we will be able to detect more subtle physical abnormalities of the brain. Purely speculatively for example, a ‘connectometer’ (‘connectome-meter’) might be able to create a coarse connectome of the brain which can be compared against reference connectome maps, from which we can physically diagnose mental conditions that are more difficult to diagnose psychologically (example: schizophrenia). It would then seem wrong to punish people with such conditions, and we should pity those who are in prison now because we do not yet have that technology.

And the net can be spread wider. Where does this end?

It naturally leads to legal defences trying to use neuroscience to make a physical rather than psychological connection to the crime, arguing that the defendant is not guilty:

“My brain made me do it!”


Good Brains Cause Bad Acts

So, bad brains can cause bad acts. But good brains can also cause bad acts – if there is a bad environment. An example of a direct relationship is that between criminal behaviour and the exposure to lead in paints and fuel.

But an argument can be made for much wider application…

Clarence Darrow was the high-profile agnostic defence lawyer in the famous ‘science versus religion’ Scopes Monkey Trial of 1925 which challenged the ban on teaching evolution in schools. The year before, he had defended the notorious ‘Leopold and Loeb’ pair in another ‘trial of the century’. Inspired by Nietzsche’s concept of Übermensch acting above the law, the two rich-kid prodigies applied their superior intellects to committing a ‘perfect crime’ by murdering a 14-year-old neighbour. They failed. In court, Darrow’s task was to get the judge to incarcerate them rather than letting the jury hang them.

Talking of Richard Loeb in his closing speech, Darrow said:

“Nature is strong and she is pitiless. She works in her own mysterious way, and we are her victims. We have not much to do with it ourselves. Nature takes this job in hand, and we only play our parts. …

“What had this boy to do with it? He was not his own father; he was not his own mother; he was not his own grandparents. All of this was handed to him. He did not surround himself with governesses and wealth. He did not make himself and yet he is to be compelled to pay.”

Loeb had the best of brains in what was outwardly the best of environments but circumstances led him to commit the worst of crimes. Darrow’s response on behalf of Loeb was effectively:

 “My environment made me do it!”

Darrow succeeded. The judge sentenced Leopold and Loeb to Life Plus 99 Years.


Corpus Reus

The ‘self’ argument

“My brain made me do it!”

and the ‘not-self’ argument

 “My environment made me do it!”

take us to determinism. The act was determined outside of ‘my’ control, where the ‘my’ here refers to ‘mind’. But if the brain is the mind and the physical world determines what we do, whether it is the physics of our insides or the physics of our outsides, it makes no difference.

A ‘physicalist’ tries to explain everything ultimately in non-intentional (‘mechanical’, ‘physical’) terms. Whether things are specifically deterministic or not is actually not particularly important. What is important is that some phenomena are not ring-fenced as being beyond such a physical explanation. Attempts are made to explain ‘mind’ in physical terms.

All this is a problem for Cartesian dualists and ‘libertarians’:

 “If the world is deterministic then there is no free will and hence we do not have moral responsibility.”

Libertarians equate Free Will and Indeterminism. Others differentiate. Conventionally:

  1. Free Will and Indeterminism produces ‘Libertarianism’: events are not always causally determined. We are free to ‘make a difference’.
  2. Free Will and Determinism produces ‘Compatibilism’: the world is deterministic yet we still claim there is ‘free will’.
  3. No Free Will and Determinism produces ‘Hard Determinism’: Liberty is a practical consideration, and
  4. No Free Will and Indeterminism produces ‘Hard Incompatibilism’: determinism is a red herring. We cannot have free will either way.
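The four positions form a simple two-by-two grid, which can be written out as a lookup. This is a minimal sketch of the conventional labels listed above; the names `POSITIONS` and `classify` are hypothetical and only for illustration.

```python
# Keys are (free_will, determinism); values are the conventional labels
# from the numbered list in the text.
POSITIONS = {
    (True, False): "Libertarianism",
    (True, True): "Compatibilism",
    (False, True): "Hard Determinism",
    (False, False): "Hard Incompatibilism",
}


def classify(free_will: bool, determinism: bool) -> str:
    """Return the label for a stance on free will and determinism."""
    return POSITIONS[(free_will, determinism)]


# Every stance other than the libertarian one.
non_libertarian = [label for label in POSITIONS.values()
                   if label != "Libertarianism"]

print(classify(free_will=True, determinism=True))
print(non_libertarian)
```

Laying the grid out this way makes it plain that free will and determinism are treated as two independent questions, which is exactly what the libertarian equation of Free Will with Indeterminism denies.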

The legal system is intimately associated with Dualism and Libertarianism: ‘mens rea’ (a guilty mind) is the cause of ‘actus reus’ (a guilty act).

A physicalist could subscribe to any of the non-libertarian positions 2, 3 and 4 above:

  2. We still have ‘Free Will’ but ‘new-style’ Free Will is just a bit different from what we have previously understood ‘Free Will’ to be. We will still basically judge people as if they had ‘old-style’ Free Will.
  3. The moral/legal system will need to be modified but it will take time for the changes to happen.
  4. Whether the world is deterministic or indeterminate, the philosophical arguments around Free Will have little bearing on the practical considerations of the judicial system. Essentially, Free Will is irrelevant.

And a physicalist obviously would not subscribe to dualism. As I have said previously, the dualist concept of ‘free will’ does not translate across to physicalism. From a physicalist perspective, it just doesn’t make sense to say ‘there’s no such thing as Free Will’. Free Will’s physicalist equivalent is a combination of:

  • ‘Conscious Will’, the conscious feeling that an agent has caused something that they have willed when they see the corresponding action, and
  • ‘Freedom’, itself a combination of an ability to predict and yet be unpredictable oneself.

For physicalist ‘Conscious Will’:

  • The conscious feeling of having caused something can be related to ‘mens rea’, and
  • The corresponding action can be related to ‘actus reus’.

But it is not possible to separate ‘mind’ and matter. Ultimately, there is only what might be described as ‘corpus reus’ – the body is guilty. Mind/brain is embodied and cannot be considered separately from the body. ‘Moral responsibility’ lies within the entire person.

 “It was the body that is me that did it!”

This understanding of responsibility will not be the same thing as a libertarian would understand from the same word.



Moral and legal responsibility is typically associated with the ‘ability to control’ but responsibility is also associated with someone or something being the ‘primary cause’ and held ‘accountable’, without there being ‘control’.

Loeb’s interest in crime novels and habit of lying supposedly started as a reaction to the strict disciplinarian teaching approach of his governess, Emily Struthers. It is possible that if he had had a different governess then this first step towards murder would not have been taken.

This is the ‘butterfly effect’: could a butterfly flapping its wings in Brazil cause a tornado in Texas? Contrary to what is frequently said, the answer is ‘no’. It is true that:

  • a deterministic world with a butterfly in a precise point in space and time could result in a tornado somewhere else later on, whereas
  • the very same deterministic world with the exact same starting point except for the butterfly would not result in the tornado occurring.

But the point of the ‘butterfly effect’ is that it never is possible to recreate the same starting point with sufficient accuracy and so we will never know. It is only possible in computer simulations. And even in those computer simulations, we would not say that the butterfly is the ‘cause’ of the (virtual) tornado any more than a small difference somewhere else (such as the presence/absence of a leaf) would be. Countless other minor changes would also have led to a substantially different outcome, in time.
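The kind of computer simulation meant here can be sketched in a few lines. This is not a weather model: it swaps in the logistic map, a standard toy example of deterministic chaos, and the function names, starting value and size of the nudge are all assumptions chosen for illustration.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the deterministic rule x -> r * x * (1 - x) from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def divergence(x0, delta, steps):
    """Gap at each step between two runs whose starts differ by delta."""
    with_butterfly = logistic_trajectory(x0, steps)
    without_butterfly = logistic_trajectory(x0 + delta, steps)
    return [abs(a - b) for a, b in zip(with_butterfly, without_butterfly)]


# Two 'worlds' that are identical except for a nudge of one part in 10^10.
gaps = divergence(x0=0.2, delta=1e-10, steps=60)

print("gap after 5 steps: ", gaps[5])
print("largest gap seen:  ", max(gaps))
```

The two runs stay indistinguishable for the first few steps and then part company completely. Any other nudge of the same size would have done the same, which is why it makes little sense to single out the ‘butterfly’ as the cause.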

But what would have happened if Richard Loeb was substituted by someone else driving around the streets of Chicago with Nathan Leopold on the afternoon of 21 May 1924?

Based on our ability to predict consequences, substituting Loeb has a greater chance of changing the fate of Bobby Franks than substituting anyone else, with the possible exception of Nathan Leopold. The chance is far higher than if Struthers was substituted. The pair had the greatest effect on the bad consequences and hence we say that they are responsible. This is regardless of any moral capacity (mental capability) and of any existence of Free Will or otherwise. Those are irrelevant. If we want to prevent another occurrence of such an event, it is the pair that we first examine, and this is what we mean by being responsible for an act.



The traditional legal system rests on dualist foundations. It makes a distinction between intentional and unintentional actions and punishes freely-chosen intentions that are bad.

Centuries ago, all human behaviour was attributed to the mind. For example, epileptics were considered to be possessed by the devil and punished accordingly. Over time, we have slowly shifted towards the physicalist position that all behaviour is determined by matter and we no longer see responsibility in terms of choice and blame (there is the transition from Dualist ‘mens rea’ and ‘actus reus’ to Physicalist ‘corpus reus’).

Apportioning responsibility is then a consequentialist activity that is part of identifying how similar undesirable situations can be prevented in the future.  It is a risk-based approach.

We will still ‘punish’ epileptics when there is neither mens rea nor actus reus. For example, we will still ban them from driving. But this is no different from punishing others for circumstances beyond their control. For example, the old, the young and many disabled are also ‘banned’ from driving (we might question the use of the word ‘banned’ but the practical effect is the same). They could all protest:

  • ‘It’s not fair! I didn’t ask to be born with epilepsy.’
  • ‘It’s not fair! I didn’t ask to have poor eyesight / slow reaction times / a raised chance of having a heart attack in my advanced years.’
  • ‘It’s not fair! I didn’t ask to have poor impulse control in adolescence.’

There is ‘punishment’, but without the sense of guilt, shame or blame.

And then we also punish others for what we feel should be within their control but apparently isn’t, such as drunk driving. But they might respond:

  • ‘I didn’t ask to be born genetically predisposed to have poor impulse control.’

We balance the risks:

  1. A young driver has good coordination and fast reaction times but poor impulse control and a lack of experience.
  2. A mature driver with infrequent epileptic seizures may have high skill, good coordination, good impulse control but there is a significant risk of causing an accident as a result of having another seizure.
  3. An elderly driver may have high skill and very good impulse control but have deteriorating coordination, eyesight and reaction times. There may also be significant risk of loss of control (heart attack).
  4. A middle-aged drunk driver may be better than the above in all respects except in their poor impulse control and recklessness.

The drunk driver may be no more of a risk than a moderate case of one of the other risk categories. There needs to be an assessment of risks in all cases, considering practical preventative measures in a non-judgemental way.


Prevention and Deterrence

To prevent criminal activity, we can try to improve individuals – such as removing their brain tumours! But there is rather more scope in improving, over a long period of time, the environment in which individuals grow up.

But sanctioning transgressors is a major way of preventing crime. I am avoiding the word ‘punish’ here; it does not help. The sanctioning should not be for retribution but for deterrence.


Treatment not Punishment

The convicted mentally ill are treated in a ‘secure hospital’ rather than punished with a prison sentence. But with determinism, all the convicted are deemed to be ‘ill’ to some degree. All prisons become ‘secure hospitals’. We become more compassionate towards convicts. We move away from retributive justice. We are not deliberately trying to make life worse for them. Detention is purely a practical consideration – for the benefit of wider society (protection and deterrence) as well as that of the convicted individual (reform). But there are economic consequences. Incarceration is expensive and treatment is even more so. It seems absurd to provide an environment for convicts on the inside that is better than that for some non-criminal poor on the outside – this is difficult to justify. It would also create an incentive to commit crime – negating deterrence. It is better to spend money improving life for the worst-off outside. Then as standards of living improve on the outside, what is acceptable for the criminal inside will improve too. This is purely a practical issue.


Executive Responsibility

After corporate misdemeanours are exposed, executives cannot respond with:

 ‘It’s not fair! I didn’t know anything about it. How can you blame me?’

even if they truly did not know. Executives are not directly involved in particulars but they should still be expected to take responsibility and be held responsible because it is part of their job to ensure that those below them are acting appropriately. Despite there being no mens rea, they need to be sanctioned as a deterrent to other executives to motivate them to act appropriately.


Moral Luck

It is commonly felt that driving when intoxicated is significantly worse when it results in injury to others than when it does not – that reckless mens rea without actus reus is less blameworthy than reckless mens rea with actus reus. If we look at future risk, the recklessness of the driver is the same in both cases and so, according to the argument here, they should be sanctioned in the same way. The only real difference is whether the risk does or doesn’t pay off, i.e. luck – ‘moral luck’!


Moving Away from Mens Rea and Actus Reus

In both the examples above (Executive Responsibility and Moral Luck), there is a moving away from the requirements for both Mens Rea and Actus Reus to a risk-based ‘Corpus Reus’ approach that is:

  • less blameworthy: responsibility is about identifying where to look for preventative solutions and not about control and retribution.
  • more compassionate: we are more sympathetic towards criminals if we believe they have less than ideal control over events in a physical world.

(But we must recognise that it is also potentially dangerous in going too far in sanctioning.)


  • Mens Rea is associated with (idealized) rational decision-making
  • Actus Reus is associated with specific acts being good or bad.
  • Corpus Reus is associated with the embedded virtue of the individual. As such, it is consistent with virtue ethics.


Getting Rid of Blame

There is nothing remarkable here in the argument that we should abandon blame, punishment and retribution. It is an obvious consequence of moving away from a Dualist to a Physicalist justice system. For example, take these three Neuroscientist ‘heavyweight’ opinions:

Mike Gazzaniga:

‘with determinism there is not blame, and, with not blame, there should be no retribution and punishment’

David Eagleman:

‘Blameworthiness should be removed from the legal argot’.

Joshua Greene (he of the ‘From Neural Is to Moral Ought’ paper) and Jonathan Cohen:

‘We foresee, and recommend, a shift away from punishment aimed at retribution in favour of a more progressive, consequentialist approach to the criminal law’.


The Return of Blame

And yet, blame may still have a role to play in a purely practical consequentialist approach to justice. It may be ‘unfair’ to blame people for doing things they could not have avoided doing, but cultivating blame in a society will provide some deterrence and hence promote the self-regulation of people for best mutual well-being. It has the same role as ‘shame’ and ‘guilt’.

It seems that blame has been kicked out the front door of morality, only to be let back in through the back door of pragmatism.

The same is true of its opposite, praise.

And with this, we seem to end up with a ‘Hard’ position:

  • If there is Free Will, we blame agents for the bad actions they cause.
  • If there is no Free Will, we still blame agents for their bad actions.


  • If there is Free Will, we praise agents for the good actions they cause.
  • If there is no Free Will, we still praise agents for their good actions.

The issue of Free Will is irrelevant. As Greene and Cohen said:

‘For the law, neuroscience changes nothing and everything’.

(This is the nineteenth part of the ‘From Neural Is to Moral Ought’ series.)

Shallow Learning


The Cerebellum

When we think of the brain, we conjure up an image of the cerebral cortex, which is so large in humans that it wraps all around the top. We do not think of the Cerebellum (Latin for ‘little brain’), tucked underneath this wrinkly cortex at the back and itself having two halves of cortex: the ‘cerebellar cortex’.

Cerebrum and Cerebellum

The huge, glamorous Cerebrum is part of the ‘neo-mammalian’ forebrain and is what seems to provide us with the extra something that distinguishes us from other creatures. The Cerebellum is the poorer, more ancient cousin that is part of the more basic, ‘proto-reptilian’ hindbrain and a bit of a spare part. A human cannot survive with significant parts of their Cerebrum missing but a human can survive without their Cerebellum entirely – conscious, but with seriously affected motor control. But for normal development:

The neurons in the Cerebellum significantly outnumber those in the Cerebrum.

Surprisingly, the ratio of cerebellar to cerebral neurons is quite constant across a large range of creatures, at a value of about 3.6.

The huge increase in human cerebral neurons that we associate with cognition has been accompanied by a proportional increase in cerebellar neurons that are associated with smooth motor actions.

Reptile brain: Even in a reptile’s brain, the forebrain (cerebrum) is larger than the cerebellum in volume – but not in the number of neurons.


The Cerebellum and Artificial Neural Networks

The cerebellum is undoubtedly a simpler structure in that it has a much more regular structure. The cerebellar cortical sheet is folded up into regular grooves in contrast to the more familiar wrinkly cerebral cortex. This makes it more amenable to understanding – a better starting point both:

  • scientifically, as a way to understand the brain, and
  • in ‘bioinspired’ engineering ‘Artificial Neural Networks’, as a way to build more intelligent, powerful and efficient computers.

The engineering helps the scientific. Being able to build and then successfully run a physical simulation of a model of the cerebellum is vastly superior to conjecturing theories.


Deep Learning

Unfortunately, progress in simulated neural networks was disappointingly slow and it gave artificial neural networks a bad name. It proved very difficult to get them working for networks of more than 3 layers (stepping up from an artificial cerebellum to an artificial cerebral cortex), which was needed if they were to do anything useful. But small progress over many years yields results and this is now a key technology for Google / Siri speech recognition. A leader in this field is Geoffrey Hinton who coined the name for this sub-discipline: ‘Deep Learning’.

A central, recurring concept on my blogsite is the ‘hierarchy of predictors’, with frequent references to Karl Friston’s ‘variational free energy’ theory. Hinton’s deep learning engineering work and its very terminology are the foundation for Friston’s work. Hinton is a co-author and former colleague of Karl Friston at UCL.

Photo credit: Michael Tyka.

Deep Learning 1: Tree in field with clouds, as perhaps ‘seen’ by a Canon EOS 5Ds.

Credit: Google

Deep Learning 2: Tree in field with clouds, as perhaps ‘seen’ by some deep layer within your brain! Google’s deep learning network tries to relate features in the original image with those it has seen before. Clouds get associated with sheep-like creatures.

Credit: Google

Deep Learning 3: Canon EOS 5Ds, as perhaps ‘seen’ by some deep layer within your brain! Produced using DreamScope


Shallow Learning

Artificial Neural Networks are the poorer, more ancient, less glamorous cousin of ‘Deep Learning’ just as the cerebellum is the poorer, more ancient, less glamorous cousin of the cerebral cortex. They are examples of ‘shallow learning’ as it were.

To get to deep learning, we must first wade through shallow learning. A seminal starting place for this is James Albus’s paper “A Theory of Cerebellar Function” which is available at various places on the interweb as a scanned PDF such as here. Below, I provide a text (searchable) version (but with no guarantees about being completely error-free).


A Theory of Cerebellar Function


Mathematical Biosciences 10 (1971), 25-61



Copyright 1971 by American Elsevier Publishing Company, Inc.



Cybernetics and Subsystem Development Section

Data Techniques Branch

Goddard Space Flight Center

Greenbelt, Maryland

Communicated by Donald H. Perkel



A comprehensive theory of cerebellar function is presented, which ties together the known anatomy and physiology of the cerebellum into a pattern-recognition data processing system. The cerebellum is postulated to be functionally and structurally equivalent to a modification of the classical Perceptron pattern-classification device. It is suggested that the mossy fiber → granule cell → Golgi cell input network performs an expansion recoding that enhances the pattern-discrimination capacity and learning speed of the cerebellar Purkinje response cells.

Parallel fiber synapses of the dendritic spines of Purkinje cells, basket cells, and stellate cells are all postulated to be specifically variable in response to climbing fiber activity. It is argued that this variability is the mechanism of pattern storage. It is demonstrated that, in order for the learning process to be stable, pattern storage must be accomplished principally by weakening synaptic weights rather than by strengthening them.



A great body of facts has been known for many years concerning the general organization and structure of the cerebellum. The regularity and relative simplicity of the cerebellar cortex have fascinated anatomists since the earliest days of systematic neuro-anatomical observations. In just the past 7 or 8 years, however, the electron microscope and refined micro-neurophysiological techniques have revealed critical structural details that make possible comprehensive theories of cerebellar function. A great deal of the recent physiological data about the cerebellum come from an elegant series of experiments by Eccles and his co-workers. These data have been compiled, along with the pertinent anatomical data, in book form by Eccles et al. [5]. This book also sets forth one of the first reasonably detailed theories on the function of the cerebellum. Another theory, published in 1969 by Marr [11], in many ways extends and modifies the theory of Eccles et al.

The theory presented here was developed independently of the Marr theory but agrees with it at many points, at least in the early sections. This article, developed from a study of Perceptrons [15] and memory model cells [1], applies these results to the structure of the cerebellum as summarized by Eccles et al. [5]. The theory presented here extends the Marr theory and proposes several modifications based on principles of information theory. These extensions and modifications relate mainly to the role of inhibitory interneurons in the learning process, and to the detailed mechanism by which patterns are stored in the cerebellum.



To credit each piece of information presented in this section to its original source would be very tedious. Everything in this section is taken directly either from Eccles et al. [5] or Fox et al. [7]. Therefore a single reference is now made to these sources and to the extensive bibliographies that appear in them.

A. Mossy fibers

Mossy fibers constitute one of the two input fiber systems to the cerebellum. Input information conveyed to the cerebellum via mossy fibers is from many different areas. Some mossy fibers carry information from the vestibular system or the reticular formation, or from both. Others carry information that comes from the cerebral cortex via the pons. The mossy fiber system that has been most closely studied relays information from the various receptor organs in muscles, joints, and skin. Mossy fibers that arrive via the dorsal spinal cerebellar tract are specific as regards modality of the muscle receptor organ, from either muscle spindles or tendon organs, and have a restricted receptor field, usually from one muscle or a group of synergic muscles.

Mossy fibers from the ventral spinal cerebellar tract are almost exclusively restricted to Golgi tendon organ information but are more generalized as regards specific muscles than those from the dorsal spinal cerebellar tract. The ventral tract fibers seem to signal stages of muscle contraction and interaction between contraction and resistance to movement of a whole limb. Other mossy fibers carry information from skin pressure receptors and joint receptors. There are continuous spontaneous discharges on most mossy fibers, at rates between 10 and 30 per second, even when the muscles are completely relaxed.

Mossy fibers enter the cerebellum and arborize diffusely throughout the granular layer of the cortex. A single mossy fiber may send branches into two or more folia. These branches travel toward the top of the folia, giving off further branches into the granular layer of the sides of the folia, finally terminating in an arborisation at the top of the folia. Each branch of a mossy fiber terminates in a candelabrum-shaped arborisation containing synaptic sites called mossy rosettes. There is a minimum distance of 80-100µm between rosettes from a single mossy fiber. It is estimated that each branch of a mossy fiber entering the granular layer of the cerebellum produces from 20 to 50 or more rosettes. Thus a single mossy fiber may produce several hundred rosettes considering all its branches. The mossy rosettes are the site of excitatory synaptic contact with dendrites of the granule cells. The mossy fibers also send collaterals into the intra-cerebellar nuclei, where they make excitatory synaptic contact with nuclear cells.


B. Granule Cells

The granule cells are the most numerous cells in the brain. It is estimated that in humans there are $3 \times 10^{10}$ granule cells in the cerebellum alone. Granule cells possess from one to seven dendrites, the average being four. These dendrites are from 10 to 30µm long and terminate with a characteristic claw-shaped ramification in the mossy rosettes. In view of the spacing between rosettes on a mossy fiber, it is highly unlikely that a granule cell will contact two rosettes from the same mossy fiber. Thus an average granule cell is excited by about four different mossy fibers. Since approximately 20 granule cell dendrites contact each rosette, this means that there are about five times as many granule cells as mossy rosettes, and at least 100-250 times as many granule cells as mossy fibers. Since a mossy fiber enters several folia, there may even be four or five times this many granule cells per mossy fiber.
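The ratios in this paragraph follow from simple arithmetic on the quoted figures. The short Python sketch below just reproduces that arithmetic; the variable names are my own, and it assumes (as the text does) that the four dendrites of a granule cell each go to a different rosette.

```python
# Figures quoted in the paragraph above.
dendrites_per_granule_cell = 4      # average number of dendrites per granule cell
dendrites_per_rosette = 20          # granule cell dendrites contacting one rosette
rosettes_per_branch = (20, 50)      # low and high estimates for one mossy fiber branch

# Each rosette hosts 20 dendrites and each granule cell contributes 4 dendrites
# in total, so there are 20 / 4 granule cells for every rosette.
granule_cells_per_rosette = dendrites_per_rosette / dendrites_per_granule_cell

# Scale up by the number of rosettes on one branch of a mossy fiber.
granule_cells_per_fiber = tuple(
    rosettes * granule_cells_per_rosette for rosettes in rosettes_per_branch
)

print(granule_cells_per_rosette)   # granule cells per mossy rosette
print(granule_cells_per_fiber)     # (low, high) granule cells per mossy fiber branch
```

This gives five granule cells per rosette and 100 to 250 granule cells per mossy fiber branch, matching the figures in the text.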

Each granule cell gives off an axon, which rises towards the surface of the cortex. When this axon reaches the molecular layer, it makes a T-shaped branch and runs longitudinally along the length of the folia for about 1.5mm in each direction. These fibers are densely packed and are only about 0.2-0.3µm in diameter. The parallel fibers make excitatory synaptic contact with Purkinje cells, basket cells, stellate cells, and Golgi cells.


C. Golgi Cells

Golgi cells have a wide dendritic spread, which is approximately cylindrical in shape and about 600µm in diameter (see Fig. 1). This dendritic tree reaches up into the molecular layer, where it is excited by the parallel fibers, and down into the granular layer, where it is excited by the mossy fibers. The Golgi axon branches extensively and inhibits about 100,000 granule cells located immediately beneath its dendritic tree. Every granule cell is inhibited by at least one Golgi cell. The Golgi axons terminate on the mossy rosettes, inhibiting granule cells at this point. Fox et al. [7] state that the axon arborisations of neighboring Golgi cells overlap extensively, so that two or more Golgi cells frequently inhibit a single granule cell. Note the overlapping fields shown in Fig. 3. This overlapping is a point of disagreement between Eccles et al. [5] and Fox et al. [7]. It appears, however, that Golgi cells must overlap, considering their size and that there are approximately 10% as many Golgi cells as Purkinje cells.

James Albus: A Theory of Cerebellar Function

Fig. 1

FIG. 1. A typical Golgi cell. Its arborisations extend throughout an approximately cylindrical volume 600µm in diameter.

The size of the dendritic spread of the Golgi cell as shown in Figs. 1 and 3 is a point of some uncertainty. Eccles et al. [5, page 205 and Fig. 116] state that the spread of the Golgi dendritic tree is about three times that of a Purkinje cell (i.e., 600-750µm). However, drawings by Cajal [2] and Jakob [10], and statements and drawings elsewhere in Eccles et al. [5, page 60 and Fig. 1] seem to indicate the dendritic spread for Golgi cells to be only slightly larger than that of Purkinje cells (i.e.,  250-300µm). However, even with a dendritic spread of only 300µm, the Golgi dendritic fields would still have significant overlap, as can be shown by drawing 300µm diameter circles around the Golgi cell bodies in Fig. 3.


D. Purkinje Cells

The Purkinje cell has a large and very dense dendritic tree. The dendritic tree of the Purkinje cell is shaped like a flat fan and measures on the average about 250µm across, about 250µm high, and only about 6µm thick, as shown in Fig. 2. The flat face of this fan is positioned perpendicular to the parallel fibers that course through the branches of the tree. It is estimated that around 200,000 parallel fibers pierce the dendritic tree of each Purkinje cell, and that in passing virtually every parallel fiber makes a single synaptic contact with the dendrites of the Purkinje cell. At the site of a parallel fiber Purkinje dendritic synapse, the parallel fiber enlarges to about 1µm in diameter and is filled with synaptic vesicles. A spine grows out of the Purkinje dendrite and is enclosed by an invagination of the enlarged part of the parallel fiber.

Fig. 2

FIG. 2. A typical Purkinje cell. Its dendritic tree is restricted to a volume approximately 250µm x 250µm x 6µm.

A unique characteristic of the Purkinje cell is that there is virtually no intermingling of its dendritic tree with that of other cells. The Purkinje cell bodies are beet shaped and about 35µm in diameter. They are scattered in a single layer over the cortex at intervals of about 50µm along the direction of the parallel fibers, and about 50-100µm in the transverse direction. Thus the fan-shaped dendritic trees overlap in the transverse direction but are offset in the longitudinal direction sufficiently so as to not intermingle. Figure 3 shows a top view looking down on the packed Purkinje dendritic trees. The trees are about 6µm thick and are separated by about 2-4µm. Thus a parallel fiber encounters a different Purkinje dendritic tree every 8-10µm. Since a parallel fiber synapses with virtually every Purkinje dendritic tree it passes, a 3mm parallel fiber contacts about 300 Purkinje cells.
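The figure of about 300 Purkinje cells per parallel fiber can be checked with a few lines of Python. This is only a restatement of the numbers quoted above (the variable names are mine), assuming the trees are evenly stacked along the fiber.

```python
fiber_length_um = 3000        # a parallel fiber runs about 1.5mm in each direction
tree_thickness_um = 6         # thickness of one Purkinje dendritic tree
gap_um = (2, 4)               # low and high estimates of the gap between trees

# Distance from the start of one tree to the start of the next.
pitch_um = tuple(tree_thickness_um + gap for gap in gap_um)

# Number of dendritic trees that a single parallel fiber passes through.
trees_crossed = tuple(fiber_length_um / pitch for pitch in pitch_um)

print(pitch_um)        # spacing between successive trees, in µm
print(trees_crossed)   # trees crossed for the narrow and wide gap estimates
```

The wider spacing (10µm) gives 300 contacts and the narrower spacing (8µm) gives 375, so ‘about 300’ is the conservative end of the range.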

Fig. 3


FIG. 3. View of cerebellar cortex looking down on top of Purkinje dendritic trees. Purkinje cells are shown here spaced approximately every 50µm in the longitudinal direction and every 60µm in the transverse direction. They are staggered so that the dendritic trees do not intermingle. Four Golgi cells are shown with the outline of their area of arborisation traced. There is one Golgi cell to every nine Purkinje cells. Note the extensive overlapping of Golgi arborisation. Each point on the cortex is subject to influence by about nine different Golgi cells.
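As a rough plausibility check on the overlap described in this caption, the sketch below estimates how many Golgi arborisations cover an average point of cortex. It is my own back-of-envelope calculation, not one from the paper, and it assumes circular arborisations and Golgi cells spread uniformly at one per nine Purkinje cells.

```python
import math

# Spacings taken from the caption of Fig. 3.
purkinje_spacing_um = (50, 60)   # longitudinal and transverse spacing
purkinje_per_golgi = 9           # one Golgi cell to every nine Purkinje cells

# Area of cortex "owned" by one Purkinje cell and by one Golgi cell, in µm².
area_per_purkinje = purkinje_spacing_um[0] * purkinje_spacing_um[1]
area_per_golgi = purkinje_per_golgi * area_per_purkinje


def golgi_coverage(arbor_diameter_um):
    """Average number of Golgi arborisations covering a point of cortex."""
    arbor_area = math.pi * (arbor_diameter_um / 2) ** 2
    return arbor_area / area_per_golgi


coverage_600 = golgi_coverage(600)   # arborisation diameter used in Figs. 1 and 3
coverage_300 = golgi_coverage(300)   # the smaller estimate discussed in the text

print(round(coverage_600, 1), round(coverage_300, 1))
```

With a 600µm arborisation the estimate comes out at roughly ten overlapping Golgi cells per point, the same order as the ‘about nine’ quoted in the caption. With the smaller 300µm estimate the figure is still above two, so the fields would still overlap.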


Purkinje cell axons constitute the only output from the cerebellar cortex. These axons make inhibitory synapses with the cells of the cerebellar nuclei and of the Deiters nucleus. In addition, Purkinje axons send recurrent collaterals to other Purkinje cells, basket cells, stellate cells, and Golgi cells.


E. Basket Cells

The basket cells also have flat fan-shaped dendritic trees, which extend upward in the 2-4 µm  spaces between Purkinje dendritic layers. Basket dendritic trees are much less dense than those of Purkinje cells, but cover roughly the same area. Basket dendrites also receive excitatory synaptic contacts from parallel fibers via dendritic spines. Basket cell dendritic spines are much sparser, more irregularly spaced, longer, and thinner than Purkinje spines. They are very often hook shaped. Basket cell bodies, about 20 µm in diameter, are located in the lower third of the molecular layer. Basket cells are 15%-20% more numerous than Purkinje cells.

Basket cells send out axons in the transverse direction, perpendicular to the parallel fiber pathways. These axons branch and send descending collaterals, which make strong inhibitory synapses around the preaxon portion of the Purkinje cells. They also send ascending collaterals into the Purkinje cell dendritic trees, where they form further inhibitory synapses. Each basket cell inhibits about 50 Purkinje cells over an elliptical area about 1000µm x 300µm. The basket cells do not inhibit the Purkinje cell immediately adjacent, but begin their inhibitory activity one or two cells away, and inhibit Purkinje cells out to about 1mm away in the transverse direction. Thus any parallel fiber that excites a Purkinje cell is not likely to also inhibit the same Purkinje cell via a basket cell.


F. Stellate Cells

Stellate cells have dendritic arborisation very similar to that of basket cells, although somewhat smaller. On the basis of axon distribution, there are two types of stellate cells. Stellate “a” cells send axons into Purkinje dendritic trees immediately adjacent, whereas stellate “b” cells send their axons transversely, making inhibitory contact with Purkinje dendrites in an area similar in size, shape, and relative position to that of basket cells. Functionally, the main distinction between basket cells and stellate “b” cells seems to be that stellate “b” cells are located higher in the molecular layer and send few, if any, axon collaterals to the Purkinje pre-axon, or “basket” region. However, there are many intermediate forms and the cell types seem to change progressively from basket cells in the upper granular layer to stellate “b” cells in the mid and upper molecular layer. Thus in this article the basket cells and stellate “b” cells will be assumed to perform roughly the same functions, which include receiving excitatory inputs from parallel fibers and transmitting inhibitory signals to Purkinje cells.


G. Climbing fibers

A second type of input fibers, the climbing fibers, also enters the cerebellum. These fibers are distinguished by the fact that each Purkinje cell receives a single climbing fiber in a 1:1 fashion. They are called climbing fibers because they contact the Purkinje cell at the base of its dendritic tree and climb up the trunk of the tree, making repeated strong excitatory synaptic contacts. A single spike on a climbing fiber can evoke a complex burst of Purkinje activity. The exact nature of this activity is not entirely clear. Observations by Thach [17] seem to indicate that this complex burst of activity consists of a single Purkinje axon spike followed by several milliseconds of spike-like activity propagating throughout the Purkinje dendritic tree. This dendritic activity is accompanied by intense cell depolarization and a pause in spontaneous Purkinje axon spike activity for 15-30ms. This depolarization and pause was termed the inactivation response by Granit and Phillips [8].

The climbing fibers are usually thought to originate primarily in the inferior olivary nucleus and make a precise point-to-point mapping from the olivary nucleus to the cerebellar cortex. There is, however, some indication from cell counts done in the olivary nucleus [6], that either each climbing fiber branches about 15 times before reaching the cerebellum, or the majority of climbing fibers come from other sources outside the olivary nucleus.

Information carried by climbing fibers comes from a great variety of areas. The inferior olive receives afferents from proprioceptive end organs as well as all lobes of the cerebral cortex. The inferior olive also receives a strong projection from the red nucleus and the periaqueductal gray via the central tegmental tract.

The response of climbing fibers to peripheral stimulation is quite distinct from that of mossy fibers. A climbing fiber will typically respond to pinching the skin and deeper tissue anywhere within a receptive field, which may encompass an entire limb [17]. In monkeys performing a motor task it has been observed that climbing fiber spikes are correlated with quick movements made in response to external stimuli, but not with self-paced movements, such as rapidly alternating wrist motions [18, 19]. This evidence would seem to indicate that information carried on climbing fibers is the product of a great deal of integration through higher centers.

In addition to the precise one-for-one climbing fiber contact with Purkinje cells, climbing fibers also put out three sets of collaterals; that is,

(1) a climbing fiber sends collaterals to synapse on basket cells and stellate cells in the immediate vicinity of the Purkinje cell that it contacts;

(2) a climbing fiber sends collaterals to one or more Golgi cells located within an elliptical region about 1000µm x 300µm  centered on the Purkinje cell that it contacts;

(3) a climbing fiber sends collaterals to nuclear cells in the cerebellar nuclei and in the Deiters nucleus.


H. Nuclear Cells

The nerve cells of the cerebellar nuclei and Deiters nucleus are of at least two types. One type is large multipolar neurons, with relatively simple and irregular dendritic arborisation. The axons from cells of the cerebellar nuclei go to the nucleus ventralis lateralis of the thalamus, to the red nucleus, to the pontomedullary reticular formation, and to the vestibular nuclei. Cells from the Deiters nucleus join the vestibulospinal tract. Thus some of these efferents send information toward the sensorimotor cortex, others toward the spinal motor neurons. The second type of nuclear neuron is smaller, with short axons, possibly a Golgi type II cell.

The cerebellar nuclei and Deiters nucleus cells receive excitatory inputs from climbing fiber collaterals and mossy fiber collaterals. They receive inhibitory inputs from Purkinje axons.




A. The Classical Perceptron

Since the neurophysiologist is usually not well versed in the field of pattern-recognition theory, a few short tutorial paragraphs concerning the pattern-recognition device known as the Perceptron are included to form a basis for arguments relating the cerebellum to the Perceptron. Again, rather than crediting all the many contributors to the theory of pattern-recognition and linear threshold devices, we refer the reader to the review books by Nilsson [14] and Minsky and Papert [13] for extensive references to the literature. These books contain mathematical proofs for most of the informal assertions made in the following paragraphs.

The Perceptron developed by Rosenblatt [15] was inspired in large measure by known or presumed properties of nerve cells. In particular, a Perceptron possesses cells with adjustable-strength synaptic inputs of competing excitatory and inhibitory influences that are summed and compared against a threshold. If the threshold is exceeded,  the cell fires. If not, the cell does not fire. The original Perceptron was conceived as a model for the eye (see Fig. 4).

Fig. 4


FIG. 4. Classical Perceptron. Each sensory cell receives stimulus either +1 or 0. This excitation is passed on to the association cells with either a +1 or -1 multiplying factor. If the input to an association cell exceeds 0, the cell fires and outputs a 1; if not, it outputs 0. This association cell layer output is passed on to response cells through weights $W_{i,j}$, which can take any value, positive or negative. Each response cell sums its total input and if it exceeds a threshold, the response cell $R_j$ fires, outputting a 1; if not, it outputs 0. Sensory input patterns are in class 1 for response cell $R_j$ if they cause the response cell to fire, in class 0 if they do not. By suitable adjustment of the weights $W_{i,j}$, various classifications can be made on a set of input patterns.

Patterns to be recognized, or classified, are presented to a retina, or layer of sensory cells. Connections from the sensory cells to a layer of associative cells perform certain (perhaps random, perhaps feature-detecting) transformations on the sensory pattern. The associative cells then act on a response cell through synapses, or weights, of various strengths. The firing, or failure to fire, of the response cell performs a classification or recognition on the set of input patterns presented to the retina.


B. Perceptron Learning

The Perceptron shows a rudimentary ability to learn. If a Perceptron is given a set of input patterns and is told which patterns belong in class 1 and which in class 0, the Perceptron, by adjusting its weights, will gradually make fewer and fewer wrong classifications and (under certain rather restrictive conditions) eventually will classify or recognize every pattern in the set correctly. The weights usually are adjusted according to an algorithm similar to the following.

  1. If a pattern is incorrectly classified in class 0 when it should be in class 1, increase all the weights coming from association cells that are active.
  2. If a pattern is incorrectly classified in class 1 when it should be in class 0, decrease all the weights coming from association cells that are active.
  3. If a pattern is correctly classified, do not change any weights.
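The three rules above can be sketched in a few lines of Python. This is a minimal illustration for a single response cell, not code from the paper: the fixed threshold, the step size and the tiny training set are all assumptions chosen so that the example is small enough to follow by hand.

```python
THRESHOLD = 0.5   # assumed fixed firing threshold of the response cell
STEP = 1.0        # assumed size of each weight adjustment


def classify(weights, pattern):
    """Response cell: output 1 if the summed weights of the active
    association cells exceed the threshold, otherwise output 0."""
    total = sum(w * a for w, a in zip(weights, pattern))
    return 1 if total > THRESHOLD else 0


def train(training_set, n_cells, max_epochs=100):
    """Apply the three adjustment rules until a full pass makes no errors."""
    weights = [0.0] * n_cells
    for _ in range(max_epochs):
        errors = 0
        for pattern, target in training_set:
            if classify(weights, pattern) == target:
                continue                      # rule 3: correct, change nothing
            errors += 1
            # Rule 1 (should have been class 1): increase active weights.
            # Rule 2 (should have been class 0): decrease active weights.
            delta = STEP if target == 1 else -STEP
            for i, active in enumerate(pattern):
                if active:
                    weights[i] += delta
        if errors == 0:
            break                             # learning is complete
    return weights


# Each entry: (association cell activity, desired class). Hypothetical data.
training_set = [
    ((1, 0, 0), 1),
    ((1, 1, 0), 0),
    ((0, 1, 1), 1),
    ((0, 0, 1), 1),
]

weights = train(training_set, n_cells=3)
print(weights)
print([classify(weights, pattern) for pattern, _ in training_set])
```

Only the weights from active association cells are ever touched, and nothing changes once every pattern is classified correctly, which is where the adjustment process terminates.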

Four features of this algorithm are common to all Perceptron training algorithms, and are essential to successful pattern recognition by any Perceptron-type device:

  • Certain selected weights are to be increased, others decreased.
  • The average total amount of increase equals the total amount of decrease.
  • The desired classification, together with the pattern being classified, governs the selection of which weights are varied and in which direction.
  • The adjustment process terminates when learning is complete.

The Perceptron works quite well on many simple pattern sets, and if the sensory-association connections are judiciously chosen, it even works on some rather complex pattern sets. For patterns of the complexity likely to occur in the nervous system, however, the simple Perceptron appears to be hopelessly inadequate. As the complexity of the input pattern increases, the probability that a given Perceptron can recognize it goes rapidly to zero. Alternatively stated, the complexity of a Perceptron required to produce any arbitrary classification, or dichotomy, on a set of patterns increases exponentially as the number of patterns in the set. Thus the simple Perceptron, in spite of its tantalizing properties, is not practical as a realistic brain model without significant modification.

C. The Binary Decoder Perceptron

This lack of power of the conventional Perceptron can be overcome by replacing the sensory-association layer connections with a binary decoder, as shown in Fig. 5. It is then possible to trivially construct a Perceptron that will produce any arbitrary pattern classification. A binary decoder can be considered to be a recoding scheme that recodes a binary word of N bits into a binary word of $2^N$ bits. This recoding introduces great redundancy into the resulting code. Each association cell pattern is restricted to a unique association cell in the 1 condition, all other association cells in the 0 condition. However, a binary decoder Perceptron is seldom seriously considered as a brain model for several reasons. First, the binary decoder requires such specific wiring connections that it is entirely too artificial to be imbedded in the rather random-looking structure of the brain. Second, the number of association cells increases exponentially as the number of inputs. Thus N input fibers require $2^N$ association cells. Simple arithmetic thus eliminates the binary decoder Perceptron as a brain model.

James Albus: A Theory of Cerebellar Function

Fig. 5

FIG. 5. Binary decoder Perceptron. Each association cell firing uniquely corresponds to one of the possible 2^N input patterns. This type of Perceptron can perform any desired classification of input patterns. It has, however, no capacity for generalizing.
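The binary decoder argument can be made concrete with a short sketch. The function names, the choice of N = 3 and the odd-parity dichotomy are illustrative assumptions, not part of the original text. The sketch shows that one weight per association cell is enough to realise any classification, at the price of 2^N cells.

```python
from itertools import product

def binary_decoder(bits):
    """Map an N-bit input word to a one-hot word of 2**N association cells."""
    index = 0
    for b in bits:
        index = (index << 1) | b      # read the input word as a binary number
    cells = [0] * (2 ** len(bits))
    cells[index] = 1                  # exactly one association cell is in the 1 condition
    return cells

def response(cells, weights, threshold=0.5):
    """Perceptron response cell: fire when the weighted sum exceeds threshold."""
    return int(sum(c * w for c, w in zip(cells, weights)) > threshold)

N = 3                                 # illustrative number of input fibers
patterns = list(product([0, 1], repeat=N))
# An arbitrary dichotomy (odd parity), chosen only for illustration.
desired = {p: sum(p) % 2 for p in patterns}

# One weight per association cell, set directly to the desired class.
weights = [0] * (2 ** N)
for p in patterns:
    weights[binary_decoder(p).index(1)] = desired[p]

assert all(response(binary_decoder(p), weights) == desired[p] for p in patterns)
print(len(weights), "association cells for", N, "input fibers")
```

Because every input pattern has its own private cell, nothing learned for one pattern carries over to any other, which is the lack of generalization noted in the caption.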


D. The Expansion Recoder Perceptron

However, there does exist a middle ground between a simple Perceptron and a binary decoder Perceptron. Assume a decoder, or rather a recoder, that codes N input fibers onto 100N association cells, as shown in Fig. 6. Such a recoding scheme provides such redundancy that severe restrictions can be applied to the 100N association cells without loss of information capacity. For example, it is possible to require that of the 100N association cells, only 1% (or less) of them are allowed to be active for any input pattern. That such a recoding is possible without loss of information capacity is easily proven, for C(100N, N) > 2^N, where C(100N, N) is the number of ways of choosing N active cells out of 100N. That such a recoding increases the pattern-recognition capabilities of a Perceptron is certain, since the dimensions of the decision hyperspace have been expanded 100 times. The amount of this increase under conditions likely to exist in the nervous system is not easy to determine, but it may be enormous. It can be shown that C(100N, N) > 100^N. Thus 2^N possible input patterns can be mapped onto 100^N possible association cell patterns. If this is done randomly, the association cell patterns are likely to be highly dissimilar and thus easily recognizable. The ratio of 100^N/2^N = 50^N rapidly increases as N becomes large.


Fig. 6


FIG. 6. N → 100N Expansion recoder Perceptron. The association cell firing is restricted such that only 1% of the association cells are allowed to fire for any input pattern. This Perceptron has a large capacity and fast learning rate, yet it maintains the number of association cells within limits reasonable for the nervous system.
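The counting behind the expansion recoder can be checked numerically. The sketch below assumes that exactly 1% of the 100N association cells (that is, N of them) are active, so the number of available patterns is the binomial coefficient C(100N, N). The function name and the sample values of N are illustrative.

```python
import math

def pattern_counts(n_inputs, expansion=100):
    """Return (2**N, C(expansion*N, N), expansion**N) for N input fibers."""
    n_cells = expansion * n_inputs
    n_active = n_cells // expansion          # 1% of the association cells, i.e. N of them
    input_patterns = 2 ** n_inputs
    sparse_patterns = math.comb(n_cells, n_active)
    return input_patterns, sparse_patterns, expansion ** n_inputs

for n in (1, 2, 5, 10, 20):
    inputs, sparse, bound = pattern_counts(n)
    # The sparse code always offers at least 100**N patterns, far more than 2**N.
    assert sparse >= bound > inputs
    print(f"N={n:2d}  log10(2^N)={math.log10(inputs):6.2f}  "
          f"log10(100^N)={math.log10(bound):6.2f}  "
          f"log10(C(100N,N))={math.log10(sparse):6.2f}")
```

Logarithms are printed because the counts themselves grow too quickly to compare by eye.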


The restriction that only 1% of the association cells are allowed to be active for any input pattern means that any association cell participates in only 1% of all classifications. Thus its weight needs adjusting very seldom and there is a fairly good probability that its first adjustment is at least in the proper direction. This leads to rapid learning.



A. Pattern Recoding in the Cerebellum

The granular layer of the cerebellum takes in information on mossy fibers and puts out information on parallel fibers. There are from 100 to 600 times as many parallel fibers as mossy fibers. Thus the granule cells can be said to be association cells that recode information from N inputs to at least 100N outputs. What can be said about the nature of this recoding? It was already noted that no granule cell receives more than one excitatory input from any one mossy fiber. It was also noted that the mossy rosettes from a single mossy fiber were widely distributed over several folia with a rather uniform random distribution. Thus, by the central limit theorem of probability, the distribution of granule cells with any given number of excitatory inputs will approach a Gaussian distribution with σ equal to the extent of the mossy rosette distribution. Since the mossy rosette distribution of each mossy fiber extends over several folia, the Gaussian curve will be flat, for all practical purposes, over regions large compared with a single folium, even more so compared with any individual cell.

Since virtually no granule cells are excited at two sites by the same mossy fiber, the relative abundance of granule cells simultaneously excited by n active mossy fibers will be proportional to 1/n.

Thus at any instant the surface of the cerebellum should be dotted nearly uniformly randomly with granule cells whose input consists of one mossy fiber excitation. The surface of the cerebellum should also be dotted randomly, but less densely, with granule cells excited by two mossy fibers; and so on, progressively less densely with granule cells excited by three, and four, and five, up to seven mossy fibers. The total density of this dotting depends on the percentage of mossy fibers active.

The particular granule cells that actually fire as a result of various levels of mossy fiber excitation depend on the threshold levels of the granule cells. Only granule cells with enough excitatory inputs to exceed threshold will fire. This threshold for granule cells is regulated by Golgi cell activity.

The output of the granule cells is sampled by the Golgi cells via synapses with parallel fibers. This sampling is over an area approximately 250-650µm in diameter. Each Golgi cell feeds back inhibitory influences to about 100,000 granule cells. Neighbouring Golgi cells overlap extensively in their dendritic fields and in their axon arborisation. This very broad general feedback system suggests the function of an automatic gain control. Thus it is argued that the Golgi cells serve to maintain granule cell, and hence parallel fiber, activity fixed at a relatively constant rate. If few parallel fibers are active, Golgi inhibitory feedback decreases, allowing granule cells with lower numbers of excitatory inputs to fire. If many parallel fibers become active, Golgi feedback increases, allowing only those few granule cells with many active mossy inputs to fire.

The Golgi cells also have input from mossy fibers directly, a so-called feed-forward inhibition. This input tends to raise granule cell threshold levels when mossy fiber activity is large, and decrease granule thresholds when mossy fiber activity is small. This effect is also such as to stabilize the amount of parallel fiber activity.

To obtain a quantitative feel for what is occurring via these two types of Golgi cell inputs, consider Fig. 7. From the figure we can write

P = (M - I + Sp)Gr                 (1)

I = (KM + P)Go                     (2)

where

  • P is the expected value of the spike rate for a parallel fiber,
  • M, the expected value of the spike rate for a mossy fiber,
  • I, the expected value of the spike rate for a Golgi cell,
  • Gr, the average transfer gain of granule cells,
  • Go, the average transfer gain of Golgi cells,
  • K, the relative strength of mossy fiber input on Golgi cells to that of parallel fiber input, and
  • Sp, the expected value of the spontaneous rate for a granule cell.

Combining (1) and (2) and differentiating with respect to M gives

dP/dM = Gr(1-KGo)/(1+GrGo)                                            (3)

From Eq. (3) it can be seen that by proper adjustment of parameters (i.e., KGo ≈ 1) it is possible to make P, the expected value of the spike rate for a parallel fiber, very nearly constant despite variations in mossy fiber input rate M.

It might not be unreasonable to assume values for Go and Gr as follows.

Gr = (1 granule spike)/(1 mossy spike) x (divergence of 100) = 100

Go = (1 Golgi spike)/(1000 parallel spikes) x (divergence of 100,000) = 100

These values substituted in (3) give

dP/dM ≈ (1-100K)/100                                             (4)

Thus if K ≈ 0.01 (i.e., 1 Golgi spike/10^5 mossy fiber spikes), the expected value of parallel fiber activity rate P is nearly constant. This, of course, does not mean that parallel fiber patterns would be independent of mossy fiber patterns, but merely that the overall level of activity (i.e., spikes per second) of parallel fibers could be constant in spite of what percentage, or at what rate, the mossy fibers are firing.
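The effect of the two Golgi inputs can be sketched numerically by solving Eqs. (1) and (2) for P. The sketch takes Gr = 100 and Go = (1/1000) x 100,000 = 100 as in the text. The spontaneous rate Sp = 5 and the sample mossy fiber rates are purely illustrative assumptions, as are the function names.

```python
def parallel_rate(M, K, Gr=100.0, Go=100.0, Sp=0.0):
    """Steady-state parallel fiber rate P from Eqs. (1) and (2).

    Obtained by substituting I = (K*M + P)*Go into P = (M - I + Sp)*Gr
    and solving for P.
    """
    return Gr * (M * (1.0 - K * Go) + Sp) / (1.0 + Gr * Go)

def slope(K, Gr=100.0, Go=100.0):
    """dP/dM from Eq. (3)."""
    return Gr * (1.0 - K * Go) / (1.0 + Gr * Go)

# Sp = 5.0 and the three mossy fiber rates are assumed values for illustration.
for K in (0.0, 0.005, 0.01):
    rates = [parallel_rate(M, K, Sp=5.0) for M in (10.0, 50.0, 100.0)]
    print(f"K={K:<6} dP/dM={slope(K):+.4f}  P={[round(r, 3) for r in rates]}")
```

With K = 0 only the feedback path acts and P still rises slowly with M. As K approaches 0.01 the feed-forward path cancels the remaining dependence, which is the near-constancy argued above.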

The mossy fiber inputs to Golgi cells probably also serve to stabilize parallel fiber rates under transient conditions. The feedback path via parallel fibers involves delays. The feed-forward path is undoubtedly faster acting. The net result of Golgi cell activity seems therefore to be to stabilize the level of parallel fiber activity to a nearly constant value under all conditions.

It will thus be hypothesized that the surface of the cerebellum is dotted randomly with active parallel fibers and that the density of this activity is very nearly uniform, both spatially and temporally. It was noted earlier that if this density of parallel fiber activity is 1% or less, patterns are easily recognized and quickly learned. Furthermore, a 1% activity level is more than adequate from an information theory standpoint. Therefore, it will be further hypothesized that the density of parallel fiber activity is on the order of 1%.

Fig. 7

FIG. 7. Parallel fiber rate control circuit.

  • M, expected value of mossy fiber input in spikes per second;
  • P, expected value of parallel fiber output;
  • I, expected value of Golgi cell rate;
  • Sp, expected value of spontaneous granule cell rate;
  • Gr, transfer gain of granule cell network;
  • Go, transfer gain of Golgi cell network;
  • K, relative strength of mossy fiber input on Golgi cells to that of parallel fiber input.


As was shown previously, recoding from N fibers to 100N fibers, under the restriction that only 1% of the output fibers are active for any input pattern, expands the number of possible patterns from 2^N to about 100^N, or an expansion of around 50^N. In the cerebellum the number of input mossy fibers is approximately 5 x 10^4/mm^2. Thus the pattern-expansion capacity of 1 mm^2 of cerebellar cortex is on the order of 50^50,000. Just what this means in increased pattern-recognition capability is unclear, but we get the feeling it is quite significant. This argument is even more compelling when it is realized that the mossy fiber system undoubtedly carries only a very restricted subset of the 2^N (really R^N, where R is the number of distinguishable levels of fiber firing rate) possible input patterns. Thus the recoding from N fibers to 100N fibers may well produce an enormous increase in classification capability of cells in the cerebellum functioning as pattern-recognition response cells.
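To get a feel for the size of this number, the expansion factor 50^N with N = 5 x 10^4 can be expressed as a power of ten. The variable names below are illustrative. The calculation works with logarithms because the number itself is far too large to print.

```python
import math

mossy_fibers_per_mm2 = 5 * 10**4           # N, the figure quoted in the text
expansion_per_fiber = 100 / 2              # 100^N / 2^N = 50^N

# Base-10 logarithm of 50^N, computed without forming the number itself.
log10_expansion = mossy_fibers_per_mm2 * math.log10(expansion_per_fiber)
print(f"50^{mossy_fibers_per_mm2} lies between 10^{math.floor(log10_expansion)} "
      f"and 10^{math.floor(log10_expansion) + 1}")
```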

If this hypothesis of mossy fiber recoding by granule cells is correct, it implies that, to a neurophysiologist probing with an electrode, any parallel fiber should appear to fire uncorrelated with neighboring parallel fibers, at least in an unanaesthetised awake preparation.

An intuitive feel for why this recoding process is advantageous can be obtained from a simple example. Consider a Perceptron with only two association cells. There are then at most four different patterns of association cell firings. Suppose now it is desired for the response cell to fire whenever a sensory pattern occurs that produces an association cell pattern of 01 or 10, and it is desired for the response cell not to fire for any association cell pattern of 00 or 11. Try as we might, it is impossible to find any combination of weights that can cause the response cell to have this behavior. It is rather simple to make the response cell fire on 01 and 10, and not to fire on 00. However, the 11 pattern creates a problem.

If, however, an expansion recoder is put between the sensory cells and the association cells, so that there are, for example, five association cells, the problem is much easier. The sensory pattern that previously produced the association cell pattern:

01 now might produce 00100;

10 now might produce 01001;

00 now might produce 10000;

11 now might produce 00010.

It is trivial to adjust weights so that association cell patterns 00100 and 01001 cause the response cell to fire, and the patterns 10000 and 00010 cause the response cell not to fire. The training procedure would consist of at the most one adjustment for each pattern.
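The two-cell and five-cell cases can be compared with a minimal error-correction trainer. The fixed threshold of 0.5, the unit weight step, the zero initial weights and the limit of ten passes are assumptions made for this sketch. The recoded patterns are the ones listed above.

```python
def train(patterns, targets, n_weights, threshold=0.5, step=1.0, max_epochs=10):
    """Error-correction training; returns (weights, adjustments, learned)."""
    weights = [0.0] * n_weights
    adjustments = 0
    for _ in range(max_epochs):
        errors = 0
        for cells, target in zip(patterns, targets):
            fired = int(sum(c * w for c, w in zip(cells, weights)) > threshold)
            if fired != target:
                errors += 1
                adjustments += 1
                # Raise active weights if the cell should have fired, lower them otherwise.
                delta = step if target == 1 else -step
                weights = [w + delta * c for w, c in zip(weights, cells)]
        if errors == 0:
            return weights, adjustments, True
    return weights, adjustments, False

targets = [1, 1, 0, 0]                       # fire on 01 and 10, not on 00 and 11

# Two association cells: patterns 01, 10, 00, 11.
direct = [(0, 1), (1, 0), (0, 0), (1, 1)]
# Five association cells, using the recoded patterns given in the text.
recoded = [(0, 0, 1, 0, 0), (0, 1, 0, 0, 1), (1, 0, 0, 0, 0), (0, 0, 0, 1, 0)]

_, _, direct_ok = train(direct, targets, 2)
weights, n_adjust, recoded_ok = train(recoded, targets, 5)
print("two cells learned:", direct_ok)
print("five cells learned:", recoded_ok, "after", n_adjust, "adjustments", weights)
```

In the two-cell case the 11 pattern undoes whatever the 01 and 10 patterns taught, so training never settles. In the five-cell case each firing pattern needs one adjustment and the non-firing patterns need none.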

A computer simulation of this type of recoding process has been run for a more complicated case. Twenty (20) mossy fibers were modeled. An expansion recoder of mossy rosettes, granule cells, and Golgi cells was modeled that transformed 20 mossy fiber firing rates into 2000 granule cell firing rates. Golgi cell feedback restricted the granule cells so that only about 1% of them could fire. The result was that for two very similar mossy fiber patterns the granule cell firing patterns were similar in some respects but quite distinguishable in others. Some granule cells responded exactly the same for both mossy patterns, but other granule cells responded entirely differently. This implies that mossy fiber input patterns that would be very difficult to distinguish if put directly into a Perceptron response cell are easily distinguishable after passing through the pattern recoder.
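The flavour of such a simulation can be suggested with a much cruder sketch. It is not the simulation described above. The random wiring, the four mossy inputs per granule cell, the random firing rates and the replacement of Golgi feedback by a simple "keep the most excited 1%" rule are all assumptions made for illustration.

```python
import random

def build_recoder(n_mossy=20, n_granule=2000, inputs_per_cell=4, seed=0):
    """Randomly wire each granule cell to a few distinct mossy fibers (assumed wiring)."""
    rng = random.Random(seed)
    return [rng.sample(range(n_mossy), inputs_per_cell) for _ in range(n_granule)]

def recode(wiring, mossy_rates, active_fraction=0.01):
    """Return the set of granule cells that fire for a mossy fiber pattern.

    Golgi feedback is approximated by raising the threshold until only the
    most strongly excited 1% of granule cells remain active.
    """
    excitation = [sum(mossy_rates[i] for i in cell) for cell in wiring]
    n_active = max(1, int(len(wiring) * active_fraction))
    ranked = sorted(range(len(wiring)), key=lambda g: excitation[g], reverse=True)
    return set(ranked[:n_active])

rng = random.Random(1)
wiring = build_recoder()
pattern_a = [rng.random() for _ in range(20)]
pattern_b = list(pattern_a)
for i in rng.sample(range(20), 3):          # a similar pattern: 3 of 20 fibers changed
    pattern_b[i] = rng.random()

active_a = recode(wiring, pattern_a)
active_b = recode(wiring, pattern_b)
print("active cells per pattern:", len(active_a), len(active_b))
print("cells active for both:", len(active_a & active_b))
print("cells active for only one:", len(active_a ^ active_b))
```

The cells active for both patterns correspond to the granule cells that "responded exactly the same", and the cells active for only one pattern are the ones that make the two inputs distinguishable.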


B. The Purkinje Response Cell

It has been argued that the parallel fibers contain information coded in an ideal manner to serve as the input to a Perceptron response cell. It will now be argued that the Purkinje cells serve a function similar to Perceptron response cells.

From a purely structural standpoint, the Purkinje cell certainly is related to granule cells very similarly to the way a Perceptron response cell is related to association cells. Each Purkinje cell has an enormous fan-in; each granule cell has a large fan-out. It is hard to conceive a more efficient parts layout for this type of circuit than the parallel fiber-Purkinje dendrite arrangement. A flat tree with input fibers piercing it at right angles creates the maximum possible fan-in for each Purkinje cell. The flat, closely stacked Purkinje dendritic trees allow the maximum possible fan-out for each parallel fiber. Any other arrangement would almost certainly decrease the ratio of computational elements to the brain tissue mass.

We may reasonably ask why this same structure does not exist in the cerebral cortex. The answer may well lie in the differences between the functions required of the cerebrum and of the cerebellum. The portion of the cerebral cortex that is best understood from a functional standpoint is the visual cortex. Here it is well known that a great amount of feature detection [9] takes place, such as line detection, edge detection, motion detection, and binocular correlation. Many of these transformations are translationally invariant over certain fields of view; that is, cells in the visual cortex respond to certain global features of the visual input irrespective of small changes in retinal coordinate position. It would appear, then, that in the cerebrum considerable feature-detection processing precedes, and perhaps is intermingled with, the expansion recoding circuitry. The geometrical requirements of translationally invariant global feature detection require elaborate plexuses of fibers crisscrossing in the cerebral cortex, and cells with their dendritic fields geometrically positioned to extract feature-dependent inputs from these fiber plexuses. Any pattern recoding and pattern-recognition circuitry interspersed in this tangle would certainly be less compact and regular than that found in the cerebellar cortex.

On the other hand, in the cerebellum, granule cell receptive fields [17] show no evidence of feature detection analogous to that found in cerebral cortical cells. This is not too surprising since there should be no need for translationally invariant feature detection in a system that senses body conditions and controls motor commands. The problem of the cerebellum is merely to recognize patterns of information from proprioceptive receptors and to generate the appropriate motor command signals. The circuitry to do this is arranged as compactly as possible. The result is the beautiful regularity of the cerebellum.

Large portions of the cerebellum receive inputs from and project back toward the cerebral cortex. Since the anatomy of this portion of the cerebellum is not appreciably different from the portion that interacts with the periphery, it is reasonable to assume that the transfer function is similar (i.e., a mossy fiber pattern input producing a Purkinje cell pattern output).

The nervous system has one constraint that does not exist in the Perceptron. In the nervous system a particular type of cell is either excitatory or inhibitory. Any single granule cell thus cannot be excitatory on one Purkinje cell and inhibitory on another. The basket and stellate b cells appear to provide a means of overcoming this deficiency. Basket and stellate b cells receive excitation from parallel fibers and inhibit Purkinje cells located transversely. This arrangement allows any parallel fiber to excite a number of Purkinje cells along its length, and to inhibit another group of Purkinje cells located on its flanks. As noted before, a parallel fiber is not likely both to excite a Purkinje cell directly and also to inhibit the same Purkinje via basket or stellate b cells. Thus, as shown in Fig. 8, the Purkinje cell looks very much like a Perceptron response cell. The only logical difference is that the inhibitory input to the Purkinje cell is collected and summed by flanking basket and stellate b cells before being relayed to the Purkinje cell. The inhibitory input of each basket and stellate b cell is also sent to many other Purkinje cells, but this fact is immaterial to any individual Purkinje. It is influenced only by the inputs it receives, not by the other places those inputs may go. In order to complete the analogy between Purkinje cells and Perceptron response cells, it is necessary to introduce adjustable synaptic strengths.

Fig. 8

FIG. 8. Cerebellar Perceptron:

  • P, Purkinje cell;
  • B, basket cells;
  • S, stellate b cells.

Each Purkinje cell has inputs of the type shown.


C. The Hypothesis of Variable Synapses

The fundamental hypothesis of this article is that parallel fiber synapses are adjustable on both Purkinje cell dendrites and stellate and basket cell dendrites. The mechanism of change in both cases is hypothesized to be closely related to climbing fiber input activity. It will be argued that both excitatory and inhibitory influences on Purkinje cells are specifically modified under the control of climbing fiber activity patterns.

Each Purkinje cell is contacted by a single climbing fiber. In a conscious animal the climbing fibers fire in short bursts of one or more spikes at a rate of about 2 bursts/sec [5, 18]. Each climbing fiber burst causes a single spike on the Purkinje axon followed by a complex burst of spike-like activity in the Purkinje dendritic tree and intense depolarization of the Purkinje cell. The single axon spike is followed by a pause in the spontaneous Purkinje axon spike activity for 15-30ms. This pause, accompanied by intense depolarization, was first observed by Granit and Phillips [8] and was termed the inactivation response to distinguish it from a normal pause in activity resulting from hyperpolarization. After the 15- to 30ms inactivation response, the cell gradually recovers its spontaneous firing rate over a period of 100-300ms [3]. As it approaches normal, the cell becomes once again responsive to parallel fiber input activity.

It is now hypothesized that the inactivation response pause in Purkinje spike rate is an unconditioned response (UR) in a classical learning sense caused by the unconditioned stimulus (US) of a climbing fiber burst. It is further hypothesized that the mossy fiber activity pattern ongoing at the time of the climbing fiber burst is the conditioned stimulus (CS). If this is true, the effect of learning should be that eventually the particular mossy fiber pattern (CS) should elicit a pause (CR) in Purkinje activity similar to the inactivation response (UR) that previously had been elicited only by the climbing fiber burst (US). In order to accomplish this result it is necessary to postulate that the climbing fiber input to the Purkinje cell not only causes the Purkinje cell to pause momentarily but also weakens any parallel fiber synapses that are tending to cause the Purkinje to fire during the inactivation response.

A possible mechanism for such weakening might be that there exists a critical interval near the end of the inactivation response after the effect of the climbing fiber burst has worn off sufficiently so that the cell can be fired by parallel fiber input but before the dendritic membrane has returned completely to normal. If the Purkinje cell fires in this interval, this firing is an error signal that signals every active parallel fiber synapse to be weakened.

The amount of weakening of each synapse is proportional to how strongly that synapse is exciting the Purkinje cell at the time of the error signal. The effect of this mechanism would be to train the Purkinje cell to pause at the proper times, that is, at climbing fiber burst times. After learning is complete, the Purkinje knows when to pause because it recognizes the mossy-parallel fiber pattern that occurred previously at the same time as the climbing fiber burst. Later, since each active parallel fiber synapse was weakened by the error signal, if the same mossy-parallel fiber pattern occurs again, the Purkinje will pause even without the climbing fiber burst. Thus, the Purkinje is forced to perform in a certain way by the climbing fiber teacher. After learning is complete, however, it behaves in that same way, under the same mossy fiber conditions, even in the teacher’s absence.

Note that this mechanism corresponds closely with the Perceptron training algorithm in that (1) if the response cell fires (or tends to fire) when it should not fire, then all synapses coming from active parallel fibers will be decreased or weakened; (2) if the response cell does not fire improperly, no adjustments are made.
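The weakening rule can be sketched for a single Purkinje cell. The ten synapses, the threshold of 2.0, the weakening rate of 0.5 and the two example patterns are assumptions chosen only to make the behaviour visible. The rule itself follows the description above: weaken each active synapse in proportion to its strength whenever the cell fires during a climbing fiber burst, and otherwise change nothing.

```python
def purkinje_fires(weights, active, threshold):
    """True if summed excitation from the active parallel fibers exceeds threshold."""
    return sum(weights[i] for i in active) > threshold

def climbing_fiber_trial(weights, active, threshold, rate=0.5):
    """One coincidence of a climbing fiber burst with a parallel fiber pattern.

    If the cell fires when it should be pausing, every active synapse is
    weakened in proportion to its current strength; otherwise nothing changes.
    Returns True when an error signal occurred.
    """
    if purkinje_fires(weights, active, threshold):
        for i in active:
            weights[i] -= rate * weights[i]
        return True
    return False

weights = [1.0] * 10                 # ten parallel fiber synapses of equal strength (assumed)
paired = [0, 1, 2]                   # pattern that coincides with climbing fiber bursts
unpaired = [5, 6, 7]                 # pattern never paired with a burst
threshold = 2.0

trials = 0
while climbing_fiber_trial(weights, paired, threshold):
    trials += 1

print("adjustments needed:", trials)
print("paired pattern fires:", purkinje_fires(weights, paired, threshold))
print("unpaired pattern fires:", purkinje_fires(weights, unpaired, threshold))
```

Once the paired pattern no longer drives the cell above threshold, further bursts produce no error signal and the weights stop changing, so the adjustment process terminates by itself.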

It is now possible to consider many climbing fibers, each firing at different rates in some spatial pattern C1 at time t1. This climbing fiber firing pattern will elicit a Purkinje firing pattern C’1. Assume at time t1 the mossy fibers have some firing pattern M1. Each climbing fiber will train its respective Purkinje cell (or cells) to recognize the mossy fiber input pattern M1 that was present when C1 occurred. If during training M1 on the mossy fibers occurs in coincidence with C1 on the climbing fibers, then after training the occurrence of M1 on the mossy fibers will elicit C’1 from the Purkinje cells whether or not C1 appears on the climbing fibers. It can then be said that climbing fiber pattern C1 has been imprinted, or stored, on mossy fiber pattern M1. In the same way a second climbing fiber firing pattern C2 can be stored on another mossy fiber pattern M2, and so on.

An important feature of this hypothesis is that the C’1 patterns coming out of the Purkinje cells are not necessarily binary patterns; C’1 represents the relative rates of firing of all the Purkinje cells. Thus relative patterns are stored and relative patterns are recalled.


D. Variable Inhibitory Synapses

Since variation of parallel fiber-Purkinje cell synapses is sufficient to cause patterns to be stored in the cerebellum, we might well suggest [11] that no further mechanism of variable inhibitory synapses is necessary. However, there are good reasons to further hypothesize variable inhibitory synapses.

First, if only the excitatory inputs to a cell are caused to decrease, while the inhibitory inputs are held fixed, eventually the cell fails to fire in response to any input pattern. Second, a pattern-recognition device based on only excitatory weight adjustment has inherently low capacity. Marr [11] estimates that a Purkinje cell capable of only excitatory synaptic adjustment has the capacity to make about 200 mossy fiber pattern dichotomies. However, a Perceptron with both positive and negative weight adjustments has the capacity to make about twice as many dichotomies as there are adjustable weights [4]. Thus, if both excitatory and inhibitory synapse adjustment is possible in the cerebellum, each Purkinje cell would have the capacity to make on the order of 200,000 pattern dichotomies. The adjustment of inhibitory weights thus results in a thousand-fold increase in recognition capacity. Third, any pattern-recognition system capable of varying weights in only one direction is necessarily very slow to learn. An example of the learning difficulties encountered by such a system can be seen by referring to Fig. 4. Assume a pattern M causes only one association cell to fire, and that this cell affects the response cell R1 through weight W1,2.

Four possible situations can exist when pattern M is first presented:

case 1 M desired in class 1, R1 = 1;

case 2 M desired in class 1, R1 = 0;

case 3 M desired in class 0, R1 = 1;

case 4 M desired in class 0, R1 = 0.

In case 1 and case 4, M is already in the proper class and no adjustment of weights is necessary. In case 3, the weight W1,2 needs to be decreased so as to force the R1 cell below threshold. In case 2, the weight W1,2 needs to be made more positive so as to raise the R1 cell above threshold. If such a positive adjustment is not allowed, another means is available. All the weights to R1 except W1,2 can be decreased, and the threshold of the R1 cell somehow decreased accordingly. This would have the same result as an increase in W1,2. As a mechanism likely to occur in the cerebellum, however, this scheme has several serious difficulties:

  1. Decreasing all weights except one is cumbersome. It is inconceivable to decrease 199,999 weights in order to increase 1.
  2. It is very difficult to suggest a mechanism with such abilities. The mechanism must, in case 3, decrease the synaptic strength of all active parallel fibers, but in case 2, decrease the synaptic strength of all except the active parallel fibers.
  3. If the threshold of the R1 cell is to be lowered along with all the weights except W1,2, this in itself implies that variable inhibitory synapses are necessary in the cerebellum.
  4. If basket and stellate cells have no variable synapses, it is hard to imagine why they are so numerous, or what is the purpose of their peculiar axon distributions. If these inhibitory interneurons merely serve the purpose of general threshold regulators, it would seem that a few cells should do as well. For example, only a few Golgi cells are necessary to set general threshold levels for an enormous number of granule cells. Yet there are about twice as many basket and stellate b cells as Purkinje cells. Surely these cells have a more sophisticated function than general threshold regulation. Variable inhibitory synapses could explain why basket and stellate cells are so numerous.
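The four cases can be tabulated in a few lines. The function names and the `allow_increase` flag are hypothetical and introduced only for this sketch. It records which direction the single active weight must move in each case and whether a system restricted to decreases can make that move directly.

```python
def required_change(desired_class, response):
    """Sign of the change needed in the weight from the one active association cell."""
    if desired_class == response:
        return 0                                  # cases 1 and 4: already correct
    return +1 if desired_class == 1 else -1       # case 2: increase; case 3: decrease

def achievable(change, allow_increase):
    """Can a single adjustment of the active weight produce the required change?"""
    return change <= 0 or allow_increase

cases = [(1, 1), (1, 0), (0, 1), (0, 0)]          # (desired class, R1) for cases 1 to 4
for number, (desired, r1) in enumerate(cases, start=1):
    change = required_change(desired, r1)
    print(f"case {number}: change {change:+d}, "
          f"decrease-only system copes: {achievable(change, allow_increase=False)}, "
          f"two-way system copes: {achievable(change, allow_increase=True)}")
```

Only case 2 defeats the decrease-only system, and it is this case that forces the cumbersome workaround of lowering every other weight.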


E. Site of Inhibitory Synaptic Change

Inhibitory synaptic strength variation could occur at two sites. One site is where basket and stellate b cells synapse on the Purkinje cells. This is perhaps an obvious first candidate. However, the amount of convergence is small. Certainly fewer than 1000 different basket and stellate b cells synapse on each Purkinje. The actual figure is probably less than 100. This is a far cry from the parallel fiber convergence of about 200,000 variable excitatory synapses. The addition of 100 variable inhibitory synapses would seem to add little to the recognition capacity of the Purkinje cell.

The second site where inhibitory inputs to Purkinje cells might be varied is at the parallel fiber synapses on basket and stellate b dendrites. A decrease in strength of the excitatory parallel fiber synapses on basket and stellate b cells results in a decrease in inhibitory input to the related Purkinje cells. The basket and stellate b dendritic trees are sparser than those of Purkinje cells, but they do contact perhaps 5% of the parallel fibers coursing through them. When account is taken of the fact that about 100 of these cells then synapse on a single Purkinje, the result is a convergence of variable inhibitory inputs to the Purkinje cell of the same order of magnitude as that of variable excitatory inputs. Thus the Purkinje recognition capacity is on the order of 200,000 patterns rather than 200 patterns as suggested by Marr [11].

It is interesting that lower forms, such as frogs, have no basket cells. A cerebellar Perceptron with no variable inhibitory weights is certainly possible. Its only shortcoming would be a very limited capacity for discrimination.

Several other facts support the hypothesis that the parallel fiber synapses on basket and stellate b cells are the sites of variable inhibitory weights. First, the basket and stellate cells contact the parallel fibers with dendritic spines similar to those of the Purkinje cells. Second, each climbing fiber, in addition to synapsing strongly on a single Purkinje cell, also sends collaterals, which synapse on the soma of adjacent basket and stellate cells. Since the climbing fiber input is assumed to be intimately related with varying parallel fiber synapses on Purkinje cells, it is perhaps reasonable to suggest that the same climbing fiber may also vary parallel fiber synapses on basket and stellate cells. The mechanism of variation could be identical or at least very similar. In other words it is argued that on every cell contacted by an active climbing fiber, each active parallel fiber synapse is weakened by the same mechanism regardless of whether the cell is Purkinje, basket, or stellate b. This hypothesis has the elegant feature that a single event causes a change in both excitatory and inhibitory influences. The fact that climbing fibers do not contact dendrites of basket and stellate cells may be accounted for by the fact that their dendritic arborisation is less extensive than that of Purkinje cells.

In order to satisfy the Perceptron training conditions that excitatory and inhibitory changes be equal on the average, it is merely necessary to assume that the size of the decrement in each synapse is such that the expected value of the excitatory change be equal to the expected value of the inhibitory change.


F. Pattern Storage on Excitatory and Inhibitory Synapses

The effect in terms of pattern storage of this scheme can be seen by referring to Fig. 9. Assume the climbing fiber firing pattern cf1 = 1, cf2 = 0 occurs. In this case P1 pauses and P2 is released from inhibition by B1 pausing. Further, assume a mossy fiber pattern occurs such that Pf1 = 1, Pf2 = 1. The coincidence of these two patterns will tend to decrease weights WP1 and WB1 but leave unchanged WP2 and WB2. At a later time when the climbing fibers are silent, cf1 = cf2 = 0; if the same mossy fiber pattern recurs such that Pf1 = Pf2 = 1, P1 will pause because of decreased WP1 and P2 will be disinhibited because of decreased WB1. Thus, the original climbing fiber response, P1 pause, P2 disinhibited, can be recalled by the mossy fiber pattern, which causes Pf1 = Pf2 = 1. It can thus be said that the climbing fiber pattern is imprinted on the mossy fiber pattern.

Note that all the adjustment of the variable synapses takes place in the immediate vicinity of the Purkinje cell excited by an active climbing fiber, even though the disinhibitory effects are felt by Purkinje cells far removed in the transverse direction.
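The storage and recall just described can be traced with a toy version of the circuit. The wiring used here is an assumption made for the sketch and is not taken from Fig. 9: Pf1 is taken to drive P1 and B1, Pf2 to drive P2 and B2, B1 to inhibit P2, and B2 to inhibit P1. The unit initial weights, the basket gain of 1 and the 50% decrement are likewise illustrative, and the critical-interval condition is simplified to "climbing fiber and parallel fiber active together".

```python
def purkinje_drive(w, pf1, pf2, basket_gain=1.0):
    """Net input to P1 and P2 under the assumed symmetric wiring."""
    b1 = w["WB1"] * pf1                      # basket cell B1, which inhibits P2
    b2 = w["WB2"] * pf2                      # basket cell B2, assumed to inhibit P1
    p1 = w["WP1"] * pf1 - basket_gain * b2
    p2 = w["WP2"] * pf2 - basket_gain * b1
    return p1, p2

def imprint(w, cf1, cf2, pf1, pf2, decrement=0.5):
    """Weaken active parallel fiber synapses on cells reached by an active climbing fiber."""
    if cf1 and pf1:
        w["WP1"] -= decrement * w["WP1"]
        w["WB1"] -= decrement * w["WB1"]
    if cf2 and pf2:
        w["WP2"] -= decrement * w["WP2"]
        w["WB2"] -= decrement * w["WB2"]

weights = {"WP1": 1.0, "WB1": 1.0, "WP2": 1.0, "WB2": 1.0}
before = purkinje_drive(weights, pf1=1, pf2=1)
imprint(weights, cf1=1, cf2=0, pf1=1, pf2=1)      # storage: cf1 active, cf2 silent
after = purkinje_drive(weights, pf1=1, pf2=1)     # recall: climbing fibers silent
print("drive to (P1, P2) before storage:", before)
print("drive to (P1, P2) on recall:     ", after)
```

On recall the drive to P1 falls and the drive to P2 rises, reproducing the P1 pause and P2 disinhibition that the climbing fiber originally imposed, while WP2 and WB2 are left untouched.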

In order to satisfy the requirement that the expected value of the change in excitation equals the expected value of the change in inhibition it is necessary to assume some things concerning the relative amount by which WP1 and WB1 are changed. The synapse of Pf1 on P1 occurs with a probability of nearly 1. The synapse of Pf1 on B1 occurs with a probability of around 0.05 or less. However, the effects of WB1 are distributed to 30-50 Purkinje cells, whereas the effects of WP1 are confined to one Purkinje cell. In addition, the strength of WB1 is multiplied by a gain factor governed by the strength of the basket cell synapses on Purkinje cells. Since this is a rather strong synapse, the gain factor is probably greater than 1. Thus in order for the total average decrease in excitation to equal the total average decrease in inhibition, the following equation must be satisfied.

ΔWB1 x PB1(Pf1) x DB1 x GB1 = ΔWP1 x PP1(Pf1)    (5)

where

  • ΔWB1 is the change in WB1,
  • ΔWP1, the change in WP1,
  • PB1(Pf1), the probability B1 contacts Pf1,
  • PP1(Pf1), the probability P1 contacts Pf1,
  • DB1, the number of Purkinje cells B1 contacts, and
  • GB1, the strength of the B1 synapse on Purkinje cells.



James Albus: A Theory of Cerebellar Function

FIG. 9. Climbing fiber input.  Each climbing fiber contacts a single Purkinje cell and several nearby basket cells or stellate cells, or both. If Pf1 is active when P1 or B1, or both, fire in the critical interval during a cf1 inactivation response, then WP1 or WB1, or both, are altered. This change in synaptic strength can later be read out in the form of Purkinje postsynaptic potentials by firing Pf1 again.


Everything considered, it is likely that ΔWB1 is less than ΔWP1. This judgment seems to be supported by the experimental fact that the effect of a climbing fiber on a basket cell is less strong than on a Purkinje cell [5]. Presumably a smaller climbing fiber effect produces less synaptic weakening.
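As a rough check on this judgment, Eq. (5) can be rearranged to give the ratio ΔWB1 / ΔWP1. The sketch below plugs in the figures quoted in the text (a contact probability of nearly 1 for P1, about 0.05 for B1, and 30-50 Purkinje cells per basket cell). The midpoint of 40 cells and the gain of 1.5 are assumptions chosen purely for illustration; the text says only that the gain is probably greater than 1.

```python
def weight_change_ratio(p_b1, p_p1, d_b1, g_b1):
    """Return dWB1 / dWP1 implied by Eq. (5).

    Eq. (5): dWB1 * p_b1 * d_b1 * g_b1 = dWP1 * p_p1
    """
    return p_p1 / (p_b1 * d_b1 * g_b1)


ratio = weight_change_ratio(
    p_b1=0.05,  # probability that Pf1 contacts B1 (from the text)
    p_p1=1.0,   # probability that Pf1 contacts P1 (from the text)
    d_b1=40,    # assumed midpoint of the 30-50 range given in the text
    g_b1=1.5,   # assumed gain; the text only says "probably greater than 1"
)
print(f"dWB1 / dWP1 = {ratio:.3f}")
```

With these numbers the basket cell synapse needs to change by only about a third as much as the Purkinje synapse, which is in line with the conclusion that ΔWB1 is less than ΔWP1.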

This cerebellar system now has most of the characteristics of a Perceptron; that is, it corrects errors by adjusting weights positively and negatively; the average total increase equals the average total decrease; the pattern being stored, in coincidence with the pattern on which it is stored, governs which weights are increased and which are decreased; and the adjustment procedure terminates as learning asymptotically approaches completion. In addition, the hypothesized cerebellar system exhibits the capacity to store information concerning the relative firing rates of climbing fiber patterns.


F. Defense of the Synaptic Weakening Argument

The argument that synaptic weights are weakened by learning rather than strengthened is counter-intuitive and contrary to most, if not all, theories of synaptic learning that have appeared in the literature. Thus it perhaps should be examined in more detail. There are three main reasons why synaptic weakening rather than strengthening is hypothesized to take place in the cerebellum.

First, the experimental data that are available seem to suggest it. Climbing fiber inputs cause Purkinje cells to pause. If the Purkinje cell is to learn to pause, parallel fiber excitation must be decreased.

Second, Perceptron theory proves that the most effective training algorithms are error correcting in nature. Thus, firing at erroneous times should reduce the tendency to fire again. Firing at the proper times requires no adjustment. This algorithm implies weakening of synapses that contribute to erroneous firings. It is possible to conceive an error correcting scheme that would operate by strengthening synapses but the mechanism seems quite unlikely. There are only two possible error conditions:

  • Cell fires when it should not. This condition can be corrected by weakening erroneous excitatory synapses (as suggested) or by strengthening erroneous inhibitory synapses. On the Purkinje cell the excitatory spine synapses seem much more likely candidates for variability than the inhibitory synapses. There are relatively few inhibitory synapses. Learning capacity would be quite low if on the Purkinje the inhibitory synapses rather than the excitatory were the site for variability.
  • Cell does not fire when it should. This condition can be corrected by strengthening erroneous excitatory synapses or by weakening erroneous inhibitory synapses. In this case it is difficult to suggest how the individual synapses know when an error has occurred. The absence of postsynaptic cell firing may be the correct response as far as each synapse knows. An additional piece of information is needed: the information that an error has occurred. It is difficult to imagine how this information is conveyed to synaptic sites in the absence of postsynaptic activity. Thus, if the Purkinje cell learns by error correction, the most probable mechanism is synaptic weakening in the presence of erroneous firing.

The third reason synaptic weakening is hypothesized to occur in the cerebellum is that there are serious stability problems of learned responses under conditions of overlearning if synaptic activity causes synaptic facilitation. Consider Fig. 10: C1 and C2 are climbing fibers synapsing with synapses of fixed strength on Purkinje cells P1 and P2. A parallel fiber pf synapses on P1 and P2 with variable-strength synapses of weights W1 and W2. If it is now assumed that the synaptic weights are strengthened by coincidence of pre- and postsynaptic activity, it is possible to write

ΔiW1 = fP1 · fpf at t = i    (6)

where

ΔiW1 is the increase in W1 at time t = i,

fP1 the frequency of spikes on P1, and

fpf the frequency of spikes on pf.

Let W1 originally equal 0W1. As learning takes place, the following situation obtains. At

t = 0, fP1 = kfC1 + 0W1 fpf;

t = 1, fP1 = kfC1 + (0W1 + Δ0W1)fpf;

t = 2, fP1 = kfC1 + (0W1 + Δ0W1 + Δ1W1)fpf;

t = 3, fP1 = kfC1 + (0W1 + Δ0W1 + Δ1W1 + Δ2W1)fpf;

and so on.

FIG. 10. Two Purkinje cells contacting the same parallel fiber.

We can readily see that the weight W1 continuously increases at each learning interval. In fact, since ΔiW1 is the product fP1 · fpf, and since fP1 increases during each learning interval, Δ0W1 < Δ1W1 < Δ2W1 < ··· . Therefore W1 grows at an exponential rate, and of course so does fP1. Certainly W1 must eventually saturate. Now suppose that during the same learning sequence a spike train also appears on C2 at half the frequency of that on C1:

fC2 = ½ fC1.

Until W1 saturates,

W1 ≈ 2 W2

Eventually, however,

W1 = W2 = saturation value.

Thus, after a sufficiently long period, all parallel fiber synapses will eventually become saturated. The very active ones will saturate first, but over a long time virtually every synapse will saturate. Synaptic facilitation suggests learning is exponential. Synaptic weakening suggests learning is asymptotic.
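The contrast between the two learning rules can be made concrete with a small simulation. This is a sketch under stated assumptions rather than anything taken from the paper: the learning rate, the saturation ceiling `w_max`, the zero starting weight, and the particular proportional form of the weakening rule are all chosen for illustration.

```python
def facilitation(f_c, f_pf=1.0, k=1.0, w0=0.0, rate=0.1, w_max=10.0, steps=60):
    """Eq. (6) style growth: each step adds rate * fP1 * fpf, clipped at w_max."""
    w, history = w0, []
    for _ in range(steps):
        f_p = k * f_c + w * f_pf               # Purkinje firing rate
        w = min(w_max, w + rate * f_p * f_pf)  # coincidence strengthens W
        history.append(w)
    return history


def weakening(w0=1.0, f_pf=1.0, rate=0.1, steps=60):
    """Assumed error-correcting rule: decrement proportional to erroneous drive."""
    w, history = w0, []
    for _ in range(steps):
        erroneous_drive = w * f_pf             # firing this synapse still causes
        w -= rate * erroneous_drive * f_pf     # smaller error, smaller correction
        history.append(w)
    return history


w1 = facilitation(f_c=1.0)   # climbing fiber C1
w2 = facilitation(f_c=0.5)   # climbing fiber C2 at half the frequency
wd = weakening()

print("after 10 steps: W1/W2 =", round(w1[9] / w2[9], 3))
print("after 60 steps: W1, W2 =", w1[-1], w2[-1])
print("weakened weight after 60 steps =", round(wd[-1], 4))
```

Under facilitation W1 stays at twice W2 until the ceiling is reached, after which both sit at the saturation value and the stored difference is lost. Under the weakening rule the weight shrinks by ever smaller steps and never reaches a hard limit.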

This problem could possibly be averted by proposing some sort of decay rate for all synaptic strengths. Thus synaptic strengths would not remain saturated. However, such a mechanism would need to be very exotic to prevent continued learning from degrading performance and, at the same time, to preserve learned patterns over long time periods. It is common experience that memories of motor skills are preserved rather well over periods of many years. It is also common experience that repeated practice of motor skills leads to improved motor performance, even when the practice sessions are intensive and of short duration (on the order of minutes or hours). It is difficult to conceive of a decay system that could preserve memory over periods of years and at the same time prevent saturation over periods of minutes.

It is an obvious fact that continued training in motor skills improves performance. Extended practice improves dexterity and the ability to make fine discriminations and subtle movements. This fact strongly indicates that learning has no appreciable tendency to saturate with overlearning. Rather, learning appears to asymptotically approach some ideal value. This asymptotic property of learning implies that the amount of change that takes place in the nervous system is proportional to the difference between actual performance and desired performance. A difference function in turn implies error correction, which requires a decrease in excitation upon conditions of incorrect firings.

This argument is not meant to suggest that synaptic facilitation does not occur anywhere in the nervous system. In fact the stellate a cells will shortly be conjectured to undergo synaptic facilitation. Synaptic facilitation very probably plays an important role in many places in the nervous system. However, in situations where saturation would degrade performance, and particularly in the cerebellar cortex, where other evidence points to weakening, synaptic weakening seems very likely to be the principal learning mechanism.

It might be argued that the saturation argument holds equally well in the opposite sense, that is, that all synapses would eventually be reduced to zero. One answer to this is that the synaptic strengths tend toward zero asymptotically. Therefore the weaker a synapse becomes, the less is its contribution to any erroneous firings and the less it is weakened by any correction. Another answer is that new variable spiny synapses may be hypothesized to spontaneously and randomly grow and mature into active, effective synapses. The result of this would not be to destroy learning but to mask it over a period of time by background noise. To clarify this point, no synapse that has undergone any decrementing is hypothesized to grow back in strength. However, new synapses are hypothesized to grow to full size and then mature into an effective state. From this point they are then decremented, perhaps all the way to zero.

There may be some evidence for such a phenomenon in the visual cortex of the mouse. Ruiz-Marcos and Valverde [16] note that the density of spines on pyramidal cells in mouse visual cortex rises to a maximum shortly after the mouse opens its eyes. From that time the density of spines decreases asymptotically to a smaller value. Light deprivation considerably reduces the spine density. This might suggest that spines develop randomly under tropic influence of presynaptic nerves and are specifically decremented in the process of learning.


G. Response Speedup via Stellate a Cells

The notion that occurrence of a particular mossy fiber pattern causes a decrease in excitation of Purkinje, basket, and stellate b cells, and that this decrease in excitation causes the proper response of the Purkinje cell, raises a question of response speed. The decrease in excitation resulting from a decay of synaptic transmitter substance is not generally considered to occur as quickly as a build-up of excitation resulting from release of transmitter substance. Thus a system that operates solely on decay of excitation may lack the speed necessary for quick movements. It will now be suggested that stellate a cells are ideally situated for providing a speedup mechanism.

The main structural difference between stellate a and stellate b cells is in their axon arborisation. The stellate a cells send synaptic contacts to Purkinje cells in their immediate vicinity and to adjacent Purkinje cells in the longitudinal direction. Thus it is quite likely for a parallel fiber to excite a particular Purkinje cell and to inhibit the same Purkinje via a stellate a cell. Climbing fiber collaterals also contact stellate a cells. Thus, following the same reasoning used for Purkinje, basket, and stellate b cells, it is not unreasonable to assume that coincidence between climbing fiber and parallel fiber activity effects a change in synaptic strength of stellate a cells also. It would seem, however, that in order to perform a useful function, the synaptic change in this case should be a strengthening rather than a weakening. It will be conjectured that coincidence of a climbing fiber spike with parallel fiber activity on a stellate a cell will cause an increase in the synaptic strength of the parallel fiber-stellate a cell synapse. Thus the stellate a synapses are conjectured to change in the opposite direction from all the other variable synapses under the same coincidence conditions.

Consider parallel fiber pattern M1 to be imprinted positively on stellate a cells, but negatively on an immediately adjacent Purkinje cell. Occurrence of pattern M1 causes the Purkinje cell to receive less excitation. Pattern M1 causes the stellate a cell to receive more excitation, and hence actively inhibit the Purkinje. The result would be an increase in speed of the Purkinje cell response.

The stellate a cell variable synapses would of course be subject to the saturation problem discussed previously. However, if the stellate a contribution to the Purkinje input were small compared to the other inputs from basket and stellate b cells and parallel fibers, the saturation effect would be small in the steady state. The stellate a input would be significant only in the first few milliseconds following a transient. In this interval the stellate a cell would get the Purkinje response going in the proper direction. Later the other inputs to the Purkinje would predominate to set the proper final value. The same effect would obtain if the stellate a response were not necessarily small but merely of short duration.

Note that in the arguments concerning stellate a cells the word conjecture was used rather than hypothesis. Very little is known concerning the behavior of stellate a cells and any confident prediction concerning their function is certainly premature. Stellate a cells may have nothing at all to do with memory or variable synapses. In the next section it is suggested that perhaps stellate a cells may have rather to do with attention mechanisms.


H. The Function of Recurrent Purkinje Collaterals

The fact that the cerebellum is spontaneously active allows it to achieve a high degree of sensitivity and precision. A spontaneously active system is essentially linear, at least for small inputs. Thus any small input will produce an output whose size will depend on both the size of the input and the gain of the system. If the system is not spontaneously active, small signals do not have any effect on the output until they exceed a certain threshold. This is usually not a desirable trait for a feedback control system.
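The point about thresholds can be illustrated with a simple threshold-linear unit. The function below and all its numbers (threshold, gain, drive levels) are hypothetical and are not drawn from the paper; the sketch only shows why a unit that is already firing responds to a small signal while a silent one does not.

```python
def firing_rate(signal, spontaneous_drive, threshold=1.0, gain=2.0):
    """Threshold-linear unit: zero output until total drive exceeds threshold."""
    return max(0.0, gain * (spontaneous_drive + signal - threshold))


small_signal = 0.1

# Spontaneously active unit: resting drive already exceeds the threshold.
active_response = (firing_rate(small_signal, spontaneous_drive=1.5)
                   - firing_rate(0.0, spontaneous_drive=1.5))

# Silent unit: the small signal alone never reaches the threshold.
silent_response = (firing_rate(small_signal, spontaneous_drive=0.0)
                   - firing_rate(0.0, spontaneous_drive=0.0))

print("change in output, spontaneously active unit:", active_response)
print("change in output, silent unit:", silent_response)
```

The active unit changes its output in proportion to the signal and the gain, whereas the silent unit produces no change at all.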

As was discussed earlier, the mossy fiber – granule cell – Golgi cell interconnection network appears to work so as to maintain granule cell activity at some relatively constant level. In addition, the Purkinje cell axons put out recurrent collaterals that are known to contact Golgi cells, basket cells, and other Purkinje cells. These Purkinje recurrent collaterals send inhibitory impulses over a wide-ranging area, even into adjacent folia. The Purkinje recurrent collateral synapses on other Purkinje cells have the effect of maintaining the average Purkinje cell activity fixed at a relatively constant level over the entire cortex. If the average Purkinje activity rises too high, the inhibitory effect of the recurrent collaterals drives it back down. If Purkinje cell activity drops too low, the decrease in inhibition will let it rise again. Thus a relatively constant spontaneous discharge rate will be maintained despite rather large variations in cell conditions, such as nutrition or fatigue.

Another effect of the recurrent collateral inhibition on Purkinje cells is the contrast enhancement effect of lateral inhibition. Thus any local increase in activity will be accompanied by a surrounding field of depressed activity. There also appears to be some specific contralateral inhibition produced by Purkinje recurrent collaterals.

The existence of Purkinje recurrent collateral synapses on Golgi cells is very interesting. The effect is that of both positive and negative feedback since the affected parallel fibers both excite the Purkinje cells directly and inhibit them via basket and stellate cells. The total effect may be that when a general area of the cerebellar cortex is actively engaged in processing information, the Golgi cells limiting the input to that area are suppressed, thus allowing input to that area more free access. This would then constitute a crude form of attention mechanism. Any area actively engaged in processing information would be given priority over other areas that are inactive at the time. This of course is quite speculative, but a rather pregnant possibility.

The function of Purkinje recurrent collateral synapses with basket cells is not clear. The effect is certainly that of positive feedback. Positive feedback is commonly used in electronic circuitry to produce one or the other of two effects: either oscillatory behaviour or bistable switching behaviour. There is no evidence of any oscillatory effects in the cerebellum that are likely to be mediated by Purkinje recurrent collaterals. There is, however, a curious bistable effect in the firing rate of Purkinje cells that may be caused by the Purkinje recurrent collateral interaction with the various interneurons. Although a Purkinje cell sometimes is spontaneously active, at other times the same cell is completely quiet except for climbing fiber responses. This rather implies that Purkinje cells have at least two stable states, one spontaneously active, the other completely silent. The transition between states seems to be somewhat correlated with climbing fiber activity [3]. We might speculate that certain parts of the cerebellum are switched on by an attention mechanism when they are needed, and switched off again when they are not in use. The Purkinje collateral – basket cell or Golgi cell circuit may provide the positive feedback necessary to switch between states. Specific climbing fiber patterns could provide the trigger signal to initiate the switching. Climbing fiber inputs to Golgi cells may be the means by which climbing fibers trigger Purkinje cells into an active state. Climbing fiber inputs to stellate a (or basket and stellate b) cells might trigger Purkinje cells into a quiet state. Although these notions are admittedly tenuous, such activity certainly is characteristic of control systems far less complex than the brain. It should not be surprising if similar behavior is found in the brain.


I. Effects of the Intracerebellar Nuclei

It must be emphasized that details of the microstructure in the intracerebellar nuclei are much less well defined than in the cerebellar cortex. Even less is known about detailed interactions and pathways outside the cerebellum altogether. However, it is felt that the following type of argument must eventually be made before the function of the cerebellum can be said to be understood.

FIG. 11. Interaction between the cerebellar cortex and nuclear cells. Mossy fibers act on Purkinje cells, which act as modified Perceptron response cells. Mossy fibers, climbing fibers, and Purkinje axons all interact in nuclear cells.

Nuclear cells in the cerebellar and Deiters nuclei are contacted by collaterals from mossy fibers, collaterals from climbing fibers, and Purkinje axons. Thus circuits of the type shown in Fig. 11 probably exist.

The frequency of firing of the Purkinje cell is of the form

fP = fc kcP – Xi(fm1, fm2, fm3, …, fmN) + f0P    (7)

where

fP is the firing rate of the Purkinje cell,

fc the firing rate of the climbing fiber,

kcP the climbing fiber input-Purkinje cell output transfer function,

Xi(fm1, …, fmN) the input to the Purkinje cell of a learned pattern Mi of mossy fiber inputs (the sign is negative since the Purkinje cell learns to pause), and

f0P the steady-state rate of the Purkinje cell.

The firing rate of the nuclear cell, which is also spontaneously active, is given by

fN = fc kcN – fP kPN + fm1 kmN + f0N    (8)

where f0N is the spontaneous firing rate of the nuclear cell and kcN is the climbing fiber input-nuclear cell output transfer function. Substitution of (7) in (8) gives

fN = fc(kcN – kP) + fm kmN + Xi(fm1, …, fmN) + f0    (9)

where kP is the combined effect of kPN and kcP, and f0 is the combined effect of f0P and f0N.

Several interesting observations can be made from Eq. (9). First, the output of the nuclear cell is directly affected by mossy fiber input. Thus the nuclear cell may be part of a reflex arc. Second, the strength of this reflex arc is modulated by patterns arriving on the mossy fibers corresponding to patterns previously stored by climbing fibers. Third, the effect of climbing fiber activity fc on the nuclear cell depends on the factor (kcN – kP); kP is a negative quantity since kPN, the effect of the Purkinje on the nuclear cell, is inhibitory, and kcP, the effect of the climbing fiber on the Purkinje, is the inactivation response. Thus the factor (kcN – kP) is always positive.

Since the climbing fiber pattern is stored in the Xi pattern, the effect of the mossy fiber Xi pattern associated with the climbing fiber pattern reinforces the climbing fiber’s effect on the nuclear cell. Thus, as learning takes place, less and less input from the climbing fiber is necessary to produce the same amount of nuclear cell response. Fourth, the effect of an input on mossy fibers through the function Xi(fm1, …,fmN) is a positive response. The Xi function in (7) decreases the output of the Purkinje cell and hence in (9) increases the output of the nuclear cell.
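The substitution of Eq. (7) into Eq. (8) can be checked numerically. The sketch below keeps every factor explicit instead of folding them into the combined symbols kP and f0 of Eq. (9), so it reflects one reading of the algebra and not the paper's own notation. All numerical values are arbitrary assumptions; kcP is taken as negative to represent the inactivation response.

```python
def purkinje_rate(f_c, x_i, k_cP, f_0P):
    """Eq. (7): fP = fc*kcP - Xi + f0P."""
    return f_c * k_cP - x_i + f_0P


def nuclear_rate(f_c, f_P, f_m, k_cN, k_PN, k_mN, f_0N):
    """Eq. (8): fN = fc*kcN - fP*kPN + fm*kmN + f0N."""
    return f_c * k_cN - f_P * k_PN + f_m * k_mN + f_0N


def nuclear_rate_expanded(f_c, x_i, f_m, k_cN, k_cP, k_PN, k_mN, f_0P, f_0N):
    """Eq. (7) substituted into Eq. (8), with all factors written out."""
    return (f_c * (k_cN - k_PN * k_cP)   # climbing fiber term
            + f_m * k_mN                 # direct mossy fiber term
            + k_PN * x_i                 # learned pattern, now positive
            + (f_0N - k_PN * f_0P))      # combined spontaneous rates


# Arbitrary illustrative values (assumptions, not measurements).
params = dict(k_cN=0.8, k_cP=-0.5, k_PN=1.2, k_mN=0.3, f_0P=40.0, f_0N=60.0)
f_c, f_m = 2.0, 10.0


def two_step(x_i):
    """Evaluate Eq. (7) and feed the result into Eq. (8)."""
    f_P = purkinje_rate(f_c, x_i, params["k_cP"], params["f_0P"])
    return nuclear_rate(f_c, f_P, f_m, params["k_cN"], params["k_PN"],
                        params["k_mN"], params["f_0N"])


direct = two_step(x_i=5.0)
expanded = nuclear_rate_expanded(f_c, 5.0, f_m, **params)
print("two-step:", direct, "expanded:", expanded)
print("nuclear rate with no learned pattern:", two_step(x_i=0.0))
```

The two routes agree, and a larger learned input Xi raises the nuclear cell output, which matches the fourth observation above.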



It is reasonably certain that patterns of activity on mossy fibers represent to the cerebellum the position, velocity, tension, and so on of the muscles, tendons, and joints. This is feedback information that is required to control precise or sequential movements, or both. This information must modulate signals to the muscles to achieve precise movement under varying load conditions. This feedback information must also be able to generate the next command in a sequence of muscle commands in order to produce sequential motor activity at a subconscious level. The functioning of the cerebellum, as hypothesized in this article, seems rather well suited for either or both of these behaviors.

Assume, for example, that the red nucleus sends a command C1 through the inferior olive and thence via climbing fibers through Purkinje cells and nuclear cells to the muscles. At this time the muscles and joints in their resting state are sending pattern M1 to the cerebellum via mossy fibers. Thus C1 is imprinted on M1. Now when C1 reaches the muscles, they respond by moving to a new position. This generates a new mossy fiber pattern M2. By this time a second command C2 is sent from the red nucleus. Command C2 will be imprinted on M2. In a similar manner C3 is imprinted on M3, C4 on M4, and so on. This process may be continued for a lengthy sequence of motor commands C1C2C3… and resulting body positions M1M2M3… . Upon repetition of the sequence of motor commands C1C2C3…, the signals from the red nucleus will be reinforced at the nuclear cells by output from Purkinje cells responding to feedback mossy fiber patterns M1M2M3… . Upon each repetition more and more of the muscle control can be assumed by the output of the Purkinje cells, and less attention is required by higher motor centers.

Once learning is complete, the sequence of motor commands C1C2C3C4 can be elicited entirely from the Purkinje cells via the mossy fiber input patterns M1M2M3M4… . Little input is required from higher centers except perhaps to initiate or terminate the sequence.
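The chaining of commands and feedback patterns can be sketched as a lookup that is driven by its own consequences. This is a deliberately crude illustration: the dictionary memory, the string labels and the `body_response` function (a stand-in for the muscles, in which command Ck leads to feedback pattern M(k+1)) are all hypothetical and are not part of the paper.

```python
def imprint(commands, feedback_patterns):
    """Store each command on the feedback pattern present when it arrived."""
    return dict(zip(feedback_patterns, commands))


def body_response(command):
    """Hypothetical plant: command Ck moves the body to the state M(k+1)."""
    return "M" + str(int(command[1:]) + 1)


def replay(memory, start_pattern, max_steps=10):
    """Regenerate the command sequence from feedback patterns alone."""
    produced, pattern = [], start_pattern
    for _ in range(max_steps):
        if pattern not in memory:      # nothing imprinted: sequence ends
            break
        command = memory[pattern]      # stored command recalled by feedback
        produced.append(command)
        pattern = body_response(command)
    return produced


# Training: higher centers supply C1..C4 while feedback M1..M4 arrives.
memory = imprint(["C1", "C2", "C3", "C4"], ["M1", "M2", "M3", "M4"])

# Recall: only the initial feedback pattern is supplied.
sequence = replay(memory, "M1")
print(sequence)
```

Once the associations are stored, supplying the first feedback pattern is enough to run through the whole sequence, with each recalled command producing the feedback that recalls the next.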

The theory so far has no means of initiating or terminating such a sequence. It is possible that this operation takes place in the intracerebellar nuclei or outside the cerebellum altogether. Lack of detailed anatomical and physiological data makes it difficult to conjecture how this function is accomplished. However, it is perhaps not unreasonable to speculate that the Scheibel collaterals of climbing fibers to Golgi cells or stellate a cells, or to both, may be related to initiation or termination of sequence generation in the cerebellar cortex. The Golgi cells control the mossy fiber input pathway, which is a vital link in sequence generation. Excitation of Golgi cells via Scheibel collaterals could cut off mossy fiber input to the cerebellum and terminate a sequence. Inhibition of Golgi cells by Purkinje recurrent collaterals, on the other hand, would lower Golgi inhibition, possibly in response to specific patterns. This might initiate sequences upon certain key commands. Golgi cells may also have variable synapses, since they possess both spine synaptic contacts with parallel fibers and input from climbing fibers. However, more data are necessary before confident predictions are possible on these points.

The circuit described can also function as a modulator of conscious motor activity on climbing fibers. Assume that a sequence of motor commands from higher centers C1C2C3… had been imprinted on a series of mossy fiber patterns M1M2M3… as before. If the muscles upon receipt of conscious command C1 were to encounter greater than usual resistance, this would delay or prevent the appearance of M2 at the cerebellum, and instead a pattern M’2 would appear, signalling the existence of extraordinary resistance to motion. The pattern M’2 would modify pattern C1 in a manner different from M2, perhaps calling for additional force or some other modification. What M’2 produces is governed by what previously had been imprinted on M’2. If previously C’2, an additional force command, had been imprinted on M’2, then C’2 would be substituted for C2 automatically when the M’2 feedback signal was received instead of the usual M2. By this means a sequence of conscious commands can be modified at the reflex level by cerebellar activity. This perhaps is the means by which motor activity such as running or skating can be under conscious control in a general sense but under reflex feedback control at the individual muscle level.

The implication, then, is that climbing fibers carry from higher centers control patterns that are to be stored. In this form the cerebellar memory becomes a form of conditioned reflex. If the climbing fibers are cut, we would expect deficiencies primarily in conscious motor control and further conditioning. This may in some measure account for the data of Mettler [12], who noted a lack of obvious severe effects when climbing fibers were cut.

Marr [11] suggests an interesting analogy of the cerebellum as a language translator between data in the cerebrum and command sequences needed by the muscles. The cerebellum thus becomes analogous to a computer compiler that translates source language instructions into machine language instructions for execution by the machine hardware. Following the same analogy, the cerebellum becomes a subroutine library in which subroutines can be stored from above and cycled from below.




The theory of cerebellar function set forth in this article makes possible a number of predictions that are subject to experimental verification:

  1. Parallel fibers do not fire in coordinated beams in a conscious active animal, but rather in a widely scattered, apparently random fashion.
  2. One percent or fewer of the parallel fibers are active simultaneously, and this activity level is quite constant.
  3. Parallel fiber synapses with dendritic spines on Purkinje cells, basket cells, and stellate cells are modifiable synapses.
  4. The Purkinje cell response can be conditioned by climbing fiber inputs. Climbing fiber spikes are the unconditioned stimulus (US). Mossy fiber activity patterns are the conditioned stimulus (CS). The climbing fiber inactivation response is the unconditioned response (UR).
  5. The conditioning mechanism is a three-way coincidence between the inactivation response, a cell spike due to parallel fiber excitation, and parallel fiber synaptic activity.
  6. Parallel fiber synapses on Purkinje cells, basket cells, and stellate b cells are weakened by incorrectly firing during climbing fiber activity.
  7. Climbing fibers are essential for acquisition of certain types of motor skills, and for cerebellar feedback control of conscious motor activity. They are less necessary for conditioned reflex behaviour.
  8. Some of the mechanisms hypothesized in the cerebellum will almost certainly also occur in other parts of the brain. The expansion recoding system; the imprinting of patterns from specific fiber inputs onto synapses of nonspecific fibers; the use of laterally coursing inhibitory interneurons to achieve both positive and negative synaptic weight adjustment; the weakening of synaptic weights during training to achieve convergence; these are all basic principles of data processing likely to occur elsewhere in the nervous system.



The author thanks Mr. Anthony J. Barberra for his valuable criticism and suggestions.



  1. S. Albus, A model of memory in the brain, Cyberneticus (1970) (in press).
  2. S. Cajal, Histologie du système nerveux de l’homme et des Vertébrés, Tome II. Maloine, Paris, 1911.
  3. D. Bell and R. J. Grimm, Discharge properties of Purkinje cells recorded on single and double microelectrodes, J. Neurophysiol. 32(1969), 1044-1055.
  4. M. Cover, Classification and generalization capabilities of linear threshold units, Rome Air Development Center Tech. Documentary Rept. RADC-TDR-64-32 (1964).
  5. C. Eccles, M. Ito, and J. Szentagothai, The cerebellum as a neuronal machine. Springer, Berlin, 1967.
  6. Escobar, E. D. Sampedro, and R. S. Dow, Quantitative data on the inferior olivary nucleus in man, cat and vampire bat, J. Comp. Neurol. 132(1968), 397-433.
  7. A. Fox, D. E. Hillman, K. A. Siegesmund, and C. R. Dutta, The primate cerebellar cortex: A Golgi and electron microscope study, Progr. Brain Res. 25(1967), 174-225.
  8. Granit and C. G. Phillips, Excitatory and inhibitory processes acting upon individual Purkinje cells of the cerebellum in cats, J. Physiol. (London) 133(1956), 520-547.
  9. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex, J. Physiol. (London) 160(1962), 106-154.
  10. Jakob, Das Kleinhirn, in Handbuch der mikroskopischen Anatomie des Menschen IV/I (W.V. Mollendorf, ed.). Springer, Berlin, 1928.
  11. Marr, A theory of cerebellar cortex, J. Physiol. (London) 202(1969), 437-470.
  12. A. Mettler, In a discussion following a paper by J. C. Eccles in Neurophysiological basis of normal and abnormal motor activities (M. D. Yahr and D. P. Purpura, eds.), pp. 411-414, Raven Press, N.Y., 1967.
  13. Minsky and S. Papert, Perceptrons: An introduction to computational geometry. MIT Press, Cambridge, Massachusetts, 1969.
  14. J. Nilsson, Learning machines: Foundations of trainable pattern-classifying systems. McGraw-Hill, New York, 1965.
  15. Rosenblatt, Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Spartan Books, Washington, D.C., 1961.
  16. Ruiz-Marcos and F. Valverde, Temporal evolution of the distribution of dendritic spines in the visual cortex of normal and dark-raised mice, Exptl. Brain Res. 8(1969), 284-294.
  17. T. Thach, Jr., Somatosensory receptive fields of single units in cat cerebellar cortex, J. Neurophysiol. 30(1967), 675-696.
  18. T. Thach, Discharge of Purkinje and cerebellar nuclear neurons during rapidly alternating arm movements in the monkey, J. Neurophysiol. 31(1968), 785-797.
  19. T. Thach, Discharge of cerebellar neurons related to two maintained postures and two prompt movements, II: Purkinje cell output and input, J. Neurophysiol. 33(1970), 537-547.








About 2.0

The topics of posts on this blog have become rather narrower than the ‘neuroscience, technology, philosophy’ tagline would suggest. Here, I set out my stall with something a bit more informative than the references to Greek mythology and German literature you will find on the ‘About 1.0’ page.

The site considers various age-old philosophical problems (of consciousness, of free will, of morality, of knowledge, of science) but from a neuroscientifically-oriented standpoint.


A Physicalist Worldview

It takes a ‘physicalist’ stance:

  • There is only physical ‘stuff’ (such as matter).
  • Consequently, there is a gradual transition rather than a sharp distinction between self and non-self (the ‘environment’).

This is in contrast with the ‘traditional’ dualist view:

  • The realms of ‘mind’ and ‘matter’ are separate.
  • Consequently, there is a sharp distinction between the two.

Dualist ideas are as good as obsolete among neuroscientists but are still dominant among the general population and they underpin religious beliefs.

I have made an analogy in some posts, between:

  • Dualism: an old house with subsidence, and
  • Scientific physicalism: a brand new house, but still under construction.

It is becoming increasingly difficult for religiously-inclined people to ignore the cracks in the walls of the old house. Scientifically-inclined people feel superior with their new building but they generally do not recognise that their home is incomplete. The aim here is to look at how the new house might look when completed, so we can then judge if it is better than what came before – morally as well as scientifically.


A Simple Theory of the Brain

The ‘latest’ science is used to inform a ‘latest’ philosophy – a suggestion of what future generations might accept as normal. It is often the case that ‘better’ explanations are unsatisfactory to those who have grown up with a different worldview but are accepted almost unquestioningly by their grandchildren who have grown up with that new explanation established.

That ‘latest’ science is neuroscience (currently fashionable). At the heart of what I present is a model of the brain (as formulated by others) variously called the ‘Bayesian Brain’, ‘Predictive Brain’ or Karl Friston’s ‘Variational Free Energy’ (the term that I generally mention) and ‘active inference’. I frequently refer to it by the phrase ‘hierarchy of predictors’ as I think this is a more descriptive, more accessible term.

There is no presumption that this model of the brain is ‘correct’ (as it might be viewed by our grandchildren). It is a grossly simple explanation for the most complex one-and-a-bit-kilograms you will find anywhere in the universe. But it is hoped that it provides a better model of how the brain works than any established model and is the most appropriate non-academic one for the purposes here.


Biology, Physics and Philosophy

For me growing up, physics was full of ‘crunchy’ big ideas whereas biology (wherein neuroscience lies) was the soggy accumulation of little facts:

  • of meticulous drawings of bats,
  • of naming the parts of a bat,
  • of cataloguing the 1,240 species of bats and
  • of estimating the number of bats per square mile.

Even biology’s big idea – evolution – was soggy, providing qualitative post-hoc explanations in contrast to physics’s quantifiable predictions.

Ultimately, there are two philosophical questions:

  1. ‘Why is there something rather than nothing?’, and
  2. ‘Why are we conscious, so as to be able to perceive that ‘something’ and to be able to ask the above question?’

Or, alternatively:

Physics promised answers to the former; biology ignored the latter.



But there have been significant developments in neuroscience since my formal education ended, not least in the ability to visualize what is going on. Coming late to the biology party, I was astounded by exquisitely crunchy mechanical behaviour in microbiology, such as in the machinery of the synaptic vesicle (see below).

This ‘crunchy’ side of biology has not been, and maybe still isn’t, conveyed to the general public. Biology is more precise and more appealing for physics-y type people after all.

And neuroscience now promises to provide some sort of answer to big questions, like ‘what is it like to be a bat?’.

Sorry Sheldon, physics is fuddy-duddy. Neuroscience is where it’s happening.


Amy Farrah-Fowler (neuroscientist) and Sheldon Cooper (physicist)



Simple and Un-rigorous

Philosophers pride themselves on their rigour. Philosophy has been described as ‘rigorous but not technical’ in contrast to science being ‘both rigorous and technical’.

But this blogsite might be described as ‘technical but not rigorous’.

It is speculative. It relies on immature ‘pre-science’. It aims at simplicity. It is reductionist. It aims to be simple enough for an intelligent layman to have a basic understanding for it then to be a springboard to detail elsewhere. It is unconstrained by the shackles of rigour in academia. True, I occasionally cite academic papers and books, but I generally don’t want to clutter things up with justification. If you seek justification, just google.

And I try to avoid ‘neural correlates’. I try to avoid citing scientific associations between some phenomenological experience (such as empathy) and something physically observable within the brain (such as an increase in activity in the Anterior Cingulate Cortex, as observed from changes in blood oxygenation in functional MRI scans). Such justifications make people more likely to believe neuroscientific propositions – but they are clutter getting in the way of the bigger picture.

The approach is systemizing, trying to assemble ideas (typically others’ ideas) together to form that bigger picture. Again, the assemblings are not rigorous. But hopefully some may prove interesting or ring true.


Shades of Grey

Dualism obviously creates a sharp distinction between body and soul. But this also creates other sharp distinctions as a result. Between human and animal for example. And between responsibility and not. These are crisp black and white distinctions.

But with physicalism, dichotomies are presented just for simplicity of explanation. Continua (shades of grey) are there if we want to see them, particularly if it helps understanding. Barriers can be removed.


Between human and animal



And finally…

There might be some reason why I would want to distance myself from this blog. I might not want to have it associated with my professional life. I might think it is embarrassingly amateurish but that there is some merit in sharing it. I might think that my writing style is terrible (I certainly don’t pay that much attention to it).

But the blog is about ideas and it shouldn’t matter.

The site is anonymous. I just prefer it that way.



Photo credit: Patrick Bouquet. Also available in colour.


Guilt and Shame

What constrains people’s behaviour when no one else is looking?

Both guilt and shame are feelings resulting from oneself having committed a bad act. But:

  • Shame arises from others knowing that one has committed that bad act, whereas
  • Guilt arises from internally knowing that one has committed that bad act.

Their opposites are:

  • to have high esteem – a good reputation, and
  • to have high self-esteem.


Although shame and guilt exist in all societies to some degree there is a stereotypical idea, originating from E. R. Dodds, that:

  • Oriental societies are ‘shame societies’ in which social order is maintained primarily through shame, and
  • Western societies are ‘guilt societies’ in which social order is maintained primarily through guilt.

But shame versus guilt discussions are muddied by there being different understandings of the difference between the two. I think this can largely be resolved by thinking of four categories rather than two. In my terminology, this gets described as one type of guilt and 3 types of shame:

  1. ‘Public shame’: the painful feeling arising from others having observed the improper behaviour. (Embarrassment is the much weaker cousin of public shame.)
  2. ‘Ultimate shame’: the painful feeling arising from an all-seeing God having observed the improper behaviour.
  3. ‘Self shame’: the painful feeling of a negative evaluation of oneself. The focus is on the defectiveness of the actor (the self).
  4. ‘True Guilt’: the painful feeling resulting from a belief that one has done something wrong. The focus is on a defectiveness of the act.


  • For some, the demarcation between ‘guilt’ and ‘shame’ separates 1 from 2, 3 and 4: it is that shame derives from other people being aware of the misdemeanour.
  • For some, the demarcation between ‘guilt’ and ‘shame’ separates 1 and 2 from 3 and 4: it is that guilt derives from oneself recognising that one has done wrong.
  • For some, the demarcation between ‘guilt’ and ‘shame’ separates 1, 2 and 3 from 4: it is the distinction between the actor and the act; the distinction between the guilty “I did something bad” and the shameful “I am bad”.

Regarding the extra shades of shame, the most significant demarcation between ‘guilt’ and ‘shame’ is that separating 1 from 2, 3 and 4. With this demarcation, ‘ultimate shame’ and ‘self-shame’ are referred to as ‘guilt’. This is how confusion arises.
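
The four categories and the three competing demarcations can be laid out as a small lookup, which makes it easier to see why the same feeling gets called ‘guilt’ by one person and ‘shame’ by another. This is only an illustrative sketch in Python: the names CATEGORIES and label, and the idea of a numeric ‘cut’, are my own assumptions for the example, not established terminology.

```python
# Illustrative sketch only: the names and the numeric 'cut' are assumptions
# made for this example, not established terminology.
CATEGORIES = {
    1: "public shame",
    2: "ultimate shame",
    3: "self shame",
    4: "true guilt",
}


def label(category: int, cut: int) -> str:
    """Label a category 'shame' or 'guilt' under a given demarcation.

    cut is the highest-numbered category still counted as shame:
      cut=1 separates 1 from 2, 3 and 4 (others knowing),
      cut=2 separates 1 and 2 from 3 and 4 (oneself recognising),
      cut=3 separates 1, 2 and 3 from 4 (actor versus act).
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if cut not in (1, 2, 3):
        raise ValueError(f"unknown demarcation: {cut}")
    return "shame" if category <= cut else "guilt"


if __name__ == "__main__":
    # Print how each demarcation labels the four categories.
    for cut in (1, 2, 3):
        row = ", ".join(f"{CATEGORIES[c]}: {label(c, cut)}" for c in CATEGORIES)
        print(f"cut after {cut} -> {row}")
```

With cut=1, both ‘ultimate shame’ and ‘self shame’ come out labelled as ‘guilt’, whereas with cut=3 they are both ‘shame’. The disagreement is about where the line is drawn, not about the feelings themselves.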

Shame versus Guilt around the world

Morality and Self-Regulation

As has been proposed previously, the purpose of morality is to balance the wants of oneself against the sometimes conflicting wants of others for the general mutual benefit of the many individuals in a society. It is a benign means of social control.

Shame is the feeling that arises from other people knowing about one’s misdemeanours resulting in damage to one’s reputation. Improving reputations benefits individuals and the wider society, leading eventually to a culture of the presumption of trust.

But, in a pure shame culture, it is still OK to do something wrong:

as long as no one knows you have done it!

credit: Scott Adams

Ensuring proper moral conduct by having something always watching

…because this will not damage one’s reputation. The basic rule is:

Don’t do bad, or don’t get caught!

In contrast, guilt should promote better moral cooperative behaviour in that it makes people behave well

even when there is no one else watching.

The basic rule is:

Don’t do bad!

which is more obviously aligned to what we understand about morality. A guilt culture should accelerate the presumption of trust and attain a higher level of trust than a shame culture. It is like empathy in that it is not essential for a moral society, but it helps.

Guilt aligns the values of the self with those of society:

  • shame arises from a violation of cultural or social values, while
  • guilty feelings arise from violations of one’s internal values.


  • shame involves the feeling of disgust of others towards oneself, whereas
  • guilt involves the feeling of disgust of oneself towards oneself.

In short, as a way of maintaining social order,

  • self-regulation of individuals is preferable to external regulation;
  • that is: guilt is preferable to shame.

Neuro guilt and shame

Catholic Guilt and Protestant Shame

Our reputation does not matter to us equally with all other beings. We are not likely to be concerned about our reputation with our neighbour’s dog, for example. We generally only care about how we are seen by other people – and not all people – because there are repercussions for us for our transgressions.

And, for those that believe, there is also a very significant other – an all-seeing God – with very serious repercussions for us for our transgressions. For them, guilt is not known only to that individual. God knows too and He can punish the sinner in the afterlife. This is what I have termed ‘Ultimate Shame’.

If an individual publicly confesses their sins, their anxiety will be reduced even though they then suffer shame. Individuals would obviously prefer to be shamed before as few people as possible and for this knowledge not to spread beyond them. Confession to a single discreet priest manages this. ‘Catholic guilt’ becomes ‘Catholic shame’. Actually, Catholics trade ‘Ultimate Shame’ (supposed ‘guilt’) for something halfway between ‘Ultimate Shame’ and ‘Public Shame’ – a ‘Limited Shame’. The individual has acknowledged their wrong-doing, been forced to reflect upon it and compare it against the values of wider society. In recognising their wrong-doing, they have demonstrated that it was the act that was bad and not the actor.

Southern Europe is said to have a shame culture whereas their Northern cousins have more of a guilt culture. Southern Europeans are predominantly Catholic whereas Northern Europeans are predominantly Protestant. Catholics trade their guilt for shame but the Protestants are stuck with guilt. In either case, even when there are no human witnesses, acts are not entirely private; it is still generally shame rather than guilt that is involved.

And shame does not require punishment. For example, I suspect that this is the case for the majority of those in Northern Europe who state they are Christian on census forms. This majority have no direct outward practice of their religion from one census to the next. They have an un-theologized, un-analysed, un-formalized ‘personal God’ with whom they have a relationship that helps them. There is almost certainly a hope of an afterlife but no pretence of knowing – indeed, no effort applied to knowing more. But there is no punishment codified. Wrong-doings result in shame. Their personal God is an entity that holds them to account – God is an other.

Self Shame

Atheists do not have ‘Ultimate Shame’. A society of only atheists would seem to be a shame culture in which you really can do anything as long as no one finds out about it, as there would be no damage to reputation.

The secularization of the West raises concerns from many that there will be a decline in moral standards as a result.

This may be true or it may be false, depending on evidence (something to be looked at in the future). But it is not a given. If individuals did feel bad about it, there could still be what I am calling ‘True Guilt’ – an intrinsic bad feeling about oneself which would motivate people away from ‘bad’ behaviour.

And even this secular guilt can still be a form of shame. We can be shamed before the ‘other within’ with whom we have our internal conversation. I call this ‘Self Shame’.  We can be brought up (conditioned) to have that ‘other within’ questioning us. At times, it can be our conscience. Shame, and the presumed moral standards that arise from ‘others knowing our wrong-doings’, is still possible without an omniscient being.

Act and Actor

But it is also possible that we can be brought up (conditioned) without an ‘other within’ questioner. I started off saying that:

  • Shame and high esteem arise from others knowing, whereas
  • Guilt and high self-esteem arises from a self-

but I have basically categorized everything as a form of shame, except for the absence of shame. Guilt does not feature.

One more distinction between shame and guilt was:

  1. For shame, the focus is on the defectiveness of the actor.
  2. For guilt, the focus is on a defectiveness of the act.

Dualist Deontology and Physicalist Virtue Ethics

As I have frequently contrasted previously:

  • Dualists (of the ‘substance’ type) believe that mind and matter are separate, whereas
  • Physicalists reconcile the two, believing that ‘mind’ (such as it exists) supervenes on the physical matter.

For dualists, mind is pure, untainted and unconstrained by the material and therefore could exist after the destruction of the material body.

  • The religious are almost always dualists,
  • and physicalists are almost always not religious.


  • To a dualist, our bodies might be very different but our ‘minds’ are essentially the same, capable of making right and wrong choices – and being judged (now, or later) equally. And it is the acts that are judged, not the actor. Having recognized a sin, a mind can change and act differently next time. The rightness or wrongness is in the act. This is in line with the ethical positions of Deontology and Consequentialism.
  • But to a physicalist, a bad act is causally a result of a bad actor. It could not have been otherwise. If I sinned – and recognized that I did – then there is something wrong with me – the biological me. Rightness or wrongness is embedded in the actor and the actor cannot easily change. This is in line with the ethical position of Virtue Ethics.

In my terminology, the distinction between guilt and self-shame is that the focus is on the act in the former and on the actor in the latter. Thus:

  • Guilt is associated with dualism and act-based ethical positions.
  • Self-shame is associated with physicalism and virtue ethics.


This was the 18th part of the ‘From Neural Is to Moral Ought’ series. It built upon a predecessor part, ‘Trust’, in which social institutions evolve so that agents (rationally) self-regulate their behaviour. Here I have considered the emotions (bad feelings) of guilt and shame. (It is similarly parenthetical to the series in that it is not ‘neuro’ at all, but the well-worn dualism-versus-physicalism dichotomy is considered again.)

Where I am going with this:

  • In a physicalist worldview with virtue ethics, it is the actor that is bad. Bad acts cannot just be confessed away. This is shame, and shame can be a very destructive inability to change.
  • Guilt comes with heightened anxiety of some form such as repression or self-punishment. But shame can also be very destructive.
  • A physicalist worldview can also reduce personal responsibility – ‘my brain made me do it’.

Mirroring and Mimicry

This is the seventeenth part of the ‘From Neural Is to Moral Ought’ series and follows on:

Here, I look at how mimicry and mirroring the behaviour of others can arise in the ‘hierarchy of predictors’ model of the brain, which leads to us empathizing with them.

68: Mimicry and Contagion

From Observing Others to Acting Ourselves

From previously, we have seen that:

  1. Observing others precedes the observation of self. For example, the recognition of our own hands was built upon the observation of the hands of others. Hence there is a significant association between the two. The lowest levels in the hierarchy of our brains react to the observation of our own hands and those of others in the same way.
  2. We have learnt to integrate sense (the sight of our own hand) with movement (of our own hand).
  3. There is therefore an association between the sight of another’s hand and the movement of one’s own hand.
  4. It is only later that we learn to distinguish the observation of oneself from the observation of others (at a higher level of the brain hierarchy).
  5. There is therefore a ‘leak’ as it were from observing others to our own movement. This is not something that can be entirely unlearnt at lower levels and must be corrected at a higher level.

The Rubber Hand Illusion Again

So, the observation of another can cause movement because of ‘mistakes’ at the low levels.

As described previously, low levels react quicker than higher levels. So those ‘mistakes’ made by the low levels are quickly corrected by higher levels. It is better:

  • to have low levels acting fast ‘in case what I see is actually happening to me’ and for the higher levels to then veto with ‘it is not me, after all’

than

  • to spend ages deciding ‘what I should do’, by which time it is too late and it becomes ‘what I should have done’.

For example, in the case of the ‘rubber hand illusion’, fear is generated by low levels predicting that our hand is about to be smashed. This becomes pain when the rubber hand has been hit, arising from the belief that our hand has been hit. But higher levels quickly break the association between the rubber hand and oneself. The pain is only fleeting. (Considerable effort is needed to fool a mature brain into adopting a rubber hand as its own in the first place – for example, through the stroking of the left hand in addition to the stroking of the rubber hand on the right.)

Sensorimotor Contagion: unconscious mimicry

When low levels are screaming out for attention, sending huge error signals upwards, higher levels will respond. But in some cases, there is not enough of an upward signal to warrant higher-level attention. The higher levels are otherwise engaged and no higher level vetoing gets done. (This lack of response is reminiscent of the ‘boiling frog’ anecdote.)

Hence we get behaviour such as this:

  1. We are talking.
  2. You have your arms folded.
  3. I am not paying attention to them but I sub-consciously notice that your arms are folded.
  4. Because of the ‘leak’ from observing others to own movement, I start moving my arms and continue to do so until there is no dissonance between the sight of your arms folded and the proprioceptive sense of where my arms are.

This is then an example of (what I shall call) ‘sensorimotor contagion’ – a sub-conscious mimicry.
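
This sub-conscious mimicry can be read as a simple error-reducing loop: the ‘leak’ keeps nudging my arm position until what I see of you and what I feel of myself no longer disagree. The Python sketch below is a toy illustration under my own assumptions: posture is reduced to a single number (0.0 for unfolded, 1.0 for folded), and the names mimic, leak and tolerance are hypothetical, not part of any established model.

```python
def mimic(seen_posture: float, own_posture: float,
          leak: float = 0.3, tolerance: float = 0.01,
          max_steps: int = 100) -> list[float]:
    """Nudge own posture towards the seen posture until dissonance is small.

    Toy assumption: posture is one number (0.0 = unfolded, 1.0 = folded).
    Returns the sequence of own-posture values, starting with the initial one.
    """
    if not 0.0 < leak <= 1.0:
        raise ValueError("leak must be in the range (0, 1]")
    trajectory = [own_posture]
    for _ in range(max_steps):
        # Dissonance: mismatch between the sight of the other's arms
        # and the proprioceptive sense of where my own arms are.
        dissonance = seen_posture - own_posture
        if abs(dissonance) <= tolerance:
            break  # no dissonance left worth acting on
        # The 'leak' converts part of the mismatch into movement.
        own_posture += leak * dissonance
        trajectory.append(own_posture)
    return trajectory


if __name__ == "__main__":
    # Your arms are folded (1.0); mine start unfolded (0.0).
    path = mimic(seen_posture=1.0, own_posture=0.0)
    print(f"steps taken: {len(path) - 1}, final posture: {path[-1]:.3f}")
```

Nothing in the loop ‘decides’ to copy anyone. The movement simply continues for as long as there is a mismatch, which is the sense in which the mimicry is unconscious.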

But it can work the other way around. It could equally be:

  1. We are talking.
  2. I have my arms folded.
  3. I am not paying attention to them but I sub-consciously notice that your arms are unfolded.
  4. Because of the ‘leak’ from observing others to own movement, I start moving my arms and continue to do so until there is no dissonance between the sight of your arms unfolded and the proprioceptive sense of where my arms are.

Regardless of who mimics who, there is a tendency towards behaving in a similar way.

Another well-known example of mimicry is ‘yawn contagion’ where one finds it difficult not to yawn if one sees someone else yawning.

69: Emotions

Moral Development

Previously, it has been explained how the ‘hierarchy of predictors’ framework learns – with short-term, higher-level knowledge eventually getting relearnt at a lower level to become:

  • embedded in longer-term memory,
  • quicker, and
  • more instinctive.

The same applies to our moral learning.

Over time, we learn to balance the wants of others against those of ourselves. Conscious deliberation gives way to an automatic intuitive response.

  • When we are young we are dependent on our immediate, caring family and are selfish.
  • Gradually, we learn that it is sometimes better to put the wants of another first in the short term in the expectation that this will pay back in the longer term. This is a one-to-one relationship.
  • Balancing the wants of ourselves with others gradually comes more naturally. We no longer have to consciously deliberate about when and how another will pay the favour back. Give-and-take becomes a (sub-conscious) habit and this means we sometimes prioritize others when there is no pre-determined payback.
  • This then leads to the establishment of a reputation within the immediate social group with which we identify as being most like ourselves. This group can be the extended family, but, because this stage is reached among adolescents, it is commonly a group of teenage friends.
  • In adulthood, our experience of ‘those like us’ expands. We act morally towards strangers in our society. Moral deeds become a currency rather than a good. A favour from A to B does not have to be repaid (traded) by B. B can act well towards others and others will act well towards A. A’s payback is indirect and unquantifiable (as is everyone else’s).

Conscious moral deliberation (the rational) becomes habituated in a lower level and, as it does so, it gets extended more generally to a wider and wider group of individuals.

This account of moral development is consistent with that of Lawrence Kohlberg.

Hierarchical Levels

In the ‘hierarchy of predictors’ framework, we can categorize the many levels into 5 zones. From top to bottom:

  • 5: Conscious deliberation,
  • 4: Sub-conscious,
  • 3: Emotional: prioritizing actions,
  • 2: Integrating the various senses,
  • 1: Perceiving the various senses.

Note that zones 1, 2, 3 and 4 all operate unconsciously.


Emotions are lumped somewhere in the middle of this hierarchy. They should clearly be below consciousness and are above the low-level sensorimotor levels. (We typically think of this hierarchy being the cerebral cortex, but emotion is also strongly associated with sub-cortical parts of the brain such as the limbic system.) The ‘hierarchy of predictors’ framework is, well, just a framework – a simple skeleton around which to build an understanding.

Emotions motivate. They can produce strong, sophisticated motor action from integrated sense input. Strong emotions will shut off the error signals upwards, making it difficult for the rational to override the actions resulting from those emotions.

And we feel emotions – they have a subjective quality. There is ‘something it is like’ to feel just as there is ‘something it is like’ to see. Anger has a subjective experience just as seeing the colour blue does.

Emotional Contagion

Consider the simplistic progression:

  • There is an emotional association between happiness and smiling.
  • If you are happy, you may smile.
  • If I see you smiling, I may mimic you and smile too.
  • Smiling has an emotional association with happiness.
  • So I am then happy.

But more directly:

  • If I see you smiling, I understand you are happy. (This may initially have been conscious but has become habituated and automatic.)
  • My understanding of happiness is associated with my memory of the emotion of happiness.

This is ‘emotional contagion’.

Both the ‘zone 4’ (subconscious memory of emotion) and the ‘zone 2’ (sensorimotor integration) together pull on the ‘zone 3’ emotions.

Connecting to Our Emotions

We have seen how zones 1 (perceiving) and 2 (sensorimotor integration) are associated:

  • Low-level sensation does not differentiate between self and others.
  • This can lead to unconscious mimicry (see above).

And we have seen how zones 5 (the conscious) and 4 (the sub-conscious) are associated:

What happens at one level also happens at a neighbouring level (but generally at a different time).

And we have now seen that zones 2 (sensorimotor integration) and 4 (the sub-conscious) are associated with zone 3 (emotions).

This covers the complete vertical integration of the zones.
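
The claim of complete vertical integration can be checked mechanically by treating the five zones as nodes and the associations as links, then confirming that every zone can be reached from every other. A minimal Python sketch, in which the names ZONES, ASSOCIATIONS and reachable_from are my own and the links are simply the four pairings named in this section:

```python
from collections import deque

# The five zones of the 'hierarchy of predictors', bottom to top.
ZONES = {
    1: "perceiving",
    2: "sensorimotor integration",
    3: "emotions",
    4: "sub-conscious",
    5: "conscious deliberation",
}

# Associations named in the text: 1-2, 5-4, and 2 and 4 each with 3.
ASSOCIATIONS = [(1, 2), (2, 3), (3, 4), (4, 5)]


def reachable_from(start, links):
    """Return the set of zones reachable from start via the given links."""
    neighbours = {}
    for a, b in links:
        # Associations are treated as working in both directions.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        zone = queue.popleft()
        for nxt in neighbours.get(zone, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    connected = reachable_from(1, ASSOCIATIONS) == set(ZONES)
    print("all zones vertically integrated:", connected)
```

Removing either link to zone 3 breaks the chain, which is one way of picturing how a weakened connection to the emotions would leave the top and bottom of the hierarchy disconnected.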

Contagious Well-Being

From the ‘sensorimotor contagion’ (above) we end up with mimicry – some combination of you mimicking me and me mimicking you. Regardless, from my perspective, you become like me. This is good because people similar to me generally behave like me and I am then more confident that I can predict their behaviour. There are no surprises and this contributes towards higher personal well-being. Being around people we are familiar with makes us feel good. And well-being is an emotion.

Contagion, Autism and Psychopathy

It has been found that ‘yawn contagion’ is highest in those more empathetic and lowest in cases of autism and psychopathy. This is what would be expected:

  • The autistic are less susceptible to contagion because of their difficulty in making the cognitive connection to emotion. This is from zone 3 (emotions) to zone 4 (the sub-conscious) – they are less able to understand your emotion from their observation of you.
  • The psychopathic are less susceptible to contagion because of their difficulty in making the physical connection to emotion. This is from zone 2 (sensorimotor integration) to zone 3 (emotions).

Agency and Contagion

Agency concerns the ownership of senses and action – me or others.

  • If agency takes place in zone 4 (the sub-conscious), then our emotions are triggered both by things happening to ourselves and us seeing them happen to others.
  • If agency takes place in zone 2 (sensorimotor integration), we have no emotional attachment to what happens to others.

Perhaps not surprisingly:

  • The more empathetic have less of a distinction between self and other.
  • The more psychopathic have a stronger ‘sense of self’.

This speculation is an alternative account.

70: Mirror Neurons

No account of how neuroscience affects morality would be complete without referring to ‘mirror neurons’. These have been identified by some as the source of our empathy towards others.

An Overview

A simplistic overview of mirror neurons is as follows:

  • They fire when either I do something or when I see other people do something.
  • These mirror neurons are found in the premotor cortex, the somatosensory cortex and the inferior parietal cortex.
  • They do not fire when the object is missing (i.e. the action is only pretended) or when the object is present but without the actor, or when the actor is artificial.
  • They are concerned with the goals and intentions of actions.
  • They mirror the actions and intentions (the ‘what’ and the ‘why’) of other people onto ourselves and therefore help us understand them.
  • Hence they are the physical basis of empathy, through which we can understand other people’s intentions.

Criticism no. 1: No Neuron Type

The term ‘mirror neurons’ creates the impression that there is a particular type of neuron that fires either when I do something or when I see other people do something, and that these neurons are different in form from other neurons. This is not true.

Instead, it is better to speak at a higher level of a ‘mirror neuron system’ as part of a larger system.

Criticism no. 2: No Localization

It should not be surprising that there are neurons that fire both when we do something and when we see other people do something. We should expect to find such neurons beyond the premotor, somatosensory and the inferior parietal cortical regions. For example, in parts of the brain concerning low-level senses, there will be neurons that fire both when I see my hand and when I see yours. Such neurons (‘ordinary’ neurons) will be distributed over a wide area of the brain, even if they are not all categorized as ‘mirror neurons’.

And where they are to be found, mirror neurons are not exclusive. They constitute only about 10% to 20% of the neurons within the 3 cortical regions previously identified (the premotor cortex, the somatosensory cortex and the inferior parietal cortex).

Criticism no. 3: No Neural Correlates

This localization of mirror neurons to those 3 regions has been found from performing functional MRI (fMRI) scans. Tasks performed by someone are correlated against oxygen levels within particular parts of their brain, from which higher brain activity in these areas is inferred.

But these ‘neural correlates’ make every area of the brain special. Each area is doing the task it has been correlated with.

This is antithetical to all the ideas here which are based around theorizing about how the brain is working rather than just classifying it. And the particular theory here is the ‘hierarchy of predictors’.

Saying that there is a ‘mirror neuron system’ which provides us with an ability to empathize (for example) suggests that this ‘system’ is doing something special – something different from what other circuits of the brain are doing. ‘Special’ areas have a hint of magic about them; their explanations do not explain. A theory provides an understanding of why a particular area performs the function that it is correlated with (we are rather a long way off that when it comes to the brain). Theories are more parsimonious. fMRI evidence can support or falsify a theory but it cannot replace it.

Counter-Criticism no. 1: Neural Correlates

Note, however, that research has shown that neural correlates can make a statement more ‘true’! (Any statement that is supported by a claim (just a claim) about a relationship between it and some fMRI scanning result is more likely to be believed by the general populace than if no neuroscientific claim was provided.)

For example…

The areas in which these mirror neurons are ‘found’ actually do relate well to what would be expected in generating motor actions from sensory input:

  • The premotor cortex possibly handles the planning of movement,
  • the somatosensory cortex handles our sense of touch around our body (it is the location of ‘Penfield’s Homunculus’), and
  • the inferior parietal cortex is the location of sensory integration.

Counter-Criticism no. 2: No Simulation

Patricia Churchland is sceptical of the claims of mirror neurons, in two ways. The first is that the

‘whole claim that empathy depends on simulation’

has not been established.

Now, simulation is another name for prediction, but one that could only be applied to high-level prediction – conscious deliberation. High-level empathy is ‘cognitive empathy’ and this is not the same as the ‘emotional empathy’ that Churchland means. So in that way,

empathy does not depend on simulation

But the underlying neuroscientific framework to everything here – the ‘hierarchy of predictors’ (or the ‘predictive brain’) – holds that all levels of the brain are predicting, including emotional levels. In that way:

empathy must depend on prediction

Counter-Criticism no. 3: No Feeling

The second of Patricia Churchland’s criticisms mentioned here is that she scoffs at the idea that we actually feel what others feel when we see them in pain. Recounting seeing another person get stung by a wasp, she says she did not feel any pain in herself corresponding to what the other person will have felt. Instead:

 ‘what I did feel was a visceral generalized sense of awfulness’.

But each and every one of us is different. Over 1% of us are Mirror-Touch Synaesthetes who do claim that they actually feel what others feel when they see them in pain. Personally, I concur with the ‘generalized sense of awfulness’ but there is also a fleeting localized feeling (yes, I will call it a feeling) in the corresponding part of me at the start, although it quickly diffuses away. This is consistent with the ‘low-level fast’ and ‘high-level slow’ processes at work in the ‘rubber hand illusion’ described previously. We feel the pain of others more quickly than we can attribute ownership to it.

But Does it Make Any Difference?

Previously, I have argued that societies of trust can evolve just from the seed of maternal care. But the existence of empathy, arising from mimicry and mirroring, greatly accelerates that development. It motivates the vast majority of us to act in a more cooperative way.

But that doesn’t mean that our moral decisions should be driven by empathy – or any of our emotions for that matter. They might actually get in the way! They may make us more likely to make irrational (sub-optimal) moral choices! And not all individuals have the same degree of empathy. We need to have a morality that works with all types – all different types. But we also need one that works in practice for how we are physically constituted.

Mirroring and mimicry do not determine morality

but

morality must account for our mirroring and mimicry.


The Learning Pyramid

Relearning Hierarchies of Predictors

The ‘hierarchy of predictors’ hypothesis that operates by the ‘minimization of surprise through action and perception’ has been introduced and described elsewhere but the account has been static. It does not account for the way the brain organizes itself dynamically by learning through experience. The existing hypothesis is:

That there are many levels arranged in a hierarchy, with the bottom layer having sense input and motor output to the wider environment. Each level is trying to minimize surprise by maintaining a model of its wider environment within itself in order to predict its sense input and thereby act in response. At each level, the action associated with the prediction is enacted in proportion to how successful the prediction is, and the sense error is passed up to the sense input of the next higher level. Actions from that higher level propagate downwards in proportion to how wrong the prediction was. The level learns from the experience (adapts its predictions):
1. Proportionate to how wrong it was,
2. Inversely-proportionate to how much learning has gone before.
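As a minimal sketch of that learning rule for a single level: the class and attribute names below are hypothetical, and taking the step size to be exactly 1/experience is my own simplifying assumption rather than part of the hypothesis.

```python
class Level:
    """One predictor in the hierarchy (illustrative only)."""

    def __init__(self):
        self.prediction = 0.0  # the level's current model of its sense input
        self.experience = 0    # how much learning has gone before

    def learn(self, observed):
        """Adapt the prediction and return the sense error to pass upwards."""
        error = observed - self.prediction
        self.experience += 1
        # Step is proportionate to the error and
        # inversely proportionate to prior experience.
        self.prediction += error / self.experience
        return error


level = Level()
errors = [level.learn(x) for x in [4.0, 8.0, 6.0]]
```

With this particular choice the prediction is simply the running average of what has been sensed, so each new surprise moves it less than the one before.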

The elaborated hypothesis adds:

Ordinarily, new tasks (which don’t fit in with any previous predictions) propagate up to high (conscious) levels where they will be acted upon clumsily (slowly). Repeated occurrences of this new task will initially allow the high levels to improve their predictions. But after a while, lower levels will also learn and will end up acting upon the task and shutting off the higher levels (the error propagated upwards becomes small). Over time, the task will get ‘relearnt’ at lower and lower levels and this allows the response to become faster. The response becomes habitual.
Although higher levels react more slowly than low level ones (because they are further away from the environment), they adapt to changes more quickly.
In the early stages of life, there is no prior experience. No level is any good at predicting. Action will be determined by many levels and there will be learning at many levels, but the higher levels will learn fastest. However, whatever learning there is at lower levels will shut off learning at higher levels and over time this lower-level learning will come to determine behaviour. So, early gross pattern learning eventually beds down at the lowest levels and more refined pattern learning beds down on top of that.
In early learning, the lower levels will be making poor predictions and propagating significant errors up to higher levels for them to adapt to. But some action will be generated from those lower levels – and poor action at that. The higher levels must provide strong action downwards to try to suppress this poor behaviour. Over time, the higher levels will learn to react less strongly as their error inputs become weaker.

The hierarchy thus self-organizes its learning.
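The ‘relearning’ described above can be illustrated with a toy two-level hierarchy. This is only a sketch under loud assumptions: the function name, the constant input signal and the two fixed learning rates are all hypothetical, chosen just so that the high level adapts quickly and the low level slowly.

```python
def relearn(signal=1.0, steps=200, low_rate=0.05, high_rate=0.5):
    """Toy two-level hierarchy; returns (low, high) model values per step."""
    low = 0.0   # slow, entrenched level next to the environment
    high = 0.0  # fast-adapting level fed by the low level's error
    history = []
    for _ in range(steps):
        low_error = signal - low        # surprise at the low level
        high_error = low_error - high   # what the high level failed to predict
        low += low_rate * low_error     # low level adapts slowly
        high += high_rate * high_error  # high level adapts quickly
        history.append((low, high))
    return history


history = relearn()
early_low, early_high = history[4]   # after 5 repetitions of the task
late_low, late_high = history[-1]    # after 200 repetitions
```

Early on, most of the response comes from the high level. After many repetitions the low level predicts the signal almost perfectly, so the error passed upwards (and with it the high level's contribution) shrinks towards zero.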

Higher levels are refined models that can easily change. It can be said that it is here that there is short-term memory. Lower levels have entrenched behaviour and thus represent long-term memory.

Learning Pyramids


Integrating Pyramids of Predictors

A linear hierarchy as a model of the brain is of course a gross simplification. But simplification (‘abstraction’) enables us to see the wood for the trees – to get beyond the mass of detail in order to get to some understanding of the stupendously complex object that is the brain.

A ‘pyramid of predictors’ model is better than the ‘linear hierarchy of predictors’ model in that it provides more explanatory power for only a minimal increase in complexity.
This new model introduces the following refinements to the linear hierarchy thesis:
• Generally, a process communicates down to more than one lower-level process and is just one of many communicating to the process above it.
• At low levels, there is local action in response to a local stimulus. But higher up the hierarchy, more information is brought together, finding patterns across a wider range of sense input.
• Action is directed towards the child process that is feeding the largest error upwards. (This can sometimes lead to a misunderstanding of what is going on.)

So we get the self-organization of predictions being performed at the appropriate level in the pyramidal hierarchy as a result of:
1. The appropriate range of speed and sense input (lower levels operating quickly and higher levels being able to draw upon more information), and
2. The ‘relearning’ at lower levels.

We can speculate that this self-organization develops into the following levels or processes, from low (fast) to high (slow):

  1. Sensorimotor reflex: At the lowest level, there is a very close coupling between sense input and motor output. Example: reflex action.
  2. Sensory integration: Identification of patterns among multiple sense inputs, leading to a local expectation/prediction which may be confirmed or refuted. An example of this is the ‘information processing’ of detecting horizontal/vertical lines in the visual cortex of cats.
  3. Sensorimotor integration: The interaction between a particular sense function and the motor functions associated with that sense, such as vision influencing saccadic eye movement.
  4. Sensory integration: Higher-level identification of patterns of a particular sense, such as recognizing hands and how hands move.
  5. Sensorimotor integration: The interaction between a particular sense function and motor functions not associated with that sense. For example, during development it is found that two particular hands are exceptional (they are the exceptions to the learnt patterns of how hands move) – their jerky movements are surprising. Through integrating motor action with sense input, it becomes possible to learn how to control these hands so they move like others’ hands. (As yet, these hands are not one’s ‘own’; there is no ‘self’ yet.)
  6. Hypotheses and Deliberation: All ‘perception is hypothesis’ but this becomes more apparent at higher levels in the hierarchy. Some hypotheses result in action in the environment; some hypotheses result solely in imaginative ‘deliberations’ within the brain. All these must work within the limits of prior experience.
  7. Multi-modal sensory integration: Different sensations are integrated. Where there is conflicting information between senses, one ‘guess’ (hypothesis) must win. An example of this is the McGurk integration of sight and sound.
  8. Full sensorimotor integration: All senses and motor functions are united to create a coherent proprioceptive ‘body’, complete with feelings and emotions.
  9. Agency: A sense of ‘me’ arises. Those ‘exceptional’ hands identified earlier are indeed exceptional because they are mine. But things can still go wrong with identifying ‘me’ such as with the momentary confusion in the ‘rubber hand illusion’.
  10. Conscious deliberation: At the top are long-term deliberations which sometimes get distracted by lower-level emotional feelings screaming for attention and sometimes are able to suppress those emotions.

Some key points of that hierarchical self-organization are:
• It is not (necessarily) true that we detect that others are like us – it can easily be the other way around: we learn that we are like others. ‘Others’ precede ‘us’. In our pattern-matching learnings, we have copied others.
• Sensory detection of, for example, hands happens at a much lower level than the level that identifies whose hands they are.
• Agency happens at a higher level than feelings.

(Whilst quite stand-alone, this was the sixteenth part of the ‘From Neural Is to Moral Ought’ series in that it provides the necessary background to get from parts XV to XVII.)



(This is the fifteenth part of the ‘From Neural Is to Moral Ought’ series. It looks at emotional empathy, particularly by considering those with non-typical empathy.)

Empathy, Psychopathy and Autism

Care, Anxiety and Trust

Previously, we have looked at the biological development from solitary animals through to large societies:

  1. The emotional connection that drives pair bonding and the caring for offspring.
  2. Greatly-extended families have a social pecking order. This reduces in-group physical violence and promotes cooperation but there is considerable anxiety.
  3. Cooperation builds upon established reputations of individuals. Ultimately, this leads to the creation of institutions where there is trust between strangers by virtue of their affiliation to those institutions. Nothing more is required to build up a large society with established customs.

These customs, defining right and wrong ways of behaving, are the basis for morality, which improves the well-being of society’s members by reducing physical violence and psychological anxiety.

This progression elevates us up from an existence famously described as being one of:

“continuall feare, and danger of violent death; And the life of man, solitary, poore, nasty, brutish, and short.”

… to a life that can be sociable, enriched and long.

Chimpanzees and Psychopaths

But it is still a brutish society. Each member acts only for its own interests. Short-term interests are sometimes sacrificed in favour of longer-term interests. Acts are either selfish or altruistically selfish. Chimpanzees (creatures considered previously and having high cognitive abilities) in groups have been described as a society where everyone is a psychopath because of the behaviour of individuals:

  1. Their practices of deception and manipulation.
  2. Their faking of emotions to get attention and to influence others.
  3. Their only caring about hurting others because of what others will think; they do not appear to care unless someone else sees it.
  4. The precious little loyalty involved in male chimpanzee coalitions.

(Note: I am using the term ‘psychopath’ rather than ‘sociopath’ here and treating them as interchangeable terms.)

Humans have significantly higher cognitive abilities than chimpanzees. But, based on the argument so far, this would just mean that their deceptions and manipulations and their faking of emotions would be more sophisticated.

Such a society would be missing an important ingredient: one that could accelerate and help build up a society of institutions that could provide longer lives of better well-being. That ingredient is empathy.

Cognitive Empathy vs Emotional Empathy

Empathy is commonly divided into two types:

  • ‘cognitive empathy’ is the ability to understand what another is thinking – essentially the same as having a ‘theory of mind’ discussed previously, and
  • ‘emotional empathy’ is the ability to feel what another is feeling.

Psychopathy is typically characterized as a lack of emotional empathy but not of cognitive empathy.  A psychopath:

  • can know what you are thinking, but
  • cannot feel what you are feeling.

Autism and Empathy

Autism is also typically characterized as involving a lack of empathy, particularly a reduced or absent ‘theory of mind’, and as an ‘extreme male brain’: a combination of a low ‘empathy quotient’ with a high ‘systemizing quotient’.

Instead, it is better to characterize autism as the result of a delayed (and therefore reduced) perceptual ability to read the minds of others. For example, whether it is a contributing factor or an effect of autism, the characteristic lack of eye contact reduces the chances of correctly discerning the emotions of another in a specific situation and compounds the problem by providing a lack of opportunity to learn such emotional expressions. We can hypothesize that the flip-side of this is that the time spent not learning social skills is spent on developing systemizing interests instead. In terms of Daniel Dennett’s theory of levels of abstraction, they develop skills and interests concerning the ‘design stance’ when they would otherwise be developing an ‘intentional stance’ and a ‘theory of mind’. Their confusion among their peers in the social sphere leads to anxiety which is managed by focussing inwards onto their highly-developed interests. They will be reluctant to initiate social interaction. Once started however, they may talk incessantly about such interests, failing to understand that your interests are different from theirs.

But none of this represents a lack of empathy. To have ‘emotional empathy’ is to respond to the perceived emotional state of another by experiencing feelings of a similar sort. However, this can produce different responses:

  • To reflect back to the other as concern for the suffering of the other, or
  • To be absorbed, with self-centred feelings of anxiety and distress.

For the ‘severely autistic’, lacking a theory of mind, empathy will manifest itself as distress. For ‘high-functioning autists’, empathy will manifest itself as concern; once aware of another’s feelings, they can have the same degree of compassion as anyone else.

Incidentally, an alternative interpretation is the ‘intense world theory’ from Henry Markram (of ‘Blue Brain Project’ and ‘Human Brain Project’ fame) and his wife Kamila. It holds that those with autism perceive, think and remember too much, so that they retreat into a safe bubble to protect themselves from the pain of intensity. Virtually all people with autism spectrum disorder report various types of over-sensitivity and intense fear.

The Integration of Senses:  Synaesthesia

The McGurk Effect

Our senses do not operate in isolation but work together to discern the best understanding of the environment. For example, in listening to someone as part of a conversation, we have the visual sensation of seeing the lips moving in addition to the sound of hearing them. Surprisingly, when the brain tries to integrate the two senses, it is the visual sense that generally dominates, as is illustrated by the McGurk effect. Seeing the lips saying ‘far’ synchronized with the sound ‘bar’ makes us perceive ‘far’. By looking away from the speaker, we hear ‘bar’ again for the same auditory input.

The Rubber Hand Illusion

The ‘Rubber hand illusion’ is another well-known example of sensory integration. This time, it is the integration of seeing and feeling.

CAPTION: The Rubber Hand Illusion

The subject places two hands on a table. A cloth covers the left arm and hand and a rubber hand sticks out of the cloth around where we might expect the left hand could be. The experimenter repeatedly strokes the fingers on the right hand in turn and simultaneously strokes the corresponding finger of the prosthetic hand. After a while, the subject starts to perceive the rubber hand as their own hand. Suddenly, the experimenter hits the rubber hand with a hammer. The subject feels pain in their left hand even though that hand has not been hit. The experience is fleeting. The ‘spell’ is broken and it is soon clear that the left hand is not hurt. But the anticipation / expectation (the prediction) of pain results in the experience of pain.

Again, it is the visual sense that dominates.


Synaesthesia (‘sense fusion’) is a condition where a stimulus causes both a normal experience and an additional sensation. The normal and additional senses are thereby associated.

For example, in ‘grapheme-colour synaesthesia’ (prevalent in about 1 person per 70), there is the perception of a colour in addition to the recognition of a number or letter. (Note: this experience is simultaneous with seeing the number/letter in the colour it really is.) A grapheme-colour synaesthete might see a numeral as actually being black yet perceive the numeral as being green or red depending on the numeral. This would make it easier to identify different numerals in the example below.

Left: reality. Right: synaesthetic perception.

There is therefore in this case the close integration of:

  1. Vision, and
  2. The concept of numbers and letters.

Mirror-Touch Synaesthesia

As indicated in the ‘Rubber Hand Illusion’, the sight of something happening to one’s body causes the expected feeling associated with that event. This integration of vision and the ‘somatic senses’ (bodily senses such as touch) is normal. But for those with Mirror-Touch Synaesthesia, the sight of something happening to the body of another causes the expected feeling associated with that event in themselves. Literally, ‘I feel your pain’.

Empathy and Pain

Empathy is Pain

Tolerance of pain and sensitivity to pain appear to be related to our ability to feel the pain of another:

  • Mirror-touch synaesthesia is linked with empathy and Mirror-Touch Synaesthetes are particularly sensitive to pain.
  • Psychopaths feel pain but are able to disregard it. Pain then no longer serves its function of modifying the behaviour of the individual in both the immediate circumstance and in the future in order to protect that individual. The lack of pain makes an individual reckless. Thus, we may view psychopathy as being less an issue of deficient empathy and more as an issue of deficient feeling and emotion.

Empathy, Sense and Perception

For the non-‘neurotypical’ types considered above, it is possible to categorize them in terms of empathy:

  • A psychopath has cognitive empathy but not emotional empathy.
  • A mirror-touch synaesthete has cognitive empathy and an extreme emotional empathy.
  • Those on the Autism spectrum have difficulties with emotional empathy, with varying degrees of cognitive empathy ranging from normal (Asperger’s) to having no ‘theory of mind’ (severely autistic).

These terms are external (concerning social relationships between individuals) and psychological.

Alternatively, those types can be categorized in co-related terms of sense and perception:

  • A psychopath has much impaired sensitivity to feeling.
  • A mirror-touch synaesthete has much increased sensitivity to feeling.
  • Those on the Autism spectrum have impaired perception of feeling.

These terms are internal (about single individuals) and fit more into a neuroscientific framework.

Different Others

What role does empathy have to play in morality? We can ask:

  • ‘what if everyone was a psychopath?’ (hint: see the comparison with chimpanzees earlier), and
  • ‘what if everyone was a mirror-touch synaesthete?’

and draw the obvious conclusion that the latter would be an improvement over the former in terms of general well-being. But that is not to be considered here.

In reality, we must be asking:

  • ‘how do we deal with a society in which individuals have different levels of empathy?’, or
  • ‘how do we deal with a society in which individuals have different cognitive abilities?’.

We cannot have the ‘Golden rule’ expectation that others want what we want. In learning how to balance the wants of others with the wants of ourselves, we need to be able to understand what those others want. Whether we think in terms of empathy or in terms of feeling, we must accept that others are different from us and so the ‘Platinum rule’ applies.

Next: Continuing with empathy: mirroring and mimicry.


The Bayesian Inference of the Goat in the Game-Show

1. The Bayesian Brain

Many other posts on this blogsite refer to Karl Friston’s ‘Variational Free Energy’ theory of the brain as a hierarchy of predictors that adapt internal models of the environment based on experience, essentially using Bayesian inference. This is the so-called ‘Bayesian Brain’.

Here, I look at the well-known ‘Monty Hall problem’ to explain, using Bayesian inference, why the problem’s answer is correct and why people so often choose the wrong answer.

2. Bayes Theorem

Bayes Theorem is

P(H|D).P(D) = P(D|H).P(H)

(Notation: P(H|D) denotes the probability of H, given that D is true.)

To get there, we start with the obvious relationship that the probability of both events A and B occurring is commutative:

P(A ^ B) = P(B ^ A)

and

P(A ^ B) = P(A).P(B|A).

This can be graphically represented with a Karnaugh map as shown below:



  • The 4 boxes represent the 4 possible combinations of A and B being either true or false.
  • The 2 red boxes on the right represent the 2 combinations where A is true: P(A).
  • The bottom right box in the picture below represents the combinations where B is also true: P(B|A).P(A).

But similarly, by commutativity:

P(B ^ A) = P(B).P(A|B)

and hence we get to Bayes theorem:

P(A).P(B|A) = P(B).P(A|B)

which is frequently rearranged to:

P(B|A) = P(B).P(A|B)/P(A)
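To make the four boxes concrete, here is a small numerical check of the rearranged theorem. The joint probabilities are made-up values for illustration only; any four non-negative numbers summing to 1 (with P(A) and P(B) non-zero) would do.

```python
from fractions import Fraction

# Hypothetical joint probabilities for the four boxes, keyed by (A, B).
joint = {
    (True, True): Fraction(3, 10),
    (True, False): Fraction(1, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

p_a = sum(p for (a, b), p in joint.items() if a)  # boxes where A is true
p_b = sum(p for (a, b), p in joint.items() if b)  # boxes where B is true
p_a_and_b = joint[(True, True)]                   # box where both are true

p_b_given_a = p_a_and_b / p_a
p_a_given_b = p_a_and_b / p_b

# Bayes theorem in both of the forms given above.
assert p_a * p_b_given_a == p_b * p_a_given_b
assert p_b_given_a == p_b * p_a_given_b / p_a
```

Exact fractions are used so that the equalities hold exactly rather than to within floating-point rounding.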

3. What Does it Mean When a Girl Smiles at You Every Time She Sees You?

To provide an example of this, I will duplicate Mark Eichenlaub’s answer (in case it disappears from Quora) to the question:

What does it mean when a girl smiles at you every time she sees you?

His answer is as follows:

It’s simple. Just use Bayes’ theorem.

The probability she likes you is

P(like|smile) = P(smile|like).P(like)/P(smile)

P(like|smile) is what you want to know – the probability she likes you given the fact that she smiles at you.

P(smile|like) is the probability that she will smile given that she sees someone she likes.

P(like) is the probability that she likes a random person.

P(smile) is the probability that she will smile at a random person.

For example, suppose she just smiles at everyone. Then intuition says that fact that she smiles at you doesn’t mean anything one way or another. Indeed, P(smile|like) = 1 and P(smile)=1, and we have

P(like|smile) = P(like)

meaning that knowing that she smiles at you doesn’t change anything.

At the other extreme, suppose she smiles at everyone she likes, and only those she likes. Then P(smile) = P(like) and P(smile|like) = 1.  Then we have

P(like|smile) = 1

and she is certain to like you.

In the intermediate case, what you need to do is find the ratio of odds of smiling at people she likes to smiles in general, multiply by the percentage of people she likes, and there is your answer.

The more she smiles in general, the lower the chance she likes you. The more she smiles at people she likes, the better the chance. And of course the more people she likes, the better your chances are.

Of course, how to actually determine these values is a mystery I have never solved.
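The three cases in that answer can be run through the formula directly. The function name is hypothetical and the numbers are invented purely for illustration (as the answer itself says, determining real values is the hard part).

```python
from fractions import Fraction

def p_like_given_smile(p_smile_given_like, p_like, p_smile):
    """Bayes theorem: P(like|smile) = P(smile|like).P(like)/P(smile)."""
    return p_smile_given_like * p_like / p_smile

p_like = Fraction(1, 5)  # assumed: she likes 1 person in 5

# She smiles at everyone: the smile tells you nothing.
smiles_at_everyone = p_like_given_smile(Fraction(1), p_like, Fraction(1))

# She smiles at exactly those she likes: P(smile) = P(like).
smiles_only_if_likes = p_like_given_smile(Fraction(1), p_like, p_like)

# Intermediate case: smiles at 9 in 10 of those she likes, 1 in 2 overall.
intermediate = p_like_given_smile(Fraction(9, 10), p_like, Fraction(1, 2))
```

In the intermediate case the smile raises the probability from the prior of 1/5 to 9/25.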

4. Bayesian Inference

In the above example, we are wanting to see how justified we are in inferring a particular hypothesis (‘she likes me’) based on some particular evidence (‘she smiled at me’) and ‘Bayesian inference’ was used for this.

Generalizing this inference, Bayes theorem can be rearranged to:

P(H|E) ∝ P(E|H).P(H)

which is interpreted by the Bayesian (for whom probability represents knowledge) as:

posterior ← likelihood . prior

That is:

  • We start with a ‘prior probability’ degree of belief in a particular hypothesis, P(H).
  • New evidence, E, is presented.
  • We then calculate the new ‘posterior probability’, P(H|E), which is the degree of belief for the hypothesis H after taking into account the evidence E for and against that hypothesis H. This new degree of belief can be more or less than it was before, depending on the evidence.
  • The ‘conditional probability’ or ‘likelihood’, P(E|H), is the degree of belief in the evidence E, given that the hypothesis H is true.
  • P(E) is irrespective of the hypothesis and so can be ignored (the ‘=’ changes to a proportional-to ‘∝’).
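Ignoring P(E) works because it can always be recovered afterwards: multiply likelihood by prior for every competing hypothesis and rescale so that the results sum to 1. A minimal sketch, in which the function name, the hypothesis labels and the numbers are all hypothetical:

```python
from fractions import Fraction

def update(priors, likelihoods):
    """Posterior for each hypothesis: likelihood times prior, rescaled to sum to 1."""
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    total = sum(unnormalized.values())  # this total is P(E)
    return {h: value / total for h, value in unnormalized.items()}

# Prior degrees of belief in the two competing hypotheses.
priors = {"likes": Fraction(1, 5), "does not like": Fraction(4, 5)}
# Likelihood of the evidence (a smile) under each hypothesis.
p_smile_given = {"likes": Fraction(9, 10), "does not like": Fraction(2, 5)}

posteriors = update(priors, p_smile_given)
```

Here the degree of belief in ‘likes’ rises from 1/5 to 9/25 once the smile is taken into account.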

5. The Curious Incident of the Dog in the Night-Time

In the novel ‘The Curious Incident of the Dog in the Night-Time’, Christopher, the 15-year-old narrator, tells of the events following his discovery of a neighbour’s dog having been killed. Christopher has Asperger’s syndrome and is very mathematically-minded, which is apparent in his account. One mathematical excursion is into the ‘Monty Hall problem’ which he describes as follows:

You are on a game show on television. On this game show the idea is to win a car as a prize. The game show host shows you three doors. He says that there is a car behind one of the doors and there are goats behind the other two doors. He asks you to pick a door. You pick a door but the door is not opened. Then the game show host opens one of the doors you didn’t pick to show a goat (because he knows what is behind the doors). Then he says you have one final chance to change your mind before the doors are opened and you can get a car or goat. So he asks you if you want to change your mind and pick the other unopened door instead. What should you do?

(Note: Monty Hall was a game show host).

So, what do you think?


‘The Curious Incident of the Dog in the Night-Time’

6. The ‘Non-Mathematical’ Solution

Most people think it does not matter whether they stick or switch. But they would be wrong – it is actually better to switch.

In ‘The Curious Incident…’, Christopher provides two explanations of why. The first is Bayesian and I’ll come back to that later. But…

The second way you can work it out is by making a picture of all the possible outcomes like this

The decision tree solution

So, if you change, 2 times out of 3 you get a car. And if you stick, you only get a car 1 time out of 3.

And this shows that intuition can sometimes get things wrong. And intuition is what people use in life to make decisions. But logic can help you work out the right answer.

I think this decision tree approach (rather than the Bayesian way) is how most people consciously reconcile themselves to the solution. But the Monty Hall problem is not a problem because the solution is difficult to prove or anything like that. It is that people’s intuition doesn’t work here. Why is that?
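The picture of all possible outcomes can also be enumerated mechanically. This is a sketch only: it assumes you always pick door X (the labelling is arbitrary) and that, when the car is behind your door, the host opens either of the other two with equal probability.

```python
from fractions import Fraction

doors = ["X", "Y", "Z"]
pick = "X"
stick = Fraction(0)   # probability of winning by sticking
switch = Fraction(0)  # probability of winning by switching

for car in doors:                       # each car position has probability 1/3
    # The host may only open a door that is neither your pick nor the car.
    openable = [d for d in doors if d != pick and d != car]
    for opened in openable:
        prob = Fraction(1, 3) / len(openable)
        other = next(d for d in doors if d not in (pick, opened))
        if car == pick:
            stick += prob
        if car == other:
            switch += prob
```

Every branch of the tree ends in a win for exactly one of the two strategies, which is why the two totals sum to 1.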

7. Nothing Changes if You Stick

For the Bayesian solution, I will adopt Christopher’s notation:

Firstly you can do it by maths like this

Let the doors be called X, Y and Z

Let CX be the event that the car is behind door X and so on.

Let HX be the event that the host opens door X and so on.

The mapping of names to doors is arbitrary. Consistent with Christopher, let us just say that you choose door X and then Monty will open either door Y or door Z.

The normal approach from here on is to look at the probability of winning if you switch. I’ll come to that later on but first I want to look at things the other way around – to look at the probability of winning if you stick with door X. That’s because, as well as trying to provide an explanation that is as clear as possible, I also want it to help explain why people’s intuition is wrong.

So, consider the probability of winning the car if you stick with the door you first chose. Firstly in mathematical parlance:

P(win the car if you stick)

= P(HY ^ CX) + P(HZ ^ CX)

= P(CX).P(HY | CX) + P(CX).P(HZ | CX)

= (1/3 . p) + (1/3 . (1-p))    (where p = P(HY | CX))

= 1/3

And in a more wordy form:

P(winning the car if you stick)

= P(car is behind door X after the host opened door Y)

+ P(car is behind door X after the host opened door Z)

= P(car is behind door X) . P(host opens door Y if the car is behind door X)

+ P(car is behind door X) . P(host opens door Z if the car is behind door X)

= P(car is behind door X) . PK

where

PK = (P(host opens door Y if the car is behind door X)

+ P(host opens door Z if the car is behind door X))

Now, Monty has to choose between opening door Y or door Z. He is likely to choose with equal probability but his actual choice is irrelevant. In any case:

PK = 1

Hence simply:

P(winning the car if you stick)  = P(car is behind door X) = 1/3

The posterior probability is the same as the prior probability.

This is a mathematical reason why people intuitively feel that Monty opening one of the other doors doesn’t make any difference:

The new information hasn’t changed the probability of winning with the first-chosen door.
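As a quick check that Monty's own choice is irrelevant to this result, the sum can be evaluated for several values of p, the probability that he opens door Y when the car is behind door X. The function name is hypothetical.

```python
from fractions import Fraction

def p_win_if_stick(p):
    """P(HY ^ CX) + P(HZ ^ CX), with p = P(HY | CX) and 1 - p = P(HZ | CX)."""
    p_cx = Fraction(1, 3)
    return p_cx * p + p_cx * (1 - p)

# Try p = 0, 1/4, 1/2, 3/4 and 1.
results = [p_win_if_stick(Fraction(n, 4)) for n in range(5)]
```

Every entry comes out as 1/3, whatever the value of p.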

8. Everything Changes if You Switch

But things have changed. With one door opened to reveal a goat, there are two doors to choose from. We want to choose the door with the highest probability of winning. The mistake made is to think that the decision P(CZ) > P(CX) (if door Y was opened) will not have changed if one side of the inequality, P(CX), has not changed. We need to look at the other doors to see this.

Prior to opening any doors:

P(CY) = 1/3; P(CZ) = 1/3.

And after Monty has opened a door – let us suppose that it is door Y:

P(CY) = 0; P(CZ) = 2/3.

Where previously P(CZ) = P(CX), the new relationship is P(CZ) > P(CX) – it is worth switching.

And if we suppose that Monty opens door Z:

P(CY) = 2/3; P(CZ) = 0

and the new relationship is P(CY) > P(CX) so that it is worth switching.

In all cases, it is worth switching.

More thoroughly, Christopher’s solution in mathematical parlance is:

Supposing that you choose door X, the possibility that you win a car if you then switch your choice is given by the following formula:

P(HZ ^ CY) + P(HY ^ CZ)

= P(CY).P(HZ | CY) + P(CZ).P(HY | CZ)

= (1/3.1) + (1/3.1) = 2/3

And in more wordy form:

P(winning the car if you switch)

= P(car is behind door Y after the host opened door Z)

+ P(car is behind door Z after the host opened door Y)

= P(car is behind door Z) . P(host opens door Y if the car is behind door Z)

+ P(car is behind door Y) . P(host opens door Z if the car is behind door Y)

But Monty will always open door Y if the car is behind door Z and vice-versa. The two probabilities are both 1. Hence:

P(winning the car if you switch)

= P(car is behind door Z) + P(car is behind door Y)

= 2/3
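The same answer drops out of a direct Bayesian update on the evidence ‘Monty opened door Y’. This sketch assumes that you chose door X and that Monty picks between Y and Z with equal probability when the car is behind X; the function and variable names are hypothetical.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """posterior = likelihood . prior, rescaled so the posteriors sum to 1."""
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Prior: the car is equally likely to be behind each door.
priors = {"CX": Fraction(1, 3), "CY": Fraction(1, 3), "CZ": Fraction(1, 3)}

# Likelihood of the evidence HY (host opens door Y) under each hypothesis.
p_hy_given = {
    "CX": Fraction(1, 2),  # free choice between Y and Z (assumed 50:50)
    "CY": Fraction(0),     # he never reveals the car
    "CZ": Fraction(1),     # Y is the only door he can open
}

after_hy = posterior(priors, p_hy_given)
```

The belief in CX stays at 1/3, CY falls to 0 and CZ rises to 2/3, matching the values stated for the case where door Y is opened.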

9. Postscript: Are Birds Smarter Than Mathematicians?


An Erdős biography. See Chapter 6: ‘Getting the Goat’

Top of the list of distinguished mathematicians who ‘got it wrong’ was the most prolific mathematical genius of the 20th Century, Paul Erdős:

‘No, that is impossible. It should make no difference.’

Even the decision tree approach failed to convince him:

‘You are not telling me why to switch.’

He only accepted that switching really was the better strategy after seeing ‘Monte Carlo’ simulations (repeating the game many times with random responses and counting the wins/losses).

The wonderfully-titled paper ‘Are Birds Smarter Than Mathematicians? Pigeons (Columba livia) Perform Optimally on a Version of the Monty Hall Dilemma’ describes how experiments on pigeons and humans repeatedly ‘playing’ the Monty Hall game showed that the former quickly adapted their strategy to switching but that the humans clung to their intuitions. The authors say of Erdős:

‘Until he was able to approach the problem like a pigeon—via empirical probability—he was unable to embrace the optimal solution.’



This is the fourteenth part of the ‘From Neural Is to Moral Ought’ series. It examines and elaborates on a particular point in Patricia Smith Churchland’s ‘Braintrust: What Neuroscience Tells Us about Morality’.

58: The Braintrust Thesis


  • The idea has been introduced of a physical ‘agent’ acting and reacting in its environment, learning how to predict events in that environment so as to be able to act to further its goal (and that goal may be nothing more than self-preservation). The environment may well have other agents in it. These other agents will probably be the most difficult things to learn about because they are also learning to predict. Agents may be animal (and generally we are thinking of human), automata (robot) or alien.
  • Patricia Churchland introduced the idea that morality stems from the behaviour of neurotransmitters within the brain that influence pair bonding and the caring for offspring. She specifically looked at the neurotransmitters Oxytocin and Vasopressin.
  • The brain has been presented as a collection of processes with some sort of hierarchy and a small group of agents can be seen as an extension of this – a single thinking/predicting ‘super-agent’ acting against the environment, where the boundary between agent and environment gets shifted to that between the ‘in-group’ and the ‘out-group’.

One brain process competes against another to determine the actions of the agent but this competition is for the benefit of the whole agent. Similarly, what competition there is between in-group agents can be seen as for the benefit of the whole in-group.

The Thesis

Princeton University Press

‘Braintrust: What Neuroscience Tells Us About Morality’

To caricature Patricia Churchland’s ‘Braintrust’ thesis, it is essentially that the caring behaviour controlled by Oxytocin and Vasopressin is sufficient for the extension of care to a wider community. From this care of the immediate family members, we can get to the establishment of norms of behaviour for the general well-being of society’s members – i.e. to morals.

Beyond what we share with other mammals, plus those often-mentioned advantages we possess over other animals – our mental (practical problem solving), vocal and manual dexterity – there does not need to be anything else encoded in our genetic make-up to get us from ‘primitive’ living to modern society. And there could not be. Over the past 10,000 years, humans have progressed from hunter-gatherers in small groups to citizens of the ‘global village’ but this is too short a time for genetic evolution to have worked its magic to be an explanation for this development. (The ability to digest animal milk is one of the few adaptations over this timescale.)

59: Mothers and Others

The first step for extending care to wider society is to progress from the immediate to the extended family. And the most basic form of this is the raising of young by adults that are not their parents – ‘alloparenting’. Only 3% of mammal species partake in alloparenting (compared with 9% of birds) but our rodent friend the prairie vole is one of them. Oxytocin can go as far as enabling auntie meerkats to breastfeed (‘allolactation’).

Now, alloparenting might be explainable in genetic terms: caring for young relatives helps the survival of many of an individual’s genes. But it is something that doesn’t need genetic explanation – it can come for free following the action of Oxytocin/Vasopressin in the normal caring of immediate young. We do not need genes to favour the care of others over and above those of our offspring; we just need genes that favour the young that:

  • look/sound/smell like us (or rather, like our carers and siblings), and
  • are in the local environment.

60: On Aggression and Cooperation

Hierarchy in Animal Societies

Looking at groups of animals beyond the immediate family, we see that they have a hierarchical structure. Packs of predators and herds of their prey will have a pecking order which may be based on birth order but is generally determined by the ability to dominate – which comes down to size and strength.

Having a pecking order – from the alpha (alpha male, female and/or pair) down to the omega – reduces fighting within a group and hence reduces injuries which would disadvantage the whole group. This is best for the long-term survival of the group. In-fighting is most likely to arise:

  • When resources are limited, with not enough food to go around. It is the omegas that starve.
  • In competition for food or a mate. It is the alpha that gets first pickings, and sometimes all the pickings.

Hierarchy and Stress

The presence of a pecking order may lessen overall violence, but only moderately so. A lessening of hierarchical control can make low-ranking individuals less miserable. Robert Sapolsky has studied baboons in the wild at length and says that:

“their primary source of stress, like those of humans in modern society, is psychological rather than physical. Food is plentiful … Predators are few … With the luxury of plentiful resources and free time, the animals can devote themselves to distressing one another.”

“Violence itself is actually rare, but the hint of violence is ever present.”

“The animals who occupy the more subordinate positions are filled with a stressful lack of both control and predictability.”

But life at the top is also filled with stress – the stress of the ever watchful fear of being attacked (sometimes fatally). None of this is good for overall well-being.


Grooming for fleas has obvious practical benefits. But grooming is also therapeutic for both the groomer and the groomee – the act of grooming helps relaxation. Heart rate is reduced. Stress is reduced.

In the hierarchical society beyond the familial environment, grooming is asymmetric: the low-ranking groom the higher-ranking far more than vice versa. But the benefit is mutual:

  • The lower-ranking are on the receiving end of less violence from higher-up, and
  • The higher-ranking build up trust which can be useful later on.

In both cases, an agent establishes a reputation – a predictable dependability.

Building Trust

Beyond grooming, more advanced means for building trust are to make oneself vulnerable to another:

  • Allowing another to suck your fingers,
  • Allowing another to finger your eyes, and conjecturally
  • Allowing another to hold your testicles!

That last one has been observed both:

  • For an alpha male letting another hold its testicles to build trust, and
  • Between ‘bachelor’ baboons, building up trust ahead of an attack on the alpha male (see below).

Game Theories

For a coalition attack on the alpha:

  • There is a high cost to an individual if they are betrayed by the others – the injuries inflicted from getting beaten up by the alpha male.
  • There is reward if the attack is successful – advancement up the pecking order. Success is likely if the alpha is confronted by an overwhelming force that he cannot hope to defend himself against. In the video, above, the alpha tries to get support from others (in this case, his harem) but it is not forthcoming and he flees. Power is usurped without injury.
  • No coalition against the alpha is neutral. It is the status quo.

This real-world case is almost identical to the classic abstract game theory problem of the ‘prisoner’s dilemma’. Mapping the two together:

  • If prisoners A and B cooperate, there is a reasonable chance of usurping the alpha, at some personal risk of injury to A and B.
  • If prisoner A betrays B, B suffers, at no personal risk to A.
  • If neither A nor B cooperates, there is continued domination by alpha.


For conscious cooperation, each agent needs to build a model of the other, to try to predict what the other will do – to imagine their future behaviour. Agent A will cooperate with B to perform task X because:

  • A wants the benefit of task X.
  • A predicts that B wants the benefit of task X.
  • A predicts that B will predict that A wants to benefit from task X.

But, as well as recognizing the opportunity and mutual desire to cooperate, cooperation must be initiated somehow.

  • A predicts that doing Y will make B predict that A wants to benefit from task X.

This requires rather sophisticated cognitive capabilities. The video (above) shows a well-known experiment from the 1930s of cooperating chimpanzees. Note that the ‘cooperative’ activity is driven by one chimpanzee coercing the other (the hungrier coercing the less hungry).

But cooperation is common within the animal kingdom. Deliberation by high-level processes (‘conscious’ deliberation?) is not required. Some strategies are so simple that they can evolve in low-level processes (i.e. in simple agents or at the ‘emotional’ level in more complex ones). For example, ‘Tit-for-tat’ is a simple but effective game theory strategy – with actions based only on the most recent behaviour:

  • If the other cooperated last time, cooperate this time.
  • If the other was uncooperative last time, do not cooperate this time.

After a few iterations of cooperation-versus-non-cooperation decisions, the behaviour will have settled into either permanent cooperation with cooperative agents or permanent non-cooperation with uncooperative agents. There is no conscious need to recognize either the benefits of cooperating or the punishment of non-cooperation.
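To see how quickly this settling happens, here is a minimal sketch of tit-for-tat playing against two fixed opponents. The strategy functions and `play_rounds` are hypothetical names, ‘C’ and ‘D’ stand for cooperate and defect, and the opening move (cooperate when there is no history yet) is an assumption, since the two rules above do not cover the first encounter.

```python
def tit_for_tat(opponent_history):
    # Assumed opening move: cooperate when there is no history yet.
    if not opponent_history:
        return "C"
    # Otherwise copy whatever the other agent did last time.
    return opponent_history[-1]

def always_cooperate(opponent_history):
    return "C"

def always_defect(opponent_history):
    return "D"

def play_rounds(strategy_a, strategy_b, rounds=6):
    """Return the sequence of moves made by each agent as a string."""
    moves_a, moves_b = [], []
    for _ in range(rounds):
        # Each agent sees only the other's past moves.
        a = strategy_a(moves_b)
        b = strategy_b(moves_a)
        moves_a.append(a)
        moves_b.append(b)
    return "".join(moves_a), "".join(moves_b)

print(play_rounds(tit_for_tat, always_cooperate))  # ('CCCCCC', 'CCCCCC')
print(play_rounds(tit_for_tat, always_defect))     # ('CDDDDD', 'DDDDDD')
```

Nothing in `tit_for_tat` weighs up rewards or punishments. It remembers a single move, and yet against these two opponents it locks into cooperation from the start or into non-cooperation after one round.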

Non-cooperation and Punishment

Where the social cognition of agents is higher, more elaborate cooperation is possible. But there is still no need for the explicit punishment or retribution for transgressions.

Chimpanzee alphas get to their position as a result of social abilities as much as strength – on their ability to build alliances, both among the males (to help defend their position against challengers) and the females (to prevent them from deserting). But this argument about alliance-building applies all the way down the hierarchy. A group in which agents are continually cooperating to help them assert dominance (solely for their own ends) self-organizes into a hierarchy.

Those that have, in ‘Prisoner’s Dilemma’ terms, ‘defected’ are no longer trustworthy. Their presence is a source of anxiety for conformers. There is no need for a ‘sense of justice’ among those betrayed but there can be ‘pre-emption’ via the mechanisms of ‘the minimization of surprise through action and perception’: future surprises can be reduced by dominating the transgressor (possibly through an alliance with others). Those that fail to ‘do the right thing’ suffer the consequences in terms of loss of position and reputation.

So, social transgressions hardly need to be major for ‘justice’ to be administered by a mob. Not engaging in cooperative practices such as grooming makes a loner in the group different, like an outsider, and hence raises anxiety.

Punishment as severe as the infliction of mortal wounds within a group is rare among chimpanzees but the video (above) shows one example. The victim failed to play the social game of engaging with others and forming alliances.

61: Institutional Trust

Cultural Serendipity

Robert Sapolsky has reported how the alpha males of a troop of baboons attacked another troop and stole their food. But the food was contaminated and they died from tuberculosis. Without the alphas, the troop became more peaceful with fewer confrontations. New adolescent males joining from neighbouring troops (to find a mate; this is the opposite way around to chimpanzees) fitted in with the new culture (fitted into ‘the way we do things around here’). A random event developed a culture which has now persisted for over 20 years.

Culture progresses as a result of a series of fortunate events and/or the evolution of institutions. Trust in a society can take centuries to grow, yet it can be destroyed very quickly. Institutions cannot just be installed into a society or reinstated if lost. They are like living organisms.


So far, I have looked at how the ‘social network’ of animals (including ourselves) has extended from a pair-bond caring for its offspring, through the alloparenting of the extended family to the establishment of herds and packs. But these herds/packs are highly hierarchical. There is the need to remember the status of everyone else. Even with simple cooperation strategies like ‘tit-for-tat’ there is the need to remember the most-recent history of experience with everyone else in the group.

Better strategies can expand this further, but the size of the pack is ultimately limited by the cognitive abilities of the individuals (see ‘Dunbar’s number’). Male strangers (i.e. from outside the group) have no reputation that comes with them and are therefore likely to represent a danger.

Within a group, smaller bands can form as a (generally temporary) coalition for some end, perhaps:

  • The alpha and some betas patrolling the group’s territorial boundary, possibly attacking and even killing solitary males they may encounter (adolescent females leave their own group and are accepted into a new group),
  • Hunting prey (performing various roles as part of herding prey into an ambush), or
  • Some betas ganging up against the alpha.

(Aside: All these coalitions can be seen in terms of ‘pre-empting surprise through action and perception’ and they reduce anxiety.)

In human populations, these bands (or ‘gangs’) can be longer lasting. Within a gang, individuals behave like one another. Hence: the reputation of one band member can be inferred from that already known about one or more of the others. This enables a community to expand beyond the Dunbar number. There is reputation by affiliation. Institutions can evolve. Strangers can affiliate to an institution to gain a reputation. It is important for its members that the institution maintain its reputation. Hence it must sanction members who transgress. Outsiders can then cooperate with members with some degree of confidence. And it works the other way around too. Trade is a very important institution; individuals from communities that trade with neighbours come into contact with strangers more frequently than those from more isolated communities and are more likely to be accepting of those strangers. There is a virtuous circle.


“trust is not so much a relation between the individuals engaged in the transaction as vested in the institution that has established itself as trustworthy.”

Modern humans do not partake in the finger-sucking or testicle holding of our ape cousins in order to build trust and do not (generally) attack strangers they encounter. Yet they make themselves vulnerable to strangers with ease – such as when allowing strangers to drill and cut them – on dental and hospital visits.

We are no longer playing tit-for-tat – we do not need to punish ‘defectors’ directly because we know that the institution will do this for us, in the interests of all its other members. Ultimately, we end up with judicial and political institutions.

There is a huge amount of detail that I have glossed over here in the expansion of the social network from about 100 of the other great apes to that of humans, approaching 10 billion. This expansion of trust allows me to travel to communities and engage with strangers there in countries half way around the world without particular fear for my safety (alas, not all countries).

62: Ethics, Ethology and Ethnography

Churchland’s ‘Braintrust’ thesis is that the effects of the Oxytocin and Vasopressin neurotransmitters in the brain, namely long-term pair bonding and care of offspring, are among the most significant factors in the development of morality. From this establishment of the family, larger and larger groupings can develop – a progression from caring individuals to trusting societies:

  1. The caring of offspring by the mother.
  2. The long-term pair bonding – the father staying around.
  3. The caring of young by close kin e.g. aunties: ‘alloparenting’.
  4. The evolution of an enlarged community with hierarchy, trading violence for anxiety.
  5. The reciprocity on minor tasks building up trust for larger cooperative tasks e.g. grooming and cooperative hunting.
  6. The development of a local ‘culture’ of ‘the way we do things around here’, passed down from one generation to the next.
  7. The punishment of those that do not conform to the established culture.
  8. The development of institutions which permit the growth of trust with others.
  9. The foundation of large advanced civilizations in which there is general trust between strangers within society.

It is debatable at which stage morality becomes possible and any morality at an earlier stage will take a different form from that of the last. The step from stage 7 to 8 was a big leap in the story and is the step that has been uniquely human. But since we are animals, we should still look to prior stages in our development as an indication of our pre-moral ‘state of nature’ which needs to be taken into account (overcome?) in any moral system.

The step from stage 8 to 9 has taken place with negligible genetic development. We are of the same physical construction as 10,000 years ago. And this has been done in spite of seemingly counter-acting prior evolutionary developments such as hierarchical aggression (stage 4) which does not align with our understanding of what is moral. Hierarchy still plays a large role within our modern morality. And since there is nothing in our genes to ensure it, there is nothing guaranteed about this emergence of large societies. It is cultural and culture is easily destroyed.

At the start of this progression are animals hiding in solitary safe-havens (of the ‘dark room’), from which all excursions are fraught with danger. At the end is modern human society, with a vastly expanded domain of predictability and lower anxiety. This cursory journey through

  • Ethology: the study of the behaviour of animals (especially in a social context), and
  • Ethnography: the study of human cultures.

has been a dry, ‘mechanical’ account. There has been no mention of another ‘E’: Empathy. There was the (Oxytocin-induced) care within the immediate family at the start but there has been nothing emotive since then.

Empathy is what I turn to next.
