### The Fall of Artificial Neural Networks: XOR gates

In the 1969 book *‘Perceptrons: an introduction to computational geometry’**,* Marvin Minsky and Seymour Papert demonstrated that single-layer Artificial Neural Networks could not even implement an XOR (‘exclusive or’) logical function. This was a big disappointment. In the history of Artificial Neural Networks, this is seen as a significant contributor to the ‘AI winter’ of reduced interest (and hence also of reduced funding) in them.

### The Rise of Artificial Neural Networks: Back-Propagation

The backpropagation algorithm effectively solved the exclusive-or problem in that:

- To implement XORs required one or more hidden layers in the network (between the inputs and the output layer).
- The backpropagation algorithm enabled multi-layer networks to be trained.

This contributed to a resurgence of interest in Artificial Neural Networks. Backpropagation was invented independently a number of times, most notably by Paul Werbos (1974), Rumelhart, Hinton and Williams (1986) and Yann LeCun (1987).

Watch Victor Lavrenko’s Youtube for more technical details on the XOR problem…

### The Backpropagation Code

The purpose of this post is to provide example code for the backpropagation algorithm and demonstrate that it can be solve the XOR problem.

As noted elsewhere:

- The code here is unapologetically ‘unpythonic’.
- If you do not have a Python application installed, you can open the online https://repl.it/languages/python3interpreter in a new window and use that. All code fragments are combined at the end of this piece into a single listing that can be copy-pasted into the interpreter.

As well as being unpythonic, the code here differs from typical implementations in that it can handle more than 2 layers. The code can be configured for any full-connected feed-forward network of any number of layers greater than 1 and any number of neurons for each layer.

### Some Housekeeping

Firstly, let’s sort out some housekeeping. Here are 2 functions so that:

- We can pause the run to see things before they disappear off the top of the screen. (We can stop if we type ‘n’)
- We can control how much information gets printed out by varying a ‘verbosity’ variable value.

def prompted_pause(s): import sys ok = input(s) if ok=="n" or ok=="N": print("Stopping here") sys.exit() verbosity = 1 def print_info(v,s, end="DEFAULT"): if verbosity >= v: if end == "DEFAULT": print(s) # With newline elif end == "": print(s, end="") # Without newline else: print(s, end)

Where there is a call to **print_info(3, “Blah”)**, the **3** means that the message **“Blah”** will only get printed out if the verbosity level is 3 or more. Across the whole program below, verbosity levels are such that:

- If
**verbosity**is set to 1, it will only print out the minimal. - If
**verbosity**is set to 2, it will only print out more. - If
**verbosity**is set to 3, it will only print out the minimal.

### The Application

The neural network will be trained to behave like a ‘full adder’ circuit. This is a common building block in digital electronic circuits. It adds up three 1-bit numbers to produces a 2-bit output number (range 0…3). The ‘CI’ and ‘CO’ signals are the carry-in and carry-out respectively. As an example application, by chaining 32 of these circuits together (connecting the CO output of one full adder to the CI input of another) we get a circuit that adds two 32-bit numbers together.

The full adder has been chosen because:

- It contains at least one XOR gate (it has 2), to demonstrate that a multilayer network can learn this non-linearly-separable behaviour, and
- It has more than one output (it has 2), to provide Python code that is more generalised.

This is not a good example of what a neural network could be used for. Here, there are only 8 possible combinations of inputs. Any 3-in 2-out (combinatorial) function can be defined with just 16 bits of information.

The training set is the same as the test set and just defines the LINK truth table of a full adder. After training, the network will be tested against all the 8 possible input combinations. A more appropriate application is where then number of possible input combinations is much greater than the number of vectors it can be trained and tested against.

# Full Adder example: Training_Set = [ # A B CI S CO [[0, 0, 0], [0, 0]], [[0, 0, 1], [0, 1]], [[0, 1, 0], [0, 1]], [[0, 1, 1], [1, 0]], [[1, 0, 0], [0, 1]], [[1, 0, 1], [1, 0]], [[1, 1, 0], [1, 0]], [[1, 1, 1], [1, 1]] ] # Bit assignments... SUM = 1 CARRY = 0

For example, there are 2 bits set to 1 in the input **[0, 1, 1]** so the sum is 2 which is binary ‘10’, so the output vector is **[1, 0]**.

### A Neuron

We now define the code for a single neuron. For each neuron, we need to have:

- A list of weights, one for each neuron input (from the layer below).
- A bias – this behaves in the same way as a weight except the input is a constant ‘1’.
- A gradient
**∂E/∂z**. This is used for training the network.

The **FeedForward_Neuron** function calculates a neuron’s output **y**, based on its inputs **x**, and its bias **b**. A sum-of-products is formed:

**z = Σw _{i}.x_{i} + b**

and the output y is derived from that using the **Sigmoid** function:

**y = σ(z)**

The **Sigmoid** function provides the non-linearity that allows the network to learn non-linear relationships such as the ‘XOR’ function (compare this with the simple, linear network in ‘Fish, Chips, Ketchup’).

import math class Neuron: def __init__(self, bias): self.B = bias self.W = [] self.dEdz = 0.0 """ The logistic function """ def Sigmoid(z): return 1 / (1 + math.exp(-z)) """ Generate neuron output from inputs""" def FeedForward_Neuron(inputs, bias, weights): z = bias for i in range(len(inputs)): z += inputs[i] * weights[i] return Sigmoid(z)

We start with the list of weights being empty; we will fill these in as we build up the network of neurons.

As in previous posts, the code is not Pythonic. Here, there are no ‘def’ functions (‘methods’) defined within any **class**. All functions are outside which means they require **all** the information used to be passed as parameters to the function. This is to make it clear what the dependencies are. Python examples of back-propagation available elsewhere on the interweb will used classes properly and use vector operations where I have used **for** loops.

### A Layer of Neurons

A neuronal layer is then just an array of neurons. The biases and weights of the neurons all get initialized to random values before any training is done.

Updating a neuronal layer is just updating each neuron in turn.

import random class NeuronLayer: def __init__(self, num_neurons, num_inputs): self.Neuron = [] # Build up a list of neurons for n in range(0, num_neurons): print_info(3," Neuron[%d]" % (n)) # Add a neuron to the layer, with a random bias self.Neuron.append(Neuron(random.random())) print_info(3," Bias = %.3f" % self.Neuron[n].B) # Give it random weights for i in range(0, num_inputs): self.Neuron[n].W.append(random.random()) # Initialized randomly print_info(3," Weight[%d] = %.3f" % (i, self.Neuron[n].W[i])) def FeedForward_Layer(inputs, layer): outputs = [] for neuron in layer.Neuron: neuron.X = inputs y = FeedForward_Neuron(neuron.X, neuron.B, neuron.W) neuron.Y = y outputs.append(y) return outputs

### The Neural Network

A complete multilayer network can then be created, with:

- a particular number of inputs and outputs,
- a particular number of layers
- a particular number of neurons in each layer.

And we can use this network by feeding it input signals which propagate up through the layers to then return the outputs.

The application calls for 3 inputs and 2 outputs. Typically, the number of layers is 2 but you can configure for more than this (for so-called ‘deep’ networks). Here, we configure the network as follows:

num_inputs = 3 num_outputs = 2 num_neurons_in_layer = [4, num_outputs] # num. neurons in each layer from inputs up to output # The num. neurons in the top (output) layer is the same as the num. output ports output_layer = len(num_neurons_in_layer)-1 # Layer number

The **num_neurons_in_layer** variable defines the number of layers as well as the number of neurons in each layer. You can experiment with the number of neurons.

To actually create the network, we use:

Net = [] for L in range(len(num_neurons_in_layer)): if L==0: # Input layer i = num_inputs else: i = num_neurons_in_layer[L-1] # (Fully connected to lower layer) print_info(1, "Create layer %d with %d neurons and %d inputs" % (L, num_neurons_in_layer[L], i)) Net.append(NeuronLayer(num_neurons = num_neurons_in_layer[L], num_inputs = i))

### Feed-Forward

For actual usage, we just apply the inputs then update each layer in turn from the input layer **forward** to the output layer.

def FeedForward_Net(inputs, Net): for L in range(len(Net)): # Up through all layers print_info(3, " Feed-Forward layer Net[%d]" % L) if L==0: y = FeedForward_Layer(inputs, Net[L]) else: y = FeedForward_Layer(y, Net[L]) return y

### Testing

In the trivial example here, we test the network by applying all input combinations, as defined in the truth table training set.

def Test_Network(Net, Training_Set): print("Test Network:") for i in range(8): Training_Input, Training_Output = Training_Set[i] print(" %d+%d+%d" % (Training_Input[0], Training_Input[1], Training_Input[2]), end="") result = FeedForward_Net(Training_Input, Net) rounded_result = [round(result[0]), round(result[1])] print(" = %d%d" % (rounded_result[CARRY], rounded_result[SUM]), end="") print(" (%.3f, %.3f)" % ( result[CARRY], result[SUM]), end="") if rounded_result == Training_Output: print(" correct") else: print(" bad")

Not surprisingly, the network does not behave as desired before it is trained. There will be just a 50:50 chance that the output will be correct.

Test_Network(Net, Training_Set)

### Training with Back-Propagation

Now we want to train the network to behave according to the application (in this case, to behave like a full adder circuit). We train using the ‘back-propagation’ algorithm. This involves:

- Applying the inputs for a particular training set and propagate these forward to produce the outputs.
- Seeing how the outputs differ from what you want them to be (what the training set outputs say they should be). The mismatch is called the ‘error’
**E**. - For each neuron in the output layer, working out what change to the signal
**z**(remember:**z = Σw**and_{i}.x_{i}+ b**y = σ(z)**) would be needed to make the output correct (i.e. make**E**=0). This is**∂E/∂z**, the ‘*partial derivative*’ of the error with respect to**z**. - For each layer working from that output layer
**back**to the input layer, repeat the above operation for each neuron. Setting**∂E/∂z**will require using the weights and**∂E/∂z**values of the neurons of the higher layers. (We are**propagating**the error derivatives back through the layers.) - We update the weights of each neuron by deriving a partial derivative of the error with respect to the weight
**∂E/∂w**(derived from the**∂E/∂z**values already calculated). We adjust each weight by a small fraction of this change, determined by the*‘learning rate*‘)**ε**so that a weight**w**is changed to become**w+****ε**.**∂E/∂w**. We do the same with the biases**∂E/∂b**.

We perform the above operations for each item in the training set in turn. Over many iterations, the weights converge on values that produce the desired behaviour (hopefully); this is called ‘gradient descent’.

As an example, consider how to modify the weight **w _{12}** that connects neuron

**n**in layer 1 to a neuron

_{1}**n**in layer 2 where this is in a 3-layer network. The error derivatives

_{2}**∂E/∂z**in the higher layer, 3, have already been calculated.

_{3}The error derivative for **n _{2}** is calculated using the ‘chain rule’, multiplying the derivatives of everything along the signal path:

**∂E/∂z _{2}** =

**∂E/∂z**.

_{3}. ∂z_{3}/∂y_{2}. ∂y_{2}/∂z_{2}This result is used for both:

- Continuing to propagate back to lower layers (in this case, just layer 1), and
- Calculating the derivative for the weight adjustment:

**∂E/∂w _{12}** =

**∂E/∂z**.

_{2}. ∂z_{2}/∂w_{12}For more details, I recommend Matt Mazur’s…excellent worked-out example and Pythonic Python code.

def calc_dEdy(target, output): return -(target - output) # Derivative of the sigmoid function: def dydz(y): return y * (1 - y) def calc_dEdz(target, output): # Are these vectors or scalars? return calc_dEdy(target, output) * dydz(output); def dzdw(x): return x # z=sum(w[i].x[i]) therefore dz/dw[i]=x[i] LEARNING_RATE = 0.5 # Often denoted by some Greek letter - often epsilon # Uses 'online' learning, ie updating the weights after each training case def Train_Net(Net, training_inputs, training_outputs): # 0. Feed-forward to generate outputs print_info(2," Feed-forward") FeedForward_Net(training_inputs, Net) for L in reversed(range(len(Net))): # Back through all layers print_info(2," Back-prop layer Net[%d]" % (L)) if L == output_layer: # Output layer # 1. Back-propagation: Set Output layer neuron dEdz for o in range(len(Net[L].Neuron)): # For each output layer neuron print_info(3," Back-prop Net[%d].Neuron[%d]" % (L, o)) print_info(3," %d" % (training_outputs[o])) print_info(3," calc_dEdz(%.3f, %.3f)" % (training_outputs[o], Net[L].Neuron[o].Y)) Net[L].Neuron[o].dEdz = calc_dEdz(training_outputs[o], Net[L].Neuron[o].Y) else: # 2. Back-propagation: Set Hidden layer neuron dE/dz = Sum dE/dz * dz/dy = Sum dE/dz * wih for h in range(len(Net[L].Neuron)): print_info(3," Back-prop Net[%d].Neuron[%d]" % (L, h)) dEdy = 0 for output_neuron in range(len(Net[L+1].Neuron)): dEdy += Net[L+1].Neuron[output_neuron].dEdz * Net[L+1].Neuron[output_neuron].W[h] Net[L].Neuron[h].dEdz = dEdy * dydz(Net[L].Neuron[h].Y) # 3. Update output layer neuron biases and weights: dE/dw = dE/dz * dz/dw for L in range(len(Net)): # Up through all layers print_info(2," Update weights in layer Net[%d]" % (L)) for n in range(len(Net[L].Neuron)): dEdb = Net[L].Neuron[n].dEdz * 1.0 # dE/db = dE/dz * dz/db; dz/db=1 (the bias is like a weight with a constant input of 1) Net[L].Neuron[n].B -= LEARNING_RATE * dEdb # db = epsilon * dE/db for w in range(len(Net[L].Neuron[n].W)): dEdw = Net[L].Neuron[n].dEdz * dzdw(Net[L].Neuron[n].X[w]) Net[L].Neuron[n].W[w] -= LEARNING_RATE * dEdw # dw = epsilon * dE/dw

We train the network until it is ‘good enough’. For that, we need a measure of how good (or how bad) the network is performing whilst we are training. That measure is derived by the **Total_Error** function. In this simple example, there are only 8 possible combinations of inputs so, at each training round, a training vector is randomly selected from the 8.

""" For reporting progress (to see if it working, or how well it is learning)""" def Total_Error(Net, training_sets): Etotal = 0 """ Use the first 8 training vectors as the validation set """ num_validation_vectors = 8 # There are only 8 vectors in the Full-Adder example for t in range(num_validation_vectors): training_inputs, training_outputs = training_sets[t] FeedForward_Net(training_inputs, Net) Etotal += 0.5*(training_outputs[0] - Net[output_layer].Neuron[0].Y)**2 Etotal += 0.5*(training_outputs[1] - Net[output_layer].Neuron[1].Y)**2 return Etotal Etotal = 0.0 for i in range(0, 100000): Training_Input, Training_Output = random.choice(Training_Set) if i%100==99: print_info(1,"Training iteration %d" % i, end="") print_info(3," %d+%d+%d" % (Training_Input[0], Training_Input[1], Training_Input[2]), end="") print_info(3," = %d%d" % (Training_Output[CARRY], Training_Output[SUM]), end="") print_info(1,"") Train_Net(Net, Training_Input, Training_Output) if i%100==99: Etotal = Total_Error(Net, Training_Set) print_info(1," Validation E = %.3f" % Etotal) if Etotal < 0.02: break

### Testing the Trained Network

Then we test the network again to see how well it has been trained.

Test_Network(Net, Training_Set)

With the error threshold to stop training fixed at 0.02, you can experiment with changing the size and depth of the network and seeing how many training iterations it takes to get to that error threshold.

An example output is given below – the beginning and end at least…

######################################### Create Neural Net ######################################### Create layer 0 with 4 neurons and 3 inputs Create layer 1 with 2 neurons and 4 inputs ######################################### Testing ######################################### Continue?y Test Network: 0+0+0 = 11 (0.834, 0.829) bad 0+0+1 = 11 (0.864, 0.851) bad 0+1+0 = 11 (0.877, 0.874) bad 0+1+1 = 11 (0.893, 0.886) bad 1+0+0 = 11 (0.857, 0.847) bad 1+0+1 = 11 (0.880, 0.864) bad 1+1+0 = 11 (0.890, 0.884) bad 1+1+1 = 11 (0.901, 0.893) correct ######################################### Training ######################################### Continue?y Training iteration 99 Validation E = 1.972 Training iteration 199 Validation E = 1.881 Training iteration 299 Validation E = 2.123 … Training iteration 10099 Validation E = 0.020 Training iteration 10199 Validation E = 0.020 Test Network: 0+0+0 = 00 (0.014, 0.080) correct 0+0+1 = 01 (0.029, 0.942) correct 0+1+0 = 01 (0.028, 0.946) correct 0+1+1 = 10 (0.972, 0.065) correct 1+0+0 = 01 (0.029, 0.936) correct 1+0+1 = 10 (0.980, 0.062) correct 1+1+0 = 10 (0.976, 0.061) correct 1+1+1 = 11 (0.997, 0.922) correct

### All together

Piecing all this code together so we have a single file to run…

print("#########################################") print("Reporting/control") print("#########################################") def prompted_pause(s): import sys ok = input(s) if ok=="n" or ok=="N": print("Stopping here") sys.exit() verbosity = 1 def print_info(v,s, end="DEFAULT"): if verbosity >= v: if end == "DEFAULT": print(s) # With newline elif end == "": print(s, end="") # Without newline else: print(s, end) """ ######################################### Application: Full adder ######################################### """ # Full Adder example: Training_Set = [ # A B CI S CO [[0, 0, 0], [0, 0]], [[0, 0, 1], [0, 1]], [[0, 1, 0], [0, 1]], [[0, 1, 1], [1, 0]], [[1, 0, 0], [0, 1]], [[1, 0, 1], [1, 0]], [[1, 1, 0], [1, 0]], [[1, 1, 1], [1, 1]] ] # Bit assignments... SUM = 1 CARRY = 0 print("#########################################") print("Create Neural Net") print("#########################################") import math class Neuron: def __init__(self, bias): self.B = bias self.W = [] self.dEdz = 0.0 """ The logistic function """ def Sigmoid(z): return 1 / (1 + math.exp(-z)) """ Generate neuron output from inputs""" def FeedForward_Neuron(inputs, bias, weights): z = bias for i in range(len(inputs)): z += inputs[i] * weights[i] return Sigmoid(z) import random class NeuronLayer: def __init__(self, num_neurons, num_inputs): self.Neuron = [] # Build up a list of neurons for n in range(0, num_neurons): print_info(3," Neuron[%d]" % (n)) # Add a neuron to the layer, with a random bias self.Neuron.append(Neuron(random.random())) print_info(3," Bias = %.3f" % self.Neuron[n].B) # Give it random weights for i in range(0, num_inputs): self.Neuron[n].W.append(random.random()) # Initialized randomly print_info(3," Weight[%d] = %.3f" % (i, self.Neuron[n].W[i])) def FeedForward_Layer(inputs, layer): outputs = [] for neuron in layer.Neuron: neuron.X = inputs y = FeedForward_Neuron(neuron.X, neuron.B, neuron.W) neuron.Y = y outputs.append(y) return outputs """ A complete multilayer network can then be created, """ # Configuration... num_inputs = 3 num_outputs = 2 num_neurons_in_layer = [4, num_outputs] # num. neurons in each layer from inputs up to output # The num. neurons in the top (output) layer is the same as the num. output ports output_layer = len(num_neurons_in_layer)-1 # Layer number Net = [] for L in range(len(num_neurons_in_layer)): if L==0: # Input layer i = num_inputs else: i = num_neurons_in_layer[L-1] # (Fully connected to lower layer) print_info(1, "Create layer %d with %d neurons and %d inputs" % (L, num_neurons_in_layer[L], i)) Net.append(NeuronLayer(num_neurons = num_neurons_in_layer[L], num_inputs = i)) def FeedForward_Net(inputs, Net): for L in range(len(Net)): # Up through all layers print_info(3, " Feed-Forward layer Net[%d]" % L) if L==0: y = FeedForward_Layer(inputs, Net[L]) else: y = FeedForward_Layer(y, Net[L]) return y print("#########################################") print("Testing") print("#########################################") prompted_pause("Continue?") def Test_Network(Net, Training_Set): print("Test Network:") for i in range(8): Training_Input, Training_Output = Training_Set[i] print(" %d+%d+%d" % (Training_Input[0], Training_Input[1], Training_Input[2]), end="") result = FeedForward_Net(Training_Input, Net) rounded_result = [round(result[0]), round(result[1])] print(" = %d%d" % (rounded_result[CARRY], rounded_result[SUM]), end="") print(" (%.3f, %.3f)" % ( result[CARRY], result[SUM]), end="") if rounded_result == Training_Output: print(" correct") else: print(" bad") Test_Network(Net, Training_Set) print("#########################################") print("Training") print("#########################################") prompted_pause("Continue?") def calc_dEdy(target, output): return -(target - output) # Derivative of the sigmoid function: def dydz(y): return y * (1 - y) def calc_dEdz(target, output): # Are these vectors or scalars? return calc_dEdy(target, output) * dydz(output); def dzdw(x): return x # z=sum(w[i].x[i]) therefore dz/dw[i]=x[i] LEARNING_RATE = 0.5 # Often denoted by some Greek letter - often epsilon # Uses 'online' learning, ie updating the weights after each training case def Train_Net(Net, training_inputs, training_outputs): # 0. Feed-forward to generate outputs print_info(2," Feed-forward") FeedForward_Net(training_inputs, Net) for L in reversed(range(len(Net))): # Back through all layers print_info(2," Back-prop layer Net[%d]" % (L)) if L == output_layer: # Output layer # 1. Back-propagation: Set Output layer neuron dEdz for o in range(len(Net[L].Neuron)): # For each output layer neuron print_info(3," Back-prop Net[%d].Neuron[%d]" % (L, o)) print_info(3," %d" % (training_outputs[o])) print_info(3," calc_dEdz(%.3f, %.3f)" % (training_outputs[o], Net[L].Neuron[o].Y)) Net[L].Neuron[o].dEdz = calc_dEdz(training_outputs[o], Net[L].Neuron[o].Y) else: # 2. Back-propagation: Set Hidden layer neuron dE/dz = Sum dE/dz * dz/dy = Sum dE/dz * wih for h in range(len(Net[L].Neuron)): print_info(3," Back-prop Net[%d].Neuron[%d]" % (L, h)) dEdy = 0 for output_neuron in range(len(Net[L+1].Neuron)): dEdy += Net[L+1].Neuron[output_neuron].dEdz * Net[L+1].Neuron[output_neuron].W[h] Net[L].Neuron[h].dEdz = dEdy * dydz(Net[L].Neuron[h].Y) # 3. Update output layer neuron biases and weights: dE/dw = dE/dz * dz/dw for L in range(len(Net)): # Up through all layers print_info(2," Update weights in layer Net[%d]" % (L)) for n in range(len(Net[L].Neuron)): dEdb = Net[L].Neuron[n].dEdz * 1.0 # dE/db = dE/dz * dz/db; dz/db=1 (the bias is like a weight with a constant input of 1) Net[L].Neuron[n].B -= LEARNING_RATE * dEdb # db = epsilon * dE/db for w in range(len(Net[L].Neuron[n].W)): dEdw = Net[L].Neuron[n].dEdz * dzdw(Net[L].Neuron[n].X[w]) Net[L].Neuron[n].W[w] -= LEARNING_RATE * dEdw # dw = epsilon * dE/dw """ For reporting progress (to see if it working, or how well it is learning)""" def Total_Error(Net, training_sets): Etotal = 0 """ Use the first 8 training vectors as the validation set """ num_validation_vectors = 8 # There are only 8 vectors in the Full-Adder example for t in range(num_validation_vectors): training_inputs, training_outputs = training_sets[t] FeedForward_Net(training_inputs, Net) Etotal += 0.5*(training_outputs[0] - Net[output_layer].Neuron[0].Y)**2 Etotal += 0.5*(training_outputs[1] - Net[output_layer].Neuron[1].Y)**2 return Etotal Etotal = 0.0 for i in range(0, 100000): Training_Input, Training_Output = random.choice(Training_Set) if i%100==99: print_info(1,"Training iteration %d" % i, end="") print_info(3," %d+%d+%d" % (Training_Input[0], Training_Input[1], Training_Input[2]), end="") print_info(3," = %d%d" % (Training_Output[CARRY], Training_Output[SUM]), end="") print_info(1,"") Train_Net(Net, Training_Input, Training_Output) if i%100==99: Etotal = Total_Error(Net, Training_Set) print_info(1," Validation E = %.3f" % Etotal) if Etotal < 0.02: break """ See how it behaves now, after training. """ Test_Network(Net, Training_Set)