Training The Multi-Layer Perceptron

Training Method

A multi-layer perceptron is trained by presenting it with an input pattern for which the correct response is known and then observing the output. The difference between the correct and observed outputs is the network's overall output error. The neural connection weights within the network are then adjusted to compensate for this measured error.

This is repeated for a large number of known examples derived from historic information about the kind of problem the network is to solve.

Output Error Vector

The input and output patterns are each made up of a set of scalar values. Each scalar value can be imagined as an electrical voltage on one of the pins of an input or output connector. Each scalar value is independent of the others. This independence may be represented pictorially by drawing them at right-angles to each other in vector-space:

The error vector E above is the difference between the target (or correct) output vector T and the actual output O when a particular input pattern is presented to the network.

[To show the 5-channel input pattern pictorially we would have to draw a graph in 5-dimensional vector space which is rather hard to visualise.]

As can be seen from the above diagram, a vector has both a magnitude and a direction. The direction is denoted by the angle a in each case. In training the network our quest is simply to minimise the magnitude of the error vector E: its direction is immaterial.

The magnitude E of this vector is given by Pythagoras' Theorem:

In general, for J outputs, the magnitude E of an error vector E is given by the summation below where E is a vector in J-dimensional vector-space:
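Assuming t_j and o_j denote the target and observed signals on output pin j, with the pins numbered from 0 (symbols chosen here for illustration), the summation can be written as:

```latex
E^2 = \sum_{j=0}^{J-1} (t_j - o_j)^2
```

For J = 2 this reduces to Pythagoras' Theorem: E^2 = (t_0 - o_0)^2 + (t_1 - o_1)^2.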

Output Error Function

To correct large output errors quickly, while gradually easing small errors gently towards zero, we can define an error function which is proportional to the square of the actual error. The error function F thus rises ever more rapidly as the errors in the individual output signals increase:

Furthermore, we know that F will always have a positive magnitude. It will therefore be easier to find the point at which it is closest to zero.

Conveniently, the general formula above gives the magnitude of the error vector already squared. Therefore, substituting for E we get:

Our task is to minimise F for every input pattern in the training data file.
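As a minimal sketch, the error function for one presented pattern might be computed as follows. The function name error_function, the array names t and o, and the use of floating point are assumptions made here for illustration. C is the constant of proportionality.

```c
#include <assert.h>
#include <stddef.h>

/* Error function F = C * sum over j of (t[j] - o[j])^2 for J output channels.
   t = target outputs, o = observed outputs, C = constant of proportionality.
   (Names and the use of double are illustrative assumptions.) */
double error_function(const double *t, const double *o, size_t J, double C)
{
    assert(t != NULL && o != NULL && J > 0);
    double sum = 0.0;                 /* squared magnitude of the error vector */
    for (size_t j = 0; j < J; j++) {
        double e = t[j] - o[j];       /* error on output pin j */
        sum += e * e;
    }
    return C * sum;
}
```

Because every term is squared, the result can never be negative, whatever the signs of the individual errors.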

Weight Adjustment

While a given fixed input pattern P is being presented to the network, the output (and consequently the error function) can be varied by varying one or more of the weights of the inter-neural links:

For a fixed input pattern, a change in any one of the weights produces a change in the output pattern. Since the target output pattern is also fixed, a change in any one of the weights produces a corresponding change in the error function F. We can regard F therefore as a function of all the weights in the network, ie.

where 'n' is the number of weights in the network. Holding all the other weights constant, we can therefore plot how varying one of the weights, w, causes the error function F to vary:

For the current values of all other weights, there should be a value for this weight for which F is a minimum. There could however be other false minima as illustrated above. When adjusting the weights we must try not to get trapped in one of these false minima.

What we need to know is the direction in which we must adjust a given weight in order to reduce F. Must we increase it or decrease it?

Direction of Adjustment

To find out we must determine which way the curve is sloping at the point on the curve corresponding to the current value of w:

You can see above that for a weight value of w1 we must increase w to reduce F, and that for a weight value of w2 we must decrease w to reduce F. Each time we adjust one of the network's weights we must therefore 'add' to it an amount Dw which is opposite in sign to the slope of the curve at the point corresponding to the current value of that weight. But by how much? How big do we make Dw?

Size of Adjustment

Because F is the square of E we know it will always be positive. We also know that generally, anything squared accelerates in its rate of increase as it gets larger, for example:

If we make the size of Dw proportional to the slope of the curve of F vs w at the point corresponding to the current value of w this will make Dw large when F is far from a minimum and small when it is near to a minimum. So w is corrected rapidly when the error is large and slowly when the error is small. With luck this will make the correction process leap over any false minima and settle precisely on the true minimum. But it can't guarantee it.

To adjust each of the network's weights so as to reduce F for a presented input pattern from the training data we must do what in effect is done by the following 'C' statements:

Delta_w = -h * ðF/ðw;    //weight-increment: opposite in sign to the slope, scaled by a small positive constant h
w += Delta_w;    //add the weight-increment to the weight

The problem now is to find a viable method of computing ðF/ðw for any weight w in the network.
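Before tackling that, the adjustment rule itself can be tried out on a toy error curve. The sketch below assumes a hypothetical one-weight curve F(w) = (w - 3)^2, whose slope is known exactly, and a small positive constant h. Neither the curve nor the function names come from the network itself.

```c
#include <assert.h>

/* Slope of the toy error curve F(w) = (w - 3)^2, a hypothetical stand-in
   for the real curve of F vs w. Its single minimum is at w = 3. */
double toy_slope(double w)
{
    return 2.0 * (w - 3.0);
}

/* Apply the weight adjustment 'steps' times.
   h is an assumed small positive constant of proportionality. */
double descend(double w, double h, int steps)
{
    assert(h > 0.0 && steps >= 0);
    for (int s = 0; s < steps; s++) {
        double Delta_w = -h * toy_slope(w);  /* opposite in sign to the slope */
        w += Delta_w;                        /* add the weight-increment */
    }
    return w;
}
```

Starting on either side of the minimum, the steps are large while the slope is steep and shrink as the slope flattens, so w settles towards 3 from both directions.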

Computing ðF/ðw

To do this we must start with what we know and work back to what we don't know. For each presented input pattern P we know what the output T should be: this is provided in the training data file. We can measure O, the observed output. From these we can compute directly the magnitude of the error function F using the formula:

where J is the number of output channels (which is the same as the number of neurons in the output layer).

Output Error

The contribution to the value of the error function F of a small error do on 'output pin' j (ie the change in the error function F per unit change in the signal on one of the network's scalar outputs) will therefore be:

After differentiation, the terms of the summation are zero for all except one value of j. The above is therefore the slope of the curve of F vs one output value o on a particular output pin j with all the other outputs held constant. To make the maths as simple as possible, we take a little mathematical licence and make C = 1/2 so the above equation becomes:

From the training data file, we know what the output signal t should be on a given output pin j for a given sample input pattern. We can observe the corresponding signal o which the network actually puts out.

The equation above therefore provides us with a measure of how an error in one of the network's output signals (an error in the output from a neuron in the output layer) contributes to the network's output error function F.
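A minimal sketch of this slope, assuming the error function takes the form F = (1/2) * sum of (t - o)^2 (that is, C = 1/2), is given below. Under that assumption the slope on pin j works out to o - t. The function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Error function with C = 1/2: F = (1/2) * sum of (t[j] - o[j])^2.
   (Assumed form; names are illustrative.) */
double half_squared_error(const double *t, const double *o, size_t J)
{
    assert(t != NULL && o != NULL && J > 0);
    double sum = 0.0;
    for (size_t j = 0; j < J; j++)
        sum += (t[j] - o[j]) * (t[j] - o[j]);
    return 0.5 * sum;
}

/* Slope of F with respect to the signal on output pin j, all other outputs
   held constant. Differentiating (1/2)(t - o)^2 with respect to o
   gives -(t - o) = o - t. */
double dF_do(double t_j, double o_j)
{
    return o_j - t_j;
}
```

A negative result means the output is below its target, so raising the output would reduce F.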

Activation Error

Having found how the output 'o' from a single output neuron contributes to the network error pattern, we now take a step back to see how this translates into an error in the neurone's activation level 'a'. This is simply how the error in 'o' back-propagates through the neurone's sigmoid function:

The way an error da in the activation level 'a' contributes to the network error is therefore given by:

Sigmoid Function

So we now need to find the first derivative do/da of the sigmoid function. Below the sigmoid function is shown with the variables a and o replaced by the more familiar variable names x and y:

Add corresponding small increments dy and dz to y and z:

Substitute the previous formula for y to get an expression for dy alone:

The first derivative is in effect the slope of the graph of the function at any given point which is dy/dz. Dividing both sides of the above formula by dz:

As dy and dz are made smaller and smaller while preserving their ratio, the terms which are multiplied by dz become insignificantly small compared with those that are not. Therefore the above formula becomes:

Now there is a rule, called here the chaining rule (more commonly known as the chain rule), which we will not go into here. It states that differential operators can be treated like algebraic variables in the sense that:

The whole point of the natural number e is that the function e raised to the power x always has the same value as its derivative, ie:

So by the chaining rule:

Since we have already computed y itself, it is convenient to express the derivative in terms of y rather than in x which is more complicated.

Therefore:

Run the program SIGDASH.EXE in the \SIGMOID directory on the Neuron Diskette to generate and display the graph of y = f '(x).

Finally, substituting our 'neural' variable names, we get the sigmoid's first derivative:

So the change in the network error F resulting from a unit change in the activation level of any neuron is:

Summation Function

We now take a second step further back through the network to see how an error dw in one of a neurone's input weights 'w' translates into an error da in the neurone's activation level 'a'.

To help us to visualise this, we represent the weight 'w' by an electrical rheostat which attenuates the input signal 'i'. An error dw in this weight is therefore represented by a small error in the setting of this rheostat.

The input signal to the neuron under consideration from a neuron 'i' in the previous layer is i.w. Therefore the error e in this input signal must be i.dw. The consequential error da in the neurone's activation level caused by dw must therefore be e / NI, where NI is the number of inputs to this neuron.

Putting these together:
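The summation step can be sketched in C as follows. The sketch takes the activation level to be the sum of the weighted inputs divided by NI, as the reasoning above implies, and uses floating point for clarity rather than integer arithmetic. The function and array names are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Activation level: the mean of the weighted inputs, a = (sum of i*w) / NI.
   Floating point is used here for clarity only. */
double activation(const double *in, const double *w, size_t NI)
{
    assert(in != NULL && w != NULL && NI > 0);
    double sum = 0.0;
    for (size_t n = 0; n < NI; n++)
        sum += in[n] * w[n];
    return sum / (double)NI;
}

/* Change in activation per unit change in one weight: da/dw = i / NI,
   where i is the signal arriving on that weight's input. */
double da_dw(double i, size_t NI)
{
    assert(NI > 0);
    return i / (double)NI;
}
```

Because the activation is linear in each weight, a small change dw in one weight changes the activation by exactly dw * i / NI.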

We now have all the information we need to determine how an error dw in a neurone's input weight affects the total network error function F.

By the chaining rule:

Substituting for ðF/ða and ða/ðw:

However, we can only know directly the values of 'o' and ðF/ðo for neurons in the output layer:
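For an output neuron the whole chain can be sketched in a few lines of C. This assumes C = 1/2 (so that ðF/ðo = o - t) and the standard logistic sigmoid (so that ðo/ða = o(1 - o)). The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* dF/da for an output neuron: (o - t) times the sigmoid slope o(1 - o).
   Assumes C = 1/2 and a logistic sigmoid. */
double output_dF_da(double t, double o)
{
    assert(o >= 0.0 && o <= 1.0);
    return (o - t) * o * (1.0 - o);
}

/* dF/dw for one input weight of an output neuron:
   dF/da times da/dw, where da/dw = i / NI. */
double output_dF_dw(double t, double o, double i, size_t NI)
{
    assert(NI > 0);
    return output_dF_da(t, o) * i / (double)NI;
}
```

With t = 1, o = 0.5, i = 1 and NI = 2 the slope is negative, so the weight must be increased to reduce F.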

Finding 'o' and ðF/ðo for neurons in a hidden layer is more complicated.

ðF/ðo for Hidden Neurons

The error in the output of a hidden neuron affects all the network's output signals via its links to the neurons in the next layer. We need therefore to consider the error in the output of a neuron in a hidden layer.

Consider Neuron 0 in the hidden layer just before the output layer. We will refer to this layer as Layer 'j' and the output layer as Layer 'k' as shown below:

The change da in the activation level of Neuron k=0 caused by a change in the output do of Neuron j=0 is do times the weight w on the connecting link between them. Since we will be implementing our network using 16-bit integer arithmetic we must divide this by NI, the number of inputs to the neurons in Layer 'k', ie:

The error in the activation level of Neuron k=0 per unit error in the output of Neuron j=0 is therefore:

In general, therefore, for any link between Layer 'j' and Layer 'k' the above expression becomes:

What we are looking for is the contribution to the value of the network output error function F produced by the error in the output of Neuron j=0, namely:

By the chaining rule:

because the errors in both the links contribute to the error in the network's output. In general, therefore, for a network with an output layer containing K neurons:
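This summation over the output layer can be sketched in C as follows. The function name, the array names and the use of floating point are assumptions made for illustration. The values of ðF/ða for the neurons in Layer 'k' are taken as already computed.

```c
#include <assert.h>
#include <stddef.h>

/* dF/do for one neuron j in the hidden layer just before the output layer.
   dF_da_k[k] : dF/da of output neuron k (already computed)
   w_jk[k]    : weight on the link from hidden neuron j to output neuron k
   K          : number of output neurons
   NI         : number of inputs to each neuron in Layer 'k' */
double hidden_dF_do(const double *dF_da_k, const double *w_jk, size_t K, size_t NI)
{
    assert(dF_da_k != NULL && w_jk != NULL && K > 0 && NI > 0);
    double sum = 0.0;
    for (size_t k = 0; k < K; k++)
        sum += dF_da_k[k] * w_jk[k] / (double)NI;  /* dF/da_k times da_k/do_j */
    return sum;
}
```

Each term is the error contribution arriving back along one link, so a hidden neuron with strong links to badly wrong output neurons receives the largest share.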

ðF/ða for Hidden Neurons

Next we have to find how this error in a hidden neurone's output translates into an error in its activation level. Using the chaining rule and substituting for ðF/ðo and ðo/ða (as we did previously for an output neuron) we get:

Note: the index k is a different variable from the k in the second brace.

Hidden Neurone's Inputs

We must now back-track further to consider the weights on the inputs to Neuron j=0:

We must find how the error dw in the weight w (see above) on one of Neuron j=0's input links affects the neurone's output. The internal function of hidden neurons is exactly the same as that of the neurons in the output layer. The formula for finding how an error in one of its input weights affects its output is therefore the same, namely:

ðF/ðw for Hidden Neurons

We now have all the information we need to determine how an error dw in a hidden neurone's input weight affects the total network error function F. So by the chaining rule:

Rationalisation

We need two separate expressions for computing ðF/ðw:
for output neurons:

and for hidden neurons:
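Collecting the earlier steps, and assuming C = 1/2 and the standard logistic sigmoid (so that ðo/ða = o(1 - o)), the two expressions can be written as follows. The subscripts, and the symbols NI_j and NI_k for the number of inputs to neurons in Layers 'j' and 'k', are notation introduced here for illustration:

```latex
\text{output neuron } k:\quad
\frac{\partial F}{\partial w} = (o_k - t_k)\; o_k (1 - o_k)\; \frac{i}{NI_k}

\text{hidden neuron } j:\quad
\frac{\partial F}{\partial w} =
\left\{ \sum_{k=0}^{K-1} \frac{\partial F}{\partial a_k}\, \frac{w_{jk}}{NI_k} \right\}
o_j (1 - o_j)\; \frac{i}{NI_j}
```

In both cases i is the signal arriving on the input whose weight is being adjusted.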

Adjusting Output Weights

When a pattern is presented to the network, the network function mlp() computes and stores the outputs of all the neurons in the network starting with the first hidden layer and progressing through to the output layer. We therefore know o and i for every neuron in the network.

In the case of output layer neurons therefore we have all we need directly to compute ðF/ðw for each input weight of each neuron and 'add' to it the appropriate amount:

So before considering any of the hidden layers we will go ahead and adjust the input weights to the output layer.

Adjusting Hidden Weights

We also know o and i for hidden neurons and, having just adjusted the input weights to the output layer, we know these also.

But notice that the summation term in the 'hidden' formula needs the value of ðF/ða for each neuron in Layer 'k', the output layer (assuming for the moment that we are considering the hidden layer immediately before the output layer). We must therefore save the output layer's values of ðF/ða while we have them ie while we are adjusting the output layer's weights. Then we can pick them up when we come to process the hidden layer.

Back-Propagation

This process of computing and expediting the weight adjustments for the output layer, and then using the results to compute and expedite the weight adjustments for the hidden layer behind it, is called back-propagation.

Similarly, while computing and expediting the weight adjustments for the hidden layer immediately behind the output layer, we save its values of ðF/ða namely:

for when we come to compute and expedite the weight adjustments for the hidden layer immediately behind that one. And so on until we get to the first hidden layer in the network namely the one next to the input.

Training Algorithm

The training algorithm for adjusting the weights in response to the error F which results when a given training pattern P is processed by the mlp() function is shown in a kind of pseudo 'C' below:

For each layer of the network
(working backwards from the output):
{
  for each neuron in the layer
  {
    compute ðF/ða;
    store it for use in the next pass of the loop;

    for each input weight to the neuron
    {
      compute ðF/ðw;
      w -= h * ðF/ðw;      //adjust the weight (h is the constant of proportionality)
    }
  }
}

In practice, faster code can be produced by splitting the computation of ðF/ða between successive passes of the loop. Full implementation in 'C' of this training algorithm is covered in the document TRAIN2.WRI.
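As a rough illustration only, and not the implementation in TRAIN2.WRI, the loop can be written out for a small network with one hidden layer. The sketch assumes floating point rather than 16-bit integer arithmetic, C = 1/2, a logistic sigmoid, and neuron outputs already computed and stored for the presented pattern. The layer sizes, the function name train_pass and the array names are all hypothetical.

```c
#include <assert.h>

#define N_IN  2   /* network inputs (hypothetical sizes) */
#define N_HID 2   /* hidden-layer neurons */
#define N_OUT 1   /* output-layer neurons */

/* One training pass for a pattern whose neuron outputs are already stored.
   in  : input pattern           hid : stored hidden-layer outputs
   out : stored network outputs  t   : target outputs
   w_hid[j][i] : weight from input i to hidden neuron j
   w_out[k][j] : weight from hidden neuron j to output neuron k
   h   : constant of proportionality (learning rate) */
void train_pass(const double in[N_IN], const double hid[N_HID],
                const double out[N_OUT], const double t[N_OUT],
                double w_hid[N_HID][N_IN], double w_out[N_OUT][N_HID],
                double h)
{
    double dF_da_out[N_OUT];   /* saved for the hidden layer's pass */
    assert(h > 0.0);

    /* Output layer: compute dF/da, save it, then adjust each input weight. */
    for (int k = 0; k < N_OUT; k++) {
        dF_da_out[k] = (out[k] - t[k]) * out[k] * (1.0 - out[k]);
        for (int j = 0; j < N_HID; j++)
            w_out[k][j] -= h * dF_da_out[k] * hid[j] / N_HID;
    }

    /* Hidden layer: dF/do is summed over the links to the output layer,
       using the output weights as they stand after the adjustment above. */
    for (int j = 0; j < N_HID; j++) {
        double dF_do = 0.0;
        for (int k = 0; k < N_OUT; k++)
            dF_do += dF_da_out[k] * w_out[k][j] / N_HID;
        double dF_da = dF_do * hid[j] * (1.0 - hid[j]);
        for (int i = 0; i < N_IN; i++)
            w_hid[j][i] -= h * dF_da * in[i] / N_IN;
    }
}
```

The sketch follows the order described in the text, adjusting the output layer first and then reading those adjusted weights when processing the hidden layer. Many implementations instead compute every ðF/ðw from the unadjusted weights before changing any of them.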