
This is repeated for a large number of known examples derived from historic information about the kind of problem the network is to solve.

The error vector E above is the difference between the target (or correct) output vector T and the actual output O when a particular input pattern is presented to the network.
[To show the 5-channel input pattern pictorially we would have to draw a graph in 5-dimensional vector space, which is rather hard to visualise.]
As can be seen from the above diagram, a vector has both a magnitude and a direction. The direction is denoted by the angle α in each case. In training the network our quest is simply to minimise the magnitude of the error vector E: its direction is immaterial.
The magnitude E of this vector is given by Pythagoras' Theorem:

In general, for J outputs, the magnitude E of an error vector E is given by the summation below where E is a vector in J-dimensional vector-space:


Furthermore, we know that F will always have a positive magnitude. It will therefore be easier to find the point at which it is closest to zero.
Conveniently, the general formula above gives the magnitude of the error vector already squared. Therefore, substituting for E we get:

Our task is to minimise F for every input pattern in the training data file.

For a fixed input pattern, a change in any one of the weights produces a change in the output pattern. Since the target output pattern is also fixed, a change in any one of the weights produces a corresponding change in the error function F. We can therefore regard F as a function of all the weights in the network, i.e.

$$F = f(w_1, w_2, \ldots, w_n)$$

where 'n' is the number of weights in the network. Holding all the others constant, we can therefore plot how varying one of the weights, w, causes the error function F to vary:

For the current values of all other weights, there should be a value for this weight for which F is a minimum. There could however be other false minima as illustrated above. When adjusting the weights we must try not to get trapped in one of these false minima.
What we need to know is the direction in which we must adjust a given weight in order to reduce F. Must we increase it or decrease it?

You can see above that for a weight value of w1 we must increase w to reduce F, and that for a weight value of w2 we must decrease w to reduce F. Each time we adjust one of the network's weights we must therefore 'add' to it an amount Δw which is opposite in sign to the slope of the curve at the point corresponding to the current value of that weight. But by how much? How big do we make Δw?

If we make the size of Δw proportional to the slope of the curve of F vs w at the point corresponding to the current value of w, this will make Δw large when F is far from a minimum and small when it is near to a minimum. So w is corrected rapidly when the error is large and slowly when the error is small. With luck this will make the correction process leap over any false minima and settle precisely on the true minimum. But it can't guarantee it.
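To see this rule at work on a single weight, here is a minimal sketch in 'C'. It assumes a made-up one-weight error curve F(w) = (w - 3)^2, whose slope is known exactly, and a proportionality constant called eta. The curve, the names F_slope and descend, and the constant eta are illustrative assumptions only: they are not taken from the network itself.

```c
#include <assert.h>

/* Hypothetical one-weight error curve F(w) = (w - 3)^2, minimum at w = 3.
   This returns its slope dF/dw at the given weight value. */
static double F_slope(double w)
{
    return 2.0 * (w - 3.0);
}

/* Apply 'steps' corrections to w. Each correction is opposite in sign
   to the slope and proportional to it; eta is the assumed constant
   of proportionality. */
double descend(double w, double eta, int steps)
{
    assert(eta > 0.0 && steps >= 0);
    for (int s = 0; s < steps; s++) {
        double Delta_w = -eta * F_slope(w); /* large on a steep slope, small on a gentle one */
        w += Delta_w;                       /* add the weight-increment to the weight */
    }
    return w;
}
```

Calling descend(0.0, 0.1, 200) walks w up towards 3 and calling descend(10.0, 0.1, 200) walks it down towards 3, the steps shrinking as the slope flattens. This toy curve has only one minimum, so it says nothing about escaping false minima.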

To adjust each of the network's weights so as to reduce F for a presented input pattern from the training data we must do what in effect is done by the following 'C' statement:
w += Delta_w; //add the weight-increment to the weight
The problem now is to find a viable method of computing ∂F/∂w for any weight w in the network.

where J is the number of output channels (which is the same as the number of neurons in the output layer).

After differentiation, the terms of the summation are zero for all except one value of j. The above is therefore the slope of the curve of F vs one output value o on a particular output pin j with all the other outputs held constant. To make the maths as simple as possible, we take a little mathematical licence and make C = 1/2 so the above equation becomes:

From the training data file, we know what the output signal t should be on a given output pin j for a given sample input pattern. We can observe the corresponding signal o which the network actually puts out.
The equation above therefore provides us with a measure of how an error in one of the network's output signals (an error in the output from a neuron in the output layer) contributes to the network's output error function F.
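As a rough illustration, the sketch below computes the error function and its slope against one output. It assumes F has the form C times the sum of the squared differences (t - o) over the J outputs, with C = 1/2, which is what the text above implies. The function names error_F and dF_do, and the use of floating point, are assumptions made for the example.

```c
#include <assert.h>

/* Error function for one pattern, assuming
   F = C * sum over j of (t[j] - o[j])^2 with C = 1/2. */
double error_F(const double *t, const double *o, int J)
{
    assert(J > 0);
    double sum = 0.0;
    for (int j = 0; j < J; j++) {
        double e = t[j] - o[j];   /* error on output pin j */
        sum += e * e;
    }
    return 0.5 * sum;
}

/* Slope of F against the output on pin j alone, all other outputs
   held constant. Only the j-th term of the summation survives, and
   with C = 1/2 it reduces to -(t[j] - o[j]). */
double dF_do(const double *t, const double *o, int j)
{
    return o[j] - t[j];
}
```

With targets {1.0, 0.0} and actual outputs {0.5, 0.5}, error_F gives 0.25 and dF_do gives -0.5 for pin 0: raising that output would reduce F.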

The way an error da in the activation level 'a' contributes to the network error is therefore given by:


Add corresponding small increments dy and dz to y and z:

Substitute the previous formula for y to get an expression for dy alone:

The first derivative is in effect the slope of the graph of the function at any given point which is dy/dz. Dividing both sides of the above formula by dz:

As dy and dz are made smaller and smaller while preserving their ratio, the terms which are multiplied by dz become insignificantly small compared with those that are not. Therefore the above formula becomes:

Now there is a rule called the chaining rule (usually known as the chain rule), which we will not go into here. It states that differential operators behave like algebraic variables in the sense that:

The whole point of the natural number e is that the function e raised to the power x always has the same value as its derivative, i.e.:

$$\frac{d}{dx}\,e^{x} = e^{x}$$

So by the chaining rule:

Since we have already computed y itself, it is convenient to express the derivative in terms of y rather than in x which is more complicated.

Therefore:

Run the program SIGDASH.EXE in the \SIGMOID directory on the Neuron Diskette to generate and display the graph of y = f '(x).
Finally, substituting our 'neural' variable names, we get the sigmoid's first derivative:

So the change in the network error F resulting from a unit change in the activation level of any neuron is:

To help us to visualise this, we represent the weight 'w' by an electrical rheostat which attenuates the input signal 'i'. An error dw in this weight is therefore represented by a small error in the setting of this rheostat.

The input signal to the neuron under consideration from a neuron 'i' in the previous layer is i.w. Therefore the error e in this input signal must be i.dw. The consequential error da in the neurone's activation level caused by dw must therefore be e / NI, where NI is the number of inputs to this neuron.
Putting these together:

We now have all the information we need to determine how an error dw in a neurone's input weight affects the total network error function F.
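The relationship between a weight error and the activation error can be checked with a short sketch. It assumes the activation level is the sum of the weighted inputs divided by NI, which is what the division by NI above implies, and it uses floating point for clarity rather than the integer arithmetic of the real network. The names activation and activation_error are made up for the example.

```c
#include <assert.h>

/* Activation level of a neuron, assumed to be the weighted inputs
   summed and then divided by the number of inputs NI. */
double activation(const double *in, const double *w, int NI)
{
    assert(NI > 0);
    double sum = 0.0;
    for (int n = 0; n < NI; n++)
        sum += in[n] * w[n];
    return sum / NI;
}

/* Error da in the activation level caused by an error dw in one
   weight whose input signal is i:  da = i.dw / NI */
double activation_error(double i, double dw, int NI)
{
    assert(NI > 0);
    return i * dw / NI;
}
```

With inputs {1.0, 2.0} and weights {0.5, 0.25} the activation is 0.5. Raising the second weight by 0.5 raises the activation to 1.0, and activation_error(2.0, 0.5, 2) predicts exactly that change of 0.5.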
By the chaining rule:

Substituting for ∂F/∂a and ∂a/∂w:

However, we can know the values of 'o' and ∂F/∂o directly only for neurons in the output layer:

Finding 'o' and ∂F/∂o for neurons in a hidden layer is more complicated.
Consider Neuron 0 in the hidden layer just before the output layer. We will refer to this layer as Layer 'j' and the output layer as Layer 'k' as shown below:

The change da in the activation level of Neuron k=0 caused by a change in the output do of Neuron j=0 is do times the weight w on the connecting link between them. Since we will be implementing our network using 16-bit integer arithmetic we must divide this by NI, the number of inputs to the neurons in Layer 'k', ie:

The error in the activation level of Neuron k=0 per unit error in the output of Neuron j=0 is therefore:

In general, therefore, for any link between Layer 'j' and Layer 'k' the above expression becomes:

What we are looking for is the contribution to the value of the network output error function F produced by the error in the output of Neuron j=0, namely:

By the chaining rule:

because the errors in both the links contribute to the error in the network's output. In general, therefore, for a network with an output layer containing K neurons:


Note: the index k is a different variable from the k in the second brace.
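The summation over the K output neurons might be coded along the following lines. This is a sketch under the assumptions stated in the text: each link from the hidden neuron contributes ∂F/∂a of the neuron it feeds, multiplied by the link's weight and divided by NI. The function name, the array names and the use of floating point are illustrative choices.

```c
#include <assert.h>

/* Contribution of one hidden neuron's output error to the network
   error F, i.e. dF/do for a neuron in Layer 'j'.
   dF_da_k[k] : dF/da already found for neuron k of Layer 'k'
   w_jk[k]    : weight on the link from this hidden neuron to neuron k
   K          : number of neurons in Layer 'k'
   NI         : number of inputs to each neuron in Layer 'k' */
double hidden_dF_do(const double *dF_da_k, const double *w_jk, int K, int NI)
{
    assert(K > 0 && NI > 0);
    double sum = 0.0;
    for (int k = 0; k < K; k++)
        sum += dF_da_k[k] * w_jk[k] / NI;   /* one term per outgoing link */
    return sum;
}
```

For two output neurons with ∂F/∂a values of 0.5 and 1.0, link weights of 2.0 and 1.0 and NI = 2, the two links contribute 0.5 each, giving 1.0 in total.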

We must find how the error dw in the weight w (see above) on one of Neuron j=0's input links affects the neurone's output. The internal function of hidden neurons is exactly the same as that of the neurons in the output layer. The formula for finding how an error in one of its input weights affects its output is therefore the same, namely:



and for hidden neurons:

In the case of output layer neurons, therefore, we have all we need to compute ∂F/∂w directly for each input weight of each neuron and 'add' to it the appropriate amount:

So before considering any of the hidden layers we will go ahead and adjust the input weights to the output layer.
But notice that the summation term in the 'hidden' formula needs the value of ∂F/∂a for each neuron in Layer 'k', the output layer (assuming for the moment that we are considering the hidden layer immediately before the output layer). We must therefore save the output layer's values of ∂F/∂a while we have them, i.e. while we are adjusting the output layer's weights. Then we can pick them up when we come to process the hidden layer.
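A sketch of this step for a single output neuron is given below. It adjusts the neuron's input weights and hands back ∂F/∂a so that the caller can save it. It assumes F = 1/2 times the sum of (t - o)^2, a plain logistic sigmoid (so that the slope of the output against the activation is o(1 - o)), an activation divided by NI, and a constant of proportionality called eta. The function name and floating-point arithmetic are assumptions made for the example.

```c
#include <assert.h>

/* Adjust the input weights of one output-layer neuron in place and
   return its dF/da for use when the hidden layer is processed.
   w, in : the neuron's NI weights and the NI signals arriving on them
   o, t  : the neuron's actual output and its target output
   eta   : assumed constant of proportionality for the weight change */
double adjust_output_neuron(double *w, const double *in, int NI,
                            double o, double t, double eta)
{
    assert(NI > 0);
    double dF_do = o - t;           /* assumes F = 1/2 * sum (t - o)^2 */
    double do_da = o * (1.0 - o);   /* assumes a plain logistic sigmoid */
    double dF_da = dF_do * do_da;   /* chaining the two slopes together */

    for (int n = 0; n < NI; n++) {
        double dF_dw = dF_da * in[n] / NI;  /* da/dw = i / NI */
        w[n] -= eta * dF_dw;                /* step against the slope */
    }
    return dF_da;                   /* caller saves this value */
}
```

With o = 0.5, t = 1.0, both inputs at 1.0 and eta = 1.0, the function returns -0.125 and raises each of the two weights by 0.0625, pushing the output towards its target.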
Similarly, while computing and expediting the weight adjustments for the hidden layer immediately behind the output layer, we save its values of ∂F/∂a, namely:

for when we come to compute and expedite the weight adjustments for the hidden layer immediately behind that one. And so on until we get to the first hidden layer in the network namely the one next to the input.
For each layer of the network
(working backwards from the output):
{
    for each neuron in the layer
    {
        compute ∂F/∂a;
        store it for use next pass of the loop;
        for each input weight to the neuron
        {
            compute ∂F/∂w;
            w -= η * ∂F/∂w; //adjust the weight
        }
    }
}
In practice, faster code can be produced by splitting the computation of ∂F/∂a between successive passes of the loop. A full implementation in 'C' of this training algorithm is covered in the document TRAIN2.WRI.