Deriving the Learning Correction
For gradient descent, we need to derive the update to the matrix $\mathbf{A}$ resulting from a single pair of data from the training set, $(\mathbf{x}^k, \mathbf{y}^k)$.

Let's start with our cost function:

$$f(\mathbf{A}) = \sum_i \left ( z_i - y_i^k \right )^2$$

where we'll refer to the product $\mathbf{A}\mathbf{x}^k$ as $\tilde{\mathbf{z}}^k$, so the output of the network is $\mathbf{z}^k = g(\tilde{\mathbf{z}}^k)$, with the activation function $g$ applied element by element.
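For concreteness, here is a minimal NumPy sketch of evaluating this cost for a single training pair, assuming the sigmoid activation we use below (the names `sigmoid` and `cost` are just illustrative):

```python
import numpy as np

def sigmoid(xi):
    """the activation function g, applied element by element"""
    return 1.0 / (1.0 + np.exp(-xi))

def cost(A, x, y):
    """f(A) = sum_i (z_i - y_i)^2 for a single training pair (x, y),
    with z = g(A x)"""
    z = sigmoid(A @ x)
    return np.sum((z - y)**2)
```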
We can compute the derivative with respect to a single matrix element, $A_{pq}$:

$$\frac{\partial f}{\partial A_{pq}} = \sum_i 2 \left ( z_i - y_i^k \right ) \frac{\partial z_i}{\partial A_{pq}}$$

with

$$\frac{\partial z_i}{\partial A_{pq}} = \frac{\partial g(\tilde{z}_i)}{\partial \tilde{z}_i} \frac{\partial \tilde{z}_i}{\partial A_{pq}} = \frac{\partial g(\tilde{z}_i)}{\partial \tilde{z}_i} \, x_q^k \, \delta_{ip}$$

and for $g$ the sigmoid function,

$$\frac{\partial g(\tilde{z}_i)}{\partial \tilde{z}_i} = g(\tilde{z}_i) \left [ 1 - g(\tilde{z}_i) \right ] = z_i (1 - z_i)$$

which gives us:

$$\frac{\partial f}{\partial A_{pq}} = 2 \left ( z_p - y_p^k \right ) z_p (1 - z_p) \, x_q^k$$

where we used the fact that the Kronecker delta, $\delta_{ip}$, means that only the $i = p$ term survives the sum.
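As a sanity check on this expression, we can compare it against a centered finite-difference estimate of the same derivative; a small sketch with assumed (arbitrary) array sizes:

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

rng = np.random.default_rng(12345)
A = rng.normal(size=(3, 4))    # N_out x N_in
x = rng.normal(size=4)
y = rng.normal(size=3)

# analytic derivative with respect to the single element A_{pq}
p, q = 1, 2
z = sigmoid(A @ x)
analytic = 2.0 * (z[p] - y[p]) * z[p] * (1.0 - z[p]) * x[q]

# centered finite-difference estimate of the same derivative
eps = 1.e-6
Ap = A.copy(); Ap[p, q] += eps
Am = A.copy(); Am[p, q] -= eps
fd = (np.sum((sigmoid(Ap @ x) - y)**2) -
      np.sum((sigmoid(Am @ x) - y)**2)) / (2.0 * eps)

print(analytic, fd)    # the two should agree closely
```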
Note that:

$$\mathbf{e}^k = \mathbf{z}^k - \mathbf{y}^k$$

is the error on the output layer, and the correction is proportional to the error (as we would expect). The $k$ superscripts here remind us that this is the result of only a single pair of data from the training set.
Now we can assemble the derivative with respect to the entire matrix:

$$\frac{\partial f}{\partial \mathbf{A}} = 2 \, \mathbf{e}^k \circ \mathbf{z}^k \circ (1 - \mathbf{z}^k) \cdot (\mathbf{x}^k)^\intercal$$

where the operator $\circ$ represents element-by-element multiplication and the final term is an outer product with $\mathbf{x}^k$.
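In NumPy this full gradient is a one-line outer product; a minimal sketch for a single training pair (reusing the illustrative `sigmoid` helper from above):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def grad_single(A, x, y):
    """df/dA for a single training pair (x, y), with z = g(A x)"""
    z = sigmoid(A @ x)
    e = z - y    # error on the output layer
    # 2 e o z o (1 - z) is a vector over the outputs; the outer product
    # with x gives one entry for each matrix element A_pq
    return np.outer(2.0 * e * z * (1.0 - z), x)
```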
Performing the update
We could do the update like we just saw with our gradient descent example: take a single data point, $(\mathbf{x}^k, \mathbf{y}^k)$, and minimize the cost function completely with respect to it before moving on to the next point.
Instead we take multiple passes through the training data (called epochs) and apply only a single push in the direction that gradient descent suggests, scaled by a learning rate, $\eta$.
The overall minimization appears as:

- Loop over the training data. We'll refer to the current training pair as $(\mathbf{x}^k, \mathbf{y}^k)$.

  - Propagate $\mathbf{x}^k$ through the network, getting the output $\mathbf{z}^k = g(\mathbf{A} \mathbf{x}^k)$.

  - Compute the error on the output layer, $\mathbf{e}^k = \mathbf{z}^k - \mathbf{y}^k$.

  - Update the matrix $\mathbf{A}$ according to:

    $$\mathbf{A} \leftarrow \mathbf{A} - 2 \, \eta \, \mathbf{e}^k \circ \mathbf{z}^k \circ (1 - \mathbf{z}^k) \cdot (\mathbf{x}^k)^\intercal$$
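Putting the pieces together, a minimal sketch of this loop in NumPy might look like the following (the function names, the default learning rate and number of epochs, and the layout of the training data as a list of `(x, y)` array pairs are assumptions for illustration):

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def train(A, pairs, eta=0.1, n_epochs=10):
    """loop over the training pairs for several epochs, applying a single
    gradient-descent push per pair, scaled by the learning rate eta"""
    A = A.copy()
    for epoch in range(n_epochs):
        for x, y in pairs:
            z = sigmoid(A @ x)    # propagate x^k through the network
            e = z - y             # error on the output layer, e^k = z^k - y^k
            # A <- A - 2 eta [e o z o (1 - z)] (x^k)^T
            A -= 2.0 * eta * np.outer(e * z * (1.0 - z), x)
    return A
```

Updating after each pair as we encounter it, rather than accumulating the gradient over the whole training set, is what is usually called stochastic gradient descent.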