Notes from Source: http://neuralnetworksanddeeplearning.com/chap3.html accessed on 6/20/2016
- Mini-batches can be computed in bulk with matrix math instead of looping over the examples one at a time! Look into this.
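A minimal sketch of the idea, assuming a single fully-connected layer: stacking the examples of a batch as the columns of a matrix {$ X $} turns the per-example computation {$ z = Wx + b $} into one matrix product {$ Z = WX + b $}. The layer sizes, weights, and function names below are made up for illustration, and a linear algebra library would normally do the product.

```python
# Hypothetical layer with 2 inputs and 3 neurons; all numbers are illustrative.
W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]   # 3 x 2 weight matrix
b = [0.0, 0.1, -0.1]                          # one bias per neuron

def layer_single(x):
    """Weighted input z = W x + b for one example x (length 2)."""
    return [sum(w * x_k for w, x_k in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def layer_batch(X):
    """Weighted inputs Z = W X + b for a whole batch.

    X holds one example per column (2 x m), so Z is 3 x m and
    column i of Z is the weighted input for example i.
    """
    m = len(X[0])
    return [[sum(row[k] * X[k][i] for k in range(len(row))) + b_j
             for i in range(m)]
            for row, b_j in zip(W, b)]

examples = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
X = [list(col) for col in zip(*examples)]     # stack examples as columns
Z = layer_batch(X)

# Each column of Z should match the one-at-a-time result.
for i, x in enumerate(examples):
    column_i = [Z[j][i] for j in range(len(W))]
    print(column_i, layer_single(x))
```

In pure Python this shows only that the two forms agree; the speed-up comes from handing the product to optimized matrix routines.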
- Initialization and regularization
We've been using {$ \mu = 0 $} and {$ \sigma = 1 $} to initialize both biases and weights using a Gaussian distribution.
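However, there is some evidence to support drawing the initial weights from a narrower Gaussian, so that neurons are less likely to start out saturated. In particular, for a neuron with {$ n_{in} $} input weights:
{$$ \sigma_w = {1 \over \sqrt{ n_{in} } } $$}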
This paper is referenced for this and many other learning params: https://arxiv.org/pdf/1206.5533v2.pdf
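To see why the narrower Gaussian helps, here is a rough sketch comparing the spread of the weighted input {$ z = \sum_j w_j x_j $} under both schemes. It assumes every input is 1 and ignores the bias; the layer sizes and function names are made up for illustration.

```python
import math
import random
import statistics

def init_weights(n_in, n_out, scaled, rng):
    """Draw an n_out x n_in weight matrix from a zero-mean Gaussian.

    scaled=True uses sigma = 1/sqrt(n_in); scaled=False uses sigma = 1.
    """
    sigma = 1.0 / math.sqrt(n_in) if scaled else 1.0
    return [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]

def weighted_input_spread(weights):
    """Std. dev. across neurons of z = sum_j w_j * x_j, with every x_j = 1."""
    return statistics.pstdev([sum(row) for row in weights])

rng = random.Random(0)      # fixed seed so runs are repeatable
n_in, n_out = 1000, 200     # hypothetical layer sizes

spread_unscaled = weighted_input_spread(init_weights(n_in, n_out, False, rng))
spread_scaled = weighted_input_spread(init_weights(n_in, n_out, True, rng))
print("sigma = 1:          spread of z =", spread_unscaled)
print("sigma = 1/sqrt(n):  spread of z =", spread_scaled)
```

With {$ \sigma = 1 $} the spread of {$ z $} grows like {$ \sqrt{n_{in}} $}, so many neurons start deep in the flat part of the sigmoid; with the scaled version the spread stays near 1.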
To help fight overfitting: L2 regularization, L1 regularization, and dropout
Compared to L2, L1 shrinks a weight less when {$ |w| $} is large and more when {$ |w| $} is small. That is because L1 shrinks every weight by a constant amount towards 0, while L2 shrinks it by an amount proportional to {$ w $}.
So L1 tends towards a few large weights and many 0 weights, by nuking the small ones and not nuking the big ones as much.
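A small sketch of just the shrinkage part of each update rule from the chapter, with the gradient of the unregularized cost taken to be zero so that only the regularization term acts. The parameter names `eta`, `lmbda`, and `n` (training set size) and the values used are illustrative; clamping the L1 result at 0 when a step would flip the sign is my own assumption, not something from the source.

```python
def l2_shrink(w, eta, lmbda, n):
    """L2 weight decay: rescale w by a factor (1 - eta*lmbda/n)."""
    return w * (1.0 - eta * lmbda / n)

def l1_shrink(w, eta, lmbda, n):
    """L1 shrinkage: move w towards 0 by the constant eta*lmbda/n."""
    step = eta * lmbda / n
    if abs(w) <= step:
        return 0.0          # assumption: clamp instead of overshooting past 0
    return w - step if w > 0 else w + step

eta, lmbda, n = 0.5, 0.1, 1   # hypothetical values, chosen to make the effect visible

for w in (5.0, 0.01):
    print("w =", w, " L2 ->", l2_shrink(w, eta, lmbda, n),
          " L1 ->", l1_shrink(w, eta, lmbda, n))
```

For the large weight L2 takes off more than L1 does; for the small weight L1 takes it all the way to 0 while L2 barely changes it.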
Tips: I've listed these as gross (basically the tuning you need to do to figure out what you want to use) and fine (small adjustments to eke out the last few percentage points of accuracy).
Key: [E]arly Tuning, [F]ine Tuning, [M]ust, [O]ptional
- [M] Measure learning by classification accuracy on data not in the training set
- [E] Tune parameters using a smaller training set for speed - Gross Tuning
- [E] Find the order of magnitude of the learning rate first by trying factors of 10, then use a high-low search to narrow in on the largest value at which learning still occurs (the threshold) - Gross Tuning
- [E] Set {$ \eta $} to about half of that threshold, so learning still occurs with some safety margin - Gross Tuning
- [E][O]ption: Stopping Early: (more on paper) If the validation accuracy doesn't improve in {$n$} (10, 20, 50, ...) epochs then stop training. Increase {$n$} as you become more sure of the parameters. - Gross to Fine Tuning
- [O]ption: Learning Rate Scheduling: (more on paper) Hold {$ \eta $} constant until the validation accuracy stops improving, then decrease it (say by a factor of 2 or 10) and keep training. Fine Tuning.
- [F] Regularization Parameter {$\lambda$} - Find the value for {$\eta$} first (with {$ \lambda = 0 $})! Then use factors of 10 to find {$\lambda$}, starting with 1. Then cycle back and re-tune {$\eta$} with that {$\lambda$}. Fine Tuning
- [F] Sizes of batches: Try different sizes and figure out which gives the fastest increase in validation accuracy per unit of real elapsed time (not per epoch). Use that size. Fine Tuning
- [O] Automating Learning: There are some papers where folks try to automate learning the parameters above. Explore further if wanted
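The early stopping tip above can be sketched as a "no improvement in {$n$} epochs" rule. This is a minimal illustration that works on a pre-recorded list of validation accuracies instead of a real training loop; the function name, the `patience` parameter (the {$n$} above), and the accuracy values are all made up.

```python
def early_stopping_epoch(val_accuracies, patience=10):
    """Return (stop_epoch, best_epoch, best_accuracy).

    Training stops at the first epoch that is `patience` or more epochs
    past the best validation accuracy seen so far. If that never happens,
    the last epoch is returned as the stopping point.
    """
    best = float("-inf")
    best_epoch = -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, best_epoch = acc, epoch      # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch, best     # no improvement for too long
    return len(val_accuracies) - 1, best_epoch, best

# Hypothetical validation accuracies, one per epoch.
history = [0.90, 0.92, 0.95, 0.94, 0.95, 0.93, 0.94, 0.90]
stop_epoch, best_epoch, best_acc = early_stopping_epoch(history, patience=3)
print("stop at epoch", stop_epoch, "best was epoch", best_epoch, "with", best_acc)
```

A tie with the best accuracy does not count as an improvement here, which is a design choice; a real loop would also keep a copy of the weights from the best epoch.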
- Using the Second Derivative!
- Hessian technique: use the Hessian matrix {$H$} of second partial derivatives of the cost, instead of plain gradient descent:
{$$ H_{jk} = { \partial^2C \over \partial w_j \partial w_k } $$}
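{$$ C(w+ \Delta w) \approx C(w) + \nabla C \cdot \Delta w + {1 \over 2} \Delta w^T H \Delta w $$}
Minimizing the right-hand side with respect to {$ \Delta w $} gives {$ \Delta w = -H^{-1} \nabla C $}. Scaling that by a learning rate {$ \eta $} gives steps according to: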
{$$ \Delta w = - \eta H^{-1} \nabla C $$}
- But Hessian matrices become large and thus hard to compute and invert. {$n$} weights and biases require a Hessian with {$n^2$} entries.
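As a sanity check on the step rule above, here is a sketch on a toy two-parameter quadratic cost {$ C(w) = {1 \over 2} w^T A w - b^T w $}, whose Hessian is exactly {$ A $}. The matrix `A`, the vector `b`, and the starting point are invented for illustration. Because the quadratic approximation is exact here, a single step with {$ \eta = 1 $} should land where the gradient is zero.

```python
# Toy quadratic cost C(w) = 0.5 * w^T A w - b^T w (values are illustrative).
A = [[3.0, 1.0], [1.0, 2.0]]   # symmetric positive definite, so C has a minimum
b = [1.0, 1.0]

def grad_C(w):
    """Gradient of the toy cost: A w - b."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix via the closed-form formula."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def hessian_step(w, eta=1.0):
    """One update w + delta_w with delta_w = -eta * H^{-1} grad C, where H = A."""
    g = grad_C(w)
    H_inv = inverse_2x2(A)
    return [w[0] - eta * (H_inv[0][0] * g[0] + H_inv[0][1] * g[1]),
            w[1] - eta * (H_inv[1][0] * g[0] + H_inv[1][1] * g[1])]

w_start = [5.0, -3.0]
w_new = hessian_step(w_start)
print("after one step:", w_new, "gradient there:", grad_C(w_new))
```

With 2 parameters the Hessian has 4 entries and inverting it is trivial; the {$ n^2 $} growth is what makes this impractical for real networks.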
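- Momentum-based gradient descent: Still uses only the gradient, but the gradient changes a velocity instead of changing the position directly. That brings in some information about how the gradient is changing without needing the Hessian.
For each weight, we add a velocity variable that describes how fast we're moving in that direction.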
{$$ v' = \mu v - \eta \nabla C $$}
{$$ w' = w + v' $$}
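- {$ \mu $} controls how much of the previous velocity is kept, so {$ 1 - \mu $} acts like friction. If it is 0, we revert to plain gradient descent, but when it is close to 1 there is little friction and {$v$} can keep building up, which speeds things up but risks overshooting. It is another hyper-parameter, often called the momentum co-efficient. (This {$ \mu $} is not the mean {$ \mu $} used for initialization above.)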
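A minimal sketch of the two update equations above on a one-dimensional toy cost {$ C(w) = {1 \over 2} w^2 $}, whose gradient is just {$ w $}. The cost, the starting point, and the hyper-parameter values are chosen only for illustration.

```python
def grad_C(w):
    """Gradient of the toy cost C(w) = 0.5 * w**2."""
    return w

def momentum_descent(w, eta, mu, steps):
    """Apply v' = mu*v - eta*grad C and w' = w + v' for a number of steps."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - eta * grad_C(w)
        w = w + v
    return w

def gradient_descent(w, eta, steps):
    """Plain gradient descent, w' = w - eta*grad C, for comparison."""
    for _ in range(steps):
        w = w - eta * grad_C(w)
    return w

w0, eta, steps = 1.0, 0.01, 50   # hypothetical settings

print("plain gradient descent:", gradient_descent(w0, eta, steps))
print("momentum with mu = 0.0:", momentum_descent(w0, eta, 0.0, steps))
print("momentum with mu = 0.9:", momentum_descent(w0, eta, 0.9, steps))
```

With {$ \mu = 0 $} the two functions take identical steps. With {$ \mu = 0.9 $} and this small {$ \eta $}, the velocity builds up and the weight ends much closer to the minimum at 0 in the same number of steps.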
- tanh! - hyperbolic tangent function. Note that the sigmoid is just a shifted and rescaled tanh: {$ \sigma(z) = {1 + \tanh( {z \over 2}) \over 2} $}
- max(0,z) - rectified linear unit
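A quick numerical check of the tanh identity above, plus the rectified linear unit written out. The function names are my own.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_via_tanh(z):
    """Sigmoid written as a shifted, rescaled tanh: (1 + tanh(z/2)) / 2."""
    return (1.0 + math.tanh(z / 2.0)) / 2.0

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

for z in (-5.0, -1.0, 0.0, 0.5, 3.0):
    print(z, sigmoid(z), sigmoid_via_tanh(z), relu(z))
```

The two sigmoid columns should agree up to floating-point rounding, while tanh itself ranges over (-1, 1) instead of (0, 1).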