1. Why Normalize Data?
- Normalization makes training more stable.
- When the data isn't centered, gradients become more sensitive to small weight changes and updates can bounce around.
- Centering to zero mean and scaling to unit variance make optimization smoother (see the sketch after this list).
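A minimal sketch of zero-mean, unit-variance scaling (standardization) with NumPy; the toy feature matrix `X` and its values are made up for illustration:

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# on very different scales.
X = np.array([[200.0, 0.01],
              [180.0, 0.03],
              [220.0, 0.02]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std      # zero mean, unit variance per feature

print(X_norm.mean(axis=0))     # ~0 for each column
print(X_norm.std(axis=0))      # ~1 for each column
```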
2–4. The Problems with SGD
Stochastic Gradient Descent (SGD) is simple but has some practical issues:
Condition Number:
- In high dimensions, the loss surface can look like a "taco" or "banana": a long, narrow valley.
- Plain gradient steps then zig-zag across the steep direction instead of moving along the valley, so convergence is slow (see the sketch below).
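A small sketch of this zig-zag behaviour on an ill-conditioned quadratic; the curvatures `a`, `b` and the learning rate are assumed values, picked only to make the effect visible:

```python
import numpy as np

# Ill-conditioned quadratic loss: f(w) = 0.5 * (a * w1^2 + b * w2^2), a >> b.
# The surface is a long narrow valley, so plain gradient descent oscillates
# across the steep direction while creeping along the shallow one.
a, b = 50.0, 1.0            # condition number ~ a / b = 50 (assumed)
w = np.array([1.0, 1.0])
lr = 0.035                  # near the stability limit 2 / a, so steps bounce

for step in range(10):
    grad = np.array([a * w[0], b * w[1]])
    w = w - lr * grad
    print(step, w)          # w[0] flips sign each step; w[1] shrinks slowly
```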
Saddle Points & Local Minima:
- SGD can stall at saddle points or get trapped in local minima, because a plain gradient step has no momentum to carry it through regions where the gradient is tiny or zero (see the sketch below).
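A toy sketch of getting stuck, using the textbook saddle f(x, y) = x^2 - y^2; the starting point and learning rate are assumptions chosen so the iterate sits exactly on the flat direction:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at (0, 0).
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 0.0])    # start on the x-axis, where the y-gradient is 0
lr = 0.1
for _ in range(100):
    p = p - lr * grad(p)

print(p)  # ~[0, 0]: stuck at the saddle, even though lower loss exists along y
```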
Stochastic Noise:
- Small-batch updates give noisy gradient estimates, and that noise slows down convergence (see the sketch below).
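A rough sketch of gradient noise from small batches, on a made-up 1-D least-squares problem; the data, batch size, and random seed are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: per-sample loss 0.5 * (w * x_i - y_i)^2,
# so the per-sample gradient is (w * x_i - y_i) * x_i.
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.5, size=1000)
w = 0.0

def batch_grad(idx):
    return np.mean((w * x[idx] - y[idx]) * x[idx])

full = batch_grad(np.arange(len(x)))                    # full-batch gradient
small = [batch_grad(rng.choice(len(x), size=8)) for _ in range(5)]

print("full-batch gradient:   ", round(full, 3))
print("batch-size-8 estimates:", np.round(small, 3))    # scattered around it
```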
5. Momentum
- Momentum adds a velocity term to SGD, giving each update a "push" from past gradients.
- This helps the iterate roll past small bumps and saddle points.
- It also smooths out gradient noise over time (see the sketch below).
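A minimal sketch of the classical (heavy-ball) momentum update, run on the same assumed ill-conditioned quadratic as above; the learning rate and momentum coefficient are illustrative choices, not values from these notes:

```python
import numpy as np

a, b = 50.0, 1.0
w = np.array([1.0, 1.0])
v = np.zeros_like(w)
lr, mu = 0.01, 0.9          # assumed step size and momentum coefficient

for step in range(50):
    grad = np.array([a * w[0], b * w[1]])
    v = mu * v - lr * grad  # velocity: exponentially weighted sum of gradients
    w = w + v               # the "push": move along the accumulated velocity

print(w)  # both coordinates near 0; plain SGD at lr=0.01 would still have w[1] ~ 0.6
```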