1. Why Normalize Data?
- Normalization makes training more stable.
- When the data isn't centered, gradients become more sensitive to small weight changes and updates can bounce around.
- Centering to zero mean and scaling to unit variance make optimization smoother (see the sketch after this list).
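A minimal sketch of zero-mean, unit-variance scaling (standardization) with NumPy; the toy feature matrix `X` and its values are made up for illustration:

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# on very different scales.
X = np.array([[200.0, 0.01],
              [180.0, 0.03],
              [220.0, 0.02]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std      # zero mean, unit variance per feature

print(X_norm.mean(axis=0))     # ~0 for each column
print(X_norm.std(axis=0))      # ~1 for each column
```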
2–4. The Problems with SGD
Stochastic Gradient Descent (SGD) is simple but has some practical issues:
Condition Number:
- In high dimensions, the loss surface can look like a "taco" or "banana": a long, narrow valley.
- Plain gradient steps then zig-zag across the steep direction instead of moving along the valley, so convergence is slow (see the sketch below).
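A small sketch of this zig-zag behaviour on an ill-conditioned quadratic; the curvatures `a`, `b` and the learning rate are assumed values, picked only to make the effect visible:

```python
import numpy as np

# Ill-conditioned quadratic loss: f(w) = 0.5 * (a * w1^2 + b * w2^2), a >> b.
# The surface is a long narrow valley, so plain gradient descent oscillates
# across the steep direction while creeping along the shallow one.
a, b = 50.0, 1.0            # condition number ~ a / b = 50 (assumed)
w = np.array([1.0, 1.0])
lr = 0.035                  # near the stability limit 2 / a, so steps bounce

for step in range(10):
    grad = np.array([a * w[0], b * w[1]])
    w = w - lr * grad
    print(step, w)          # w[0] flips sign each step; w[1] shrinks slowly
```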
Saddle Points & Local Minima:
- SGD can stall at saddle points or get trapped in local minima, because a plain gradient step has no momentum to carry it through regions where the gradient is tiny or zero (see the sketch below).
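A toy sketch of getting stuck, using the textbook saddle f(x, y) = x^2 - y^2; the starting point and learning rate are assumptions chosen so the iterate sits exactly on the flat direction:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle point at (0, 0).
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 0.0])    # start on the x-axis, where the y-gradient is 0
lr = 0.1
for _ in range(100):
    p = p - lr * grad(p)

print(p)  # ~[0, 0]: stuck at the saddle, even though lower loss exists along y
```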
Stochastic Noise:
- Small-batch updates give noisy gradient estimates, and that noise slows down convergence (see the sketch below).
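A rough sketch of gradient noise from small batches, on a made-up 1-D least-squares problem; the data, batch size, and random seed are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: per-sample loss 0.5 * (w * x_i - y_i)^2,
# so the per-sample gradient is (w * x_i - y_i) * x_i.
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.5, size=1000)
w = 0.0

def batch_grad(idx):
    return np.mean((w * x[idx] - y[idx]) * x[idx])

full = batch_grad(np.arange(len(x)))                    # full-batch gradient
small = [batch_grad(rng.choice(len(x), size=8)) for _ in range(5)]

print("full-batch gradient:   ", round(full, 3))
print("batch-size-8 estimates:", np.round(small, 3))    # scattered around it
```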
5. Momentum
- Momentum adds a velocity term to SGD, giving each update a "push" from past gradients.
- This helps the iterate roll past small bumps and saddle points.
- It also smooths out gradient noise over time (see the sketch below).
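A minimal sketch of the classical (heavy-ball) momentum update, run on the same assumed ill-conditioned quadratic as above; the learning rate and momentum coefficient are illustrative choices, not values from these notes:

```python
import numpy as np

a, b = 50.0, 1.0
w = np.array([1.0, 1.0])
v = np.zeros_like(w)
lr, mu = 0.01, 0.9          # assumed step size and momentum coefficient

for step in range(50):
    grad = np.array([a * w[0], b * w[1]])
    v = mu * v - lr * grad  # velocity: exponentially weighted sum of gradients
    w = w + v               # the "push": move along the accumulated velocity

print(w)  # both coordinates near 0; plain SGD at lr=0.01 would still have w[1] ~ 0.6
```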