Gradient Descent & Newton's Method Notes
1. Gradient Descent Basics
The update rule for Gradient Descent is:
\[
w_{t+1} = w_t - \alpha \cdot g(w_t)
\]
where:
- \( g(w_t) \) = gradient of the loss function w.r.t. weights.
- \( \alpha \) = learning rate.
- Early idea: set \( \alpha = 1/t \) (where \( t \) is the iteration count), but this is very slow in practice.
- The minus sign means each step of \( -\alpha \, g(w_t) \) moves against the gradient, so for a sufficiently small \( \alpha \) the loss decreases at every update.
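A minimal sketch of the update in Python (NumPy); the quadratic loss and its gradient `loss_grad`, the starting point, and the step count are illustrative assumptions, not part of the notes above:

```python
import numpy as np

def loss_grad(w):
    # Hypothetical quadratic loss L(w) = ||w - 3||^2, so the gradient is 2 * (w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w0, alpha=0.1, steps=100):
    """Plain gradient descent: w_{t+1} = w_t - alpha * g(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * loss_grad(w)  # step against the gradient
    return w

print(gradient_descent(np.zeros(2)))  # converges toward [3., 3.]
```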
2. AdaGrad (Adaptive Gradient)
AdaGrad adapts the learning rate individually for each parameter by scaling with past squared gradients.
Update rule:
\[
w_{t+1} = w_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g(w_t)
\]
where:
- \( G_t \) = sum of squared past gradients.
- \( \epsilon \) = small number to avoid division by zero.
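A sketch of the AdaGrad update on the same hypothetical quadratic loss as above; the accumulator `G` holds the per-parameter sum of squared gradients, and the specific hyperparameter values are assumptions for illustration:

```python
import numpy as np

def adagrad(w0, alpha=0.5, eps=1e-8, steps=200):
    """AdaGrad: per-parameter step size alpha / sqrt(G_t + eps)."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                   # running sum of squared gradients
    for _ in range(steps):
        g = 2.0 * (w - 3.0)                # gradient of the hypothetical quadratic loss
        G += g ** 2                        # accumulate squared gradient per parameter
        w = w - alpha / np.sqrt(G + eps) * g
    return w

print(adagrad(np.zeros(2)))  # approaches [3., 3.], with effective steps shrinking over time
```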
3. Newton's Method (Speeding Up Near Minima)
To accelerate convergence in flat regions, Newton's Method uses second-order curvature information. Update rule:
\[
w_{t+1} = w_t - H_t^{-1} \cdot g(w_t)
\]
where:
- \( H_t \) = Hessian (matrix of second partial derivatives) of the loss at \( w_t \).
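A minimal sketch of the Newton update in Python (NumPy), using a hypothetical 2-D quadratic loss whose Hessian is constant; the matrix `A` and vector `b` are illustrative assumptions:

```python
import numpy as np

def newton_method(w0, steps=10):
    """Newton's method: w_{t+1} = w_t - H^{-1} g, scaling the step by the curvature."""
    A = np.array([[3.0, 0.5], [0.5, 1.0]])  # hypothetical loss 0.5 * w^T A w - b^T w
    b = np.array([1.0, 2.0])
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = A @ w - b                        # gradient of the quadratic loss
        H = A                                # Hessian is constant for a quadratic
        w = w - np.linalg.solve(H, g)        # solve H * step = g rather than inverting H
    return w

print(newton_method(np.zeros(2)))  # reaches the minimizer A^{-1} b in a single step for a quadratic
```

For a quadratic loss the method lands on the minimum in one step; for general losses the Hessian is recomputed at each iterate.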