Gradient Descent & Newton's Method Notes
1. Gradient Descent Basics
The update rule for Gradient Descent is:
\[
w_{t+1} = w_t - \alpha \cdot g(w_t)
\]
where:
- \( g(w_t) \) = gradient of the loss function w.r.t. weights.
- \( \alpha \) = learning rate.
- Early idea: set \( \alpha = 1/t \) (where \( t \) is the iteration count), but this is very slow in practice.
- The minus sign means each step of \( -\alpha \, g(w_t) \) moves against the gradient, so for a sufficiently small \( \alpha \) the loss decreases at every update.
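A minimal sketch of the update in Python (NumPy); the quadratic loss and its gradient `loss_grad`, the starting point, and the step count are illustrative assumptions, not part of the notes above:

```python
import numpy as np

def loss_grad(w):
    # Hypothetical quadratic loss L(w) = ||w - 3||^2, so the gradient is 2 * (w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w0, alpha=0.1, steps=100):
    """Plain gradient descent: w_{t+1} = w_t - alpha * g(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - alpha * loss_grad(w)  # step against the gradient
    return w

print(gradient_descent(np.zeros(2)))  # converges toward [3., 3.]
```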
2. AdaGrad (Adaptive Gradient)
AdaGrad adapts the learning rate individually for each parameter by scaling with past squared gradients.
Update rule:
\[
w_{t+1} = w_t - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot g(w_t)
\]
where:
- \( G_t \) = sum of squared past gradients.
- \( \epsilon \) = small number to avoid division by zero.
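A sketch of the AdaGrad update on the same hypothetical quadratic loss as above; the accumulator `G` holds the per-parameter sum of squared gradients, and the specific hyperparameter values are assumptions for illustration:

```python
import numpy as np

def adagrad(w0, alpha=0.5, eps=1e-8, steps=200):
    """AdaGrad: per-parameter step size alpha / sqrt(G_t + eps)."""
    w = np.asarray(w0, dtype=float)
    G = np.zeros_like(w)                   # running sum of squared gradients
    for _ in range(steps):
        g = 2.0 * (w - 3.0)                # gradient of the hypothetical quadratic loss
        G += g ** 2                        # accumulate squared gradient per parameter
        w = w - alpha / np.sqrt(G + eps) * g
    return w

print(adagrad(np.zeros(2)))  # approaches [3., 3.], with effective steps shrinking over time
```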
3. Newton's Method (Speeding Up Near Minima)
To accelerate convergence in flat regions, Newton's Method uses second-order curvature information. Update rule:
\[
w_{t+1} = w_t - H_t^{-1} \cdot g(w_t)
\]
where:
- \( H_t \) = Hessian (matrix of second partial derivatives) of the loss at \( w_t \).
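A minimal sketch of the Newton update in Python (NumPy), using a hypothetical 2-D quadratic loss whose Hessian is constant; the matrix `A` and vector `b` are illustrative assumptions:

```python
import numpy as np

def newton_method(w0, steps=10):
    """Newton's method: w_{t+1} = w_t - H^{-1} g, scaling the step by the curvature."""
    A = np.array([[3.0, 0.5], [0.5, 1.0]])  # hypothetical loss 0.5 * w^T A w - b^T w
    b = np.array([1.0, 2.0])
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = A @ w - b                        # gradient of the quadratic loss
        H = A                                # Hessian is constant for a quadratic
        w = w - np.linalg.solve(H, g)        # solve H * step = g rather than inverting H
    return w

print(newton_method(np.zeros(2)))  # reaches the minimizer A^{-1} b in a single step for a quadratic
```

For a quadratic loss the method lands on the minimum in one step; for general losses the Hessian is recomputed at each iterate.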