I would like to understand the difference between SGD and GD on the easiest example of function: y=x**2
The function of GD is here:
def gradient_descent(
gradient, start, learn_rate, n_iter=50, tolerance=1e-06
):
vector = start
for _ in range(n_iter):
diff = -learn_rate * gradient(vector)
if np.all(np.abs(diff) <= tolerance):
break
vector += diff
return vector
And in order to find the minimum of x**2 function we should do next (the answer is almost 0, which is correct):
gradient_descent(gradient=lambda v: 2 * x, start=10.0, learn_rate=0.2)
How i understood, in the classical GD the gradient is calculated precisely from all the data points. What is "all data points" in the implementation that i showed above?
And further, how we should modernize this function in order to call it SGD (SGD uses one single data point for calculating gradient. Where is "one single point" in the gradient_descent
function?)