Questions tagged [gradient-descent]

Gradient Descent is an algorithm for finding a minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and takes steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Wiki:

Gradient descent is a first-order iterative optimization algorithm. It is used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
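
As a minimal sketch of that update rule (not a definitive implementation), the snippet below minimizes an illustrative function f(x, y) = (x - 3)^2 + (y + 1)^2 in plain Python. The function, the `grad_f` and `gradient_descent` helpers, the step size of 0.1, and the number of steps are all assumptions chosen for the example.

```python
def grad_f(point):
    """Gradient of the illustrative function f(x, y) = (x - 3)**2 + (y + 1)**2."""
    x, y = point
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


def gradient_descent(grad, start, learning_rate=0.1, num_steps=200):
    """Repeatedly step against the gradient, scaled by learning_rate."""
    point = list(start)
    for _ in range(num_steps):
        g = grad(point)
        # Move each coordinate a small distance in the downhill direction.
        point = [p - learning_rate * g_i for p, g_i in zip(point, g)]
    return point


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print(minimum)  # should end up close to [3.0, -1.0]
```

With a step size that is too large the iterates can overshoot and diverge, so the learning rate usually has to be tuned for the function at hand.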

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
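
To illustrate the contrast, here is a rough sketch that fits a straight line to toy data in two ways: analytically, with the closed-form least-squares solution, and iteratively, with gradient descent on the mean squared error. The data, the helper names `closed_form_fit` and `gradient_descent_fit`, and the hyperparameters are assumptions made for the example. For a problem this small the analytical route is simpler; gradient descent becomes useful when no such closed form is available or when it is too expensive to compute.

```python
# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def closed_form_fit(xs, ys):
    """Ordinary least squares for a single feature, solved analytically."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def gradient_descent_fit(xs, ys, learning_rate=0.05, num_steps=5000):
    """Minimize mean squared error over (slope, intercept) by gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(num_steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2.0 / n) * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


analytic = closed_form_fit(xs, ys)
iterative = gradient_descent_fit(xs, ys)
print(analytic, iterative)  # both should be close to (2.0, 1.0)
```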

Gradient descent is also known as steepest descent, or the method of steepest descent.


Tag usage:

Questions tagged [gradient-descent] should be about implementation and programming problems, not about the theoretical properties of the optimization algorithm. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.



1428 questions
331
votes
7 answers

Why do we need to call zero_grad() in PyTorch?

Why does zero_grad() need to be called during training? | zero_grad(self) | Sets gradients of all model parameters to zero.
user1424739
  • 11,937
  • 17
  • 63
  • 152
166
votes
6 answers

pytorch - connection between loss.backward() and optimizer.step()

Where is the explicit connection between the optimizer and the loss? How does the optimizer know where to get the gradients of the loss without a call like optimizer.step(loss)? -More context- When I minimize the loss, I didn't have to pass the…
aerin
  • 20,607
  • 28
  • 102
  • 140
139
votes
4 answers

Pytorch, what are the gradient arguments

I am reading through the documentation of PyTorch and found an example where they write gradients = torch.FloatTensor([0.1, 1.0, 0.0001]) y.backward(gradients) print(x.grad) where x was an initial variable, from which y was constructed (a…
Qubix
  • 4,161
  • 7
  • 36
  • 73
125
votes
8 answers

Why should weights of Neural Networks be initialized to random numbers?

I am trying to build a neural network from scratch. Across all AI literature there is a consensus that weights should be initialized to random numbers in order for the network to converge faster. But why are neural networks initial weights…
123
votes
6 answers

Common causes of nans during training of neural networks

I've noticed that a frequent occurrence during training is NaNs being introduced. Oftentimes they seem to be introduced by weights in inner-product/fully-connected or convolution layers blowing up. Is this occurring because the gradient computation…
103
votes
4 answers

How to do gradient clipping in pytorch?

What is the correct way to perform gradient clipping in pytorch? I have an exploding gradients problem.
Gulzar
  • 23,452
  • 27
  • 113
  • 201
82
votes
10 answers

Neural network always predicts the same class

I'm trying to implement a neural network that classifies images into one of the two discrete categories. The problem is, however, that it currently always predicts 0 for any input and I'm not really sure why. Here's my feature extraction method: def…
76
votes
5 answers

What is the difference between Gradient Descent and Newton's Gradient Descent?

I understand what Gradient Descent does. Basically it tries to move towards the local optimal solution by slowly moving down the curve. I am trying to understand what is the actual difference between the plain gradient descent and the Newton's…
75
votes
4 answers

why gradient descent when we can solve linear regression analytically

What is the benefit of using Gradient Descent in the linear regression space? It looks like we can solve the problem (finding the theta0-n that minimize the cost function) with an analytical method, so why would we still want to use gradient descent to do the same…
John
  • 2,107
  • 3
  • 22
  • 39
65
votes
5 answers

gradient descent using python and numpy

def gradient(X_norm,y,theta,alpha,m,n,num_it): temp=np.array(np.zeros_like(theta,float)) for i in range(0,num_it): h=np.dot(X_norm,theta) #temp[j]=theta[j]-(alpha/m)*( np.sum( (h-y)*X_norm[:,j][np.newaxis,:] ) ) …
56
votes
4 answers

Why do we need to explicitly call zero_grad()?

Why do we need to explicitly zero the gradients in PyTorch? Why can't gradients be zeroed when loss.backward() is called? What scenario is served by keeping the gradients on the graph and asking the user to explicitly zero the gradients?
Wasi Ahmad
  • 35,739
  • 32
  • 114
  • 161
53
votes
5 answers

pytorch how to set .requires_grad False

I want to set some of my model frozen. Following the official docs: with torch.no_grad(): linear = nn.Linear(1, 1) linear.eval() print(linear.weight.requires_grad) But it prints True instead of False. If I want to set the model in eval…
Qian Wang
  • 764
  • 2
  • 7
  • 13
53
votes
4 answers

What is the difference between SGD and back-propagation?

Can you please tell me the difference between Stochastic Gradient Descent (SGD) and back-propagation?
47
votes
1 answer

Sklearn SGDClassifier partial fit

I'm trying to use SGD to classify a large dataset. As the data is too large to fit into memory, I'd like to use the partial_fit method to train the classifier. I have selected a sample of the dataset (100,000 rows) that fits into memory to test fit…
David M.
  • 4,518
  • 2
  • 20
  • 25
43
votes
5 answers

How to calculate optimal batch size?

Sometimes I run into a problem: OOM when allocating tensor with shape, e.g. OOM when allocating tensor with shape (1024, 100, 160), where 1024 is my batch size and I don't know what the rest is. If I reduce the batch size or the number of neurons in…