
My teammates and I are trying to code up an implementation of gradient descent, and I think we're pretty close.

We've attempted to follow the steps from the first answer to this question, namely:

1. Calculate the hypothesis h = X * theta

2. Calculate the loss = h - y and maybe the squared cost (loss^2)/2m

3. Calculate the gradient = X' * loss / m

4. Update the parameters theta = theta - alpha * gradient

But as you can see from the missing code there, we're at a bit of a loss as to how to calculate the gradient. Have we set it up correctly?

How do we execute that calculation?

What is the difference between X' and X?
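
For context, here is a literal element-wise translation of those four steps as we currently understand them (X, y, theta, alpha and m are placeholder names, and this is only our reading of the steps, so it may well be wrong):

  // Sketch of one gradient-descent iteration, assuming:
  //   X     - m x n feature matrix (double[m][n])
  //   y     - m target values (double[m])
  //   theta - n weights (double[n]), alpha - learning rate
  static void gradientDescentStep(double[][] X, double[] y, double[] theta, double alpha)
  {
    int m = X.length;
    int n = theta.length;

    // Step 1: hypothesis h = X * theta
    double[] h = new double[m];
    for (int p = 0; p < m; p++)
      for (int i = 0; i < n; i++)
        h[p] += X[p][i] * theta[i];

    // Step 2: loss = h - y
    double[] loss = new double[m];
    for (int p = 0; p < m; p++)
      loss[p] = h[p] - y[p];

    // Step 3: gradient = X' * loss / m  (X' being the transpose of X)
    double[] gradient = new double[n];
    for (int i = 0; i < n; i++)
    {
      for (int p = 0; p < m; p++)
        gradient[i] += X[p][i] * loss[p];
      gradient[i] /= m;
    }

    // Step 4: update all parameters at once: theta = theta - alpha * gradient
    for (int i = 0; i < n; i++)
      theta[i] -= alpha * gradient[i];
  }

Our actual attempt so far is below: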

  double loss, cost, hypothesis;
  int p, iteration;

  iteration = 0;
  do 
  {
    iteration++;
    cost = 0.0;
    //loop through all instances (complete one epoch)
    for (p = 0; p < number_of_files__train; p++) 
    {
      hypothesis = calculateHypothesis( weights, feature_matrix__train, p, globo_dict_size );

      loss = outputs__train[p] - hypothesis;

      for (int i = 0; i < globo_dict_size; i++) 
      {
        weights[i] += LEARNING_RATE * loss * feature_matrix__train[p][i] * calculateGradient( weights, i, number_of_files__train, loss );
      }

      //summation of squared error (error value for all instances)
      cost += (loss*loss);
    }
    cost = cost/(2 * number_of_files__train);

  } 
  while (cost != 0 && iteration <= MAX_ITER);


}

static double calculateHypothesis( double weights[], double[][] feature_matrix, int file_index, int globo_dict_size )
{
  // m denotes the number of examples here, not the number of features

  double sum = 0.0;

  for (int i = 0; i < globo_dict_size; i++) 
  {
    sum += ( weights[i] * feature_matrix[file_index][i] );
  }
  //bias
  sum += weights[ globo_dict_size ];

  return sigmoid(sum);
}

private static double sigmoid(double x)
{
    return 1 / (1 + Math.exp(-x));
}

static double calculateGradient( double weights[], int i, int number_of_files__train, double loss )
{
  return weights[i] * loss / number_of_files__train;
}
  • X is a matrix of inputs, whereas X' is the transpose of that matrix, viz. if X is a row matrix of the inputs, X' is a column matrix of those same inputs. From the example you linked x is X, and xTrans is X'. You can see that they generate xTrans by performing the transpose() operation on x. – Andrew Domaszek Mar 08 '15 at 06:25
  • then why the hell would it matter if I do operations on the weights as a column or as a row? It's just a 1D array, why does that matter at all? –  Mar 08 '15 at 06:26
  • In matrix algebra, the dimension of the array makes a lot of difference when performing dot and cross products. – Andrew Domaszek Mar 08 '15 at 06:28
  • but for a 1D array? It's the same, isn't it? –  Mar 08 '15 at 06:30
  • In the scalar case, it's the same thing. But to be a valid gradient descent algorithm you can't incrementally update theta parameters one partial derivative at a time; you must calculate all partials, then update all thetas at once. calculateGradient should produce a vector the same dimension and rank as theta (weights), per gradient[i] = weights[i] * loss/m; where m is the rank of weights vector. – Andrew Domaszek Mar 08 '15 at 06:57
  • what do you mean by rank? the index? –  Mar 08 '15 at 08:27
  • [Matrix rank](http://en.wikipedia.org/wiki/Rank_%28linear_algebra%29), the number of elements in the vector. – Andrew Domaszek Mar 08 '15 at 08:30
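
To make the last comment concrete, one way to restructure the inner loop so that all partial derivatives are computed first and the weights are only updated afterwards might look like the sketch below (names are taken from the code above, but this is only an illustration of the comment, not a verified fix; the bias weight would need the same treatment):

  // Sketch: accumulate every partial derivative for one epoch, then update all weights at once.
  double[] gradient = new double[globo_dict_size];

  for (int p = 0; p < number_of_files__train; p++)
  {
    double hypothesis = calculateHypothesis( weights, feature_matrix__train, p, globo_dict_size );
    double loss = hypothesis - outputs__train[p];   // h - y, matching step 2 above

    for (int i = 0; i < globo_dict_size; i++)
    {
      // partial derivative for weight i, summed over all training examples
      gradient[i] += feature_matrix__train[p][i] * loss;
    }
  }

  // theta = theta - alpha * gradient / m, applied to every weight in one step
  for (int i = 0; i < globo_dict_size; i++)
  {
    weights[i] -= LEARNING_RATE * gradient[i] / number_of_files__train;
  }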

0 Answers