So I'm a newbie to machine learning and have been trying to implement gradient descent. My code seems right (I think), but it doesn't converge to the global optimum.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def AddOnes(matrix):
    # Prepend a column of ones (bias term) to the feature matrix
    one = np.ones((matrix.shape[0], 1))
    X_bar = np.concatenate((one, matrix), axis=1)
    return X_bar


# Load data
df = pd.read_excel("Book1.xlsx", header=3)
X = np.array([df['Height']]).T
y = np.array([df['Weight']]).T

m = X.shape[0]
n = X.shape[1]
iterations = 30

# Build X_bar
X = AddOnes(X)

# Gradient descent
alpha = 0.00003  # learning rate
w = np.ones((n+1, 1))
for i in range(iterations):
    h = np.dot(X, w)                     # predictions
    w -= alpha/m * np.dot(X.T, h - y)    # batch gradient step

print(w)

# Endpoints of the fitted line for plotting
x0 = np.linspace(145, 185, 2)
y0 = np.dot(AddOnes(x0.reshape(-1, 1)), w)

# Visualizing
plt.plot(X[:, 1], y, 'ro')  # scatter the raw heights (column 0 is the bias)
plt.plot(x0, y0)
plt.axis([140, 190, 40, 80])
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()

[Plot: visualizing data]

  • What's the question? There is no guarantee that GD will converge to the global optimum. – cheersmate Nov 06 '18 at 13:41
  • Getting to the global optimum requires you to tune two hyperparameters: the learning rate (alpha) and the number of iterations. Have you done this? (A sketch follows these comments.) – Dr. Snoopy Nov 06 '18 at 13:49
  • I think with just 2 features there should be just 1 optimum, shouldn't there? Or am I wrong? – Ming Zheng Nov 06 '18 at 13:56
  • There can be many local optima depending on the data. Consider the function sin(x)*sin(y), for instance. – John Nov 06 '18 at 15:25
  • It may help to start by fitting two data points to a straight line, and after that works, try something more complex. – James Phillips Nov 06 '18 at 16:57
  • Possible duplicate of [gradient descent using python and numpy](https://stackoverflow.com/questions/17784587/gradient-descent-using-python-and-numpy) – Itamar Mushkin Jun 30 '19 at 05:28
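
Following up on the tuning comment above: a minimal sketch, using synthetic heights and weights as stand-ins for Book1.xlsx (the real file isn't available here), showing that standardizing the feature lets a much larger learning rate converge in roughly a hundred iterations.

import numpy as np

rng = np.random.default_rng(0)
heights = rng.uniform(145, 185, size=(50, 1))               # cm, synthetic
weights = 0.6 * heights - 40 + rng.normal(0, 3, (50, 1))    # kg, synthetic

# Standardize the feature to zero mean, unit variance
mu, sigma = heights.mean(), heights.std()
Xs = (heights - mu) / sigma
X_bar = np.concatenate((np.ones((Xs.shape[0], 1)), Xs), axis=1)

m = X_bar.shape[0]
alpha = 0.1                  # usable now that the feature is scaled
w = np.zeros((2, 1))
for _ in range(100):
    w -= alpha / m * np.dot(X_bar.T, np.dot(X_bar, w) - weights)

print(w)  # settles on the least-squares line within ~100 steps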

1 Answer

You are using linear regression with a single neuron. A single neuron can only learn a straight line, irrespective of the dataset you provide: W acts as the slope, and your network has learned the optimal W for your X such that WX gives minimal error.
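
One quick way to verify this: least-squares linear regression has a closed-form optimum, so you can compare the gradient-descent weights against it directly. A small sketch, assuming X (with its bias column), y, and w come from the question's code:

import numpy as np

# w_opt is the exact least-squares solution for the same design matrix
w_opt, *_ = np.linalg.lstsq(X, y, rcond=None)
print("closed form:     ", w_opt.ravel())
print("gradient descent:", w.ravel())
# With a well-tuned alpha and enough iterations, the two nearly match;
# no straight line achieves a lower squared error than w_opt.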

The scatter plot (red dots) of the output shows your dataset values. You can observe that the dataset is not linear, so no matter how many iterations you train, a single straight line will never pass through all the points. The learned function is still optimal, though: among straight lines, it is the one with minimal error.

So I recommend using multiple layers with non-linear activations such as ReLU or sigmoid, and a linear activation at the output, since you are predicting a real number.
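
As a concrete illustration of that suggestion, here is a minimal sketch using scikit-learn's MLPRegressor; the hidden-layer size (16) is an arbitrary choice, and X_raw is assumed to be the (m, 1) array of raw heights from the question (its X before AddOnes):

from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale inputs, then fit a one-hidden-layer ReLU network with a
# linear output (MLPRegressor's default for regression).
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), activation='relu',
                 max_iter=5000, random_state=0),
)
model.fit(X_raw, y.ravel())        # X_raw: (m, 1) heights, y: weights
print(model.predict([[170.0]]))    # predicted weight for a 170 cm height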