
I'm banging my head trying to figure out what I'm doing wrong here:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

# Determine slope and intercept
area = np.array(df['area'])
rooms = np.array(df['rooms'])
balcony = np.array(df['balcony'])
age = np.array(df['age'])
price = np.array(df['price'])

features = np.array(df[['area', 'rooms', 'balcony', 'age']])

rows, cols = features.shape

limit = 1000
learn = 0.00001
slope = np.zeros((cols,))
intercept = 0
history = []

for i in range(limit):

    residual = (np.dot(features, slope) + intercept) - price

    derivative_of_slope = np.zeros((cols,))
    derivative_of_intercept = residual.mean()

    for j in range(cols):
        derivative_of_slope[j] = np.dot(features.take(j, axis=1), residual) # I think the issue is here and I'm overlooking something

    derivative_of_slope /= rows

    # Record half the mean squared error as the cost (not the intercept gradient)
    history.append({'cost': (residual ** 2).mean() / 2, 'intercept': intercept, 'slope': slope})

    slope = slope - learn * derivative_of_slope
    intercept = intercept - learn * derivative_of_intercept


history = pd.DataFrame(history, columns=['cost', 'intercept', 'slope'])
history[-1:]
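
As a sanity check on the line flagged in the loop above, here is a minimal sketch showing that the per-column gradient matches the single matrix product `features.T @ residual / rows`. The arrays are made-up stand-ins for `features` and `price` (not the real data), so this only checks the gradient arithmetic, not the fit itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the question's `features` and `price` arrays.
features = rng.uniform(0.0, 90.0, size=(200, 4))
price = rng.uniform(50_000.0, 300_000.0, size=200)

rows, cols = features.shape
slope = np.zeros(cols)
intercept = 0.0

residual = features @ slope + intercept - price

# Gradient computed column by column, as in the loop above.
loop_gradient = np.zeros(cols)
for j in range(cols):
    loop_gradient[j] = np.dot(features.take(j, axis=1), residual)
loop_gradient /= rows

# The same gradient as a single matrix-vector product.
vectorized_gradient = features.T @ residual / rows

gradients_match = np.allclose(loop_gradient, vectorized_gradient)
print(gradients_match)
```

If the two agree, the gradient step itself is probably not where things go wrong.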

Here is the sample output of the features dataset:
[image: sample of the features dataset]

The issue I'm having is that the 2nd, 3rd, and 4th slope parameters don't converge. I played around with the learning rate and the number of iterations and got somewhat close a few times, but not really. Even the closest slopes I found still gave predictions more than 10k too high.
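
One common cause of this symptom is that the columns live on very different scales (area in the tens, balcony 0 or 1), so a single small learning rate moves some slopes far more slowly than others. Below is a rough sketch, on synthetic data and not the real data.csv, of standardizing the features first and then mapping the parameters back to the original units. The coefficients in `y`, the learning rate, and the iteration count are all assumptions picked for illustration, and `np.linalg.lstsq` is used as the reference instead of sklearn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with very different feature scales (illustrative only).
n = 500
X = np.column_stack([
    rng.uniform(25.0, 90.0, n),   # area-like column
    rng.integers(1, 9, n),        # rooms-like column (1..8)
    rng.integers(0, 2, n),        # balcony-like column (0 or 1)
    rng.integers(0, 71, n),       # age-like column (0..70)
]).astype(float)
true_slope = np.array([2000.0, 10000.0, 3000.0, -500.0])  # made-up values
y = X @ true_slope + rng.normal(0.0, 1000.0, n)

# Standardize each column to zero mean and unit variance.
mean = X.mean(axis=0)
std = X.std(axis=0)
Z = (X - mean) / std

slope_z = np.zeros(X.shape[1])
intercept_z = 0.0
learn = 0.1  # a much larger step is stable once the features are scaled

for _ in range(2000):
    residual = Z @ slope_z + intercept_z - y
    slope_z = slope_z - learn * (Z.T @ residual) / n
    intercept_z = intercept_z - learn * residual.mean()

# Map the parameters back to the original feature units.
slope = slope_z / std
intercept = intercept_z - np.dot(slope, mean)

# Reference: closed-form least squares on the raw features plus a ones column.
A = np.column_stack([X, np.ones(n)])
reference, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.append(slope, intercept))
print(reference)
```

On the scaled features all four slopes converge at a similar rate, which is why one learning rate works for every parameter.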

Example of my determined slope & intercept:
[image: slope and intercept determined by the code above]

And the slope as determined by scikit-learn:
[image: slope determined by scikit-learn]

Code used to generate the dataset (it writes a data.csv file):

import csv
import random

n = 1000
avg_price_per_m2 = 1500

# newline='' is what the csv module expects when writing
with open('data.csv', 'w', newline='') as dataset:
    writer = csv.DictWriter(dataset, ['area', 'rooms', 'balcony', 'age', 'price'])
    writer.writeheader()

    for i in range(n):
        area_in_m2 = round(random.uniform(25.00, 90.00), 2)

        number_of_rooms = random.randint(1, 8)
        has_balcony = random.randint(0, 1)
        age = random.randint(0, 70)

        # Base price
        price_per_m2 = avg_price_per_m2 + random.randint(50, 350)

        # Increase base price between 100, 300 for each room
        price_per_m2 += random.randint(100, 300) * number_of_rooms

        # Increase price by 3-8% if the house has a balcony
        if has_balcony:
            price_per_m2 *= random.uniform(1.03, 1.08)

        # Decrease price by 0.5% for each year
        price_per_m2 *= 1 - 0.005 * age

        price = round(area_in_m2 * price_per_m2, 2)

        row = {'area': area_in_m2, 'rooms': number_of_rooms, 'balcony': has_balcony, 'age': age, 'price': price}
        writer.writerow(row)
Sterling Duchess
  • Please post an actual minimal working example. :( see https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Dominik Stańczak Aug 04 '22 at 11:21
  • @DominikStańczak I've included the code that creates the sample dataset. It's now a fully working example. – Sterling Duchess Aug 04 '22 at 11:35
  • Did you plot the residuals, slope parameters and gradients over training time? Usually this can give insight about where it's going wrong. – André Aug 04 '22 at 12:03
  • @André thanks, I completely forgot. I plotted the values and it seems my learning rate and iteration count were poorly picked. I'm now getting values much closer to what sklearn has. – Sterling Duchess Aug 04 '22 at 13:33
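
For anyone landing here with the same problem, here is a minimal sketch of the diagnostic suggested in the comments: record the cost and the gradient norm at every step so a badly chosen learning rate shows up right away. The helper name `gradient_descent_with_history`, the one-column synthetic data, and both learning rates are assumptions for illustration, not the original code or data.

```python
import numpy as np

def gradient_descent_with_history(X, y, learn, limit):
    """Run batch gradient descent and record diagnostics at every step."""
    rows, cols = X.shape
    slope = np.zeros(cols)
    intercept = 0.0
    history = []
    for step in range(limit):
        residual = X @ slope + intercept - y
        grad_slope = X.T @ residual / rows
        grad_intercept = residual.mean()
        history.append({
            "step": step,
            "cost": float((residual ** 2).mean() / 2.0),
            "grad_norm": float(np.sqrt(np.sum(grad_slope ** 2) + grad_intercept ** 2)),
        })
        slope = slope - learn * grad_slope
        intercept = intercept - learn * grad_intercept
    return slope, intercept, history

rng = np.random.default_rng(1)
X = rng.uniform(25.0, 90.0, size=(300, 1))   # one area-like column
y = 2000.0 * X[:, 0] + rng.normal(0.0, 1000.0, 300)

_, _, stable = gradient_descent_with_history(X, y, learn=1e-5, limit=200)
_, _, unstable = gradient_descent_with_history(X, y, learn=1e-3, limit=200)

print(stable[0]["cost"], stable[-1]["cost"])      # cost shrinks
print(unstable[0]["cost"], unstable[-1]["cost"])  # cost blows up
```

The `history` list can be passed to `pd.DataFrame(history)` and plotted per column. A cost that grows from step to step points at the learning rate, while a cost that is still falling at the last step points at too few iterations.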
