
I have a problem solving for x and y using multiple equations. I have different data points (in # of frames), as follows:

  • Group 1: 1003, 145, 1344, 66, 171, 962

  • Group 2: 602, 140, 390, 1955, 289, 90

I have total hours as follows:

  • Total Hours: 1999, 341, 1151, 2605, 568, 864

I have set these up in different equations like this:

1003x + 602y = 1999
145x + 140y = 341
and so on.

I would like to find the optimal values for x and y that make all equations as close to true as can be.
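This boils down to an ordinary least-squares problem, which can also be posed directly with NumPy. Here is a minimal sketch using the data above; np.linalg.lstsq finds the x and y that minimize the total squared error across all six equations:

import numpy as np

g1 = np.array([1003, 145, 1344, 66, 171, 962], dtype=float)
g2 = np.array([602, 140, 390, 1955, 289, 90], dtype=float)
hours = np.array([1999, 341, 1151, 2605, 568, 864], dtype=float)

# one row per equation: g1*x + g2*y = hours
A = np.column_stack([g1, g2])
(x, y), *_ = np.linalg.lstsq(A, hours, rcond=None)
print(x, y)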


I tried a linear regression in Python on the data, but I am unsure whether I am going down the right road or not.

Here is my code in Python:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dataset = pd.read_csv(r"C:\Users\path\to\.csv")

X = dataset[['Group 1 Frames', 'Group 2 Frames']]
y = dataset['Total Hours']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# one fitted coefficient per frames column
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)

Now this gives me two coefficients, 1.3007 and 1.2314. After calculating the Mean Absolute Error and the Mean Squared Error, though, the errors suggest those numbers are inaccurate and unusable.
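A minimal sketch of how those errors can be computed with sklearn.metrics, reusing the variables from the code above:

from sklearn.metrics import mean_absolute_error, mean_squared_error

# evaluate on the held-out test split from the train_test_split above
y_pred = regressor.predict(X_test)
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))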

Is there a way to more accurately calculate the desired x and y values?


My thoughts as to the error:

  1. My method (I am very new to Python and data analysis like this, so I'd bet heavily on this one)
  2. Lack of data points (I can collect more)
  3. x and y don't have a great relationship with Total Hours, hence the high error
MicBalla
  • You allow for a constant in your equation, yes? I.e., `g1*x + g2*y + C = h_t`. – rickhg12hs Jun 21 '20 at 12:59
  • Would I account for a constant by adding in a column of 0's or something like that? – MicBalla Jun 22 '20 at 15:28
  • `LinearRegression()` will calculate the intercept. I included the calculated intercept (constant) in [my answer](https://stackoverflow.com/a/62419725/1409374). – rickhg12hs Jun 22 '20 at 16:46
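Regarding the constant raised in the comments above: in a raw least-squares setup the constant corresponds to a column of ones (not zeros) in the design matrix; scikit-learn's LinearRegression handles this automatically via its fit_intercept=True default. A minimal sketch, extending the earlier lstsq example:

import numpy as np

g1 = np.array([1003, 145, 1344, 66, 171, 962], dtype=float)
g2 = np.array([602, 140, 390, 1955, 289, 90], dtype=float)
hours = np.array([1999, 341, 1151, 2605, 568, 864], dtype=float)

# the ones column is the constant C in g1*x + g2*y + C = h_t
A = np.column_stack([g1, g2, np.ones_like(g1)])
(x, y, C), *_ = np.linalg.lstsq(A, hours, rcond=None)
print(x, y, C)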

2 Answers


I have tried your example. Here are my observations and suggestions.

1) You are not training your model on enough data. I tried inserting a couple of random data points into your DataFrame, and the score shifted from -27.xx to -0.10. This shows that you need more training data.

2) Use a scaler (like StandardScaler) to scale your data points before calling .fit on the regressor. This transforms each feature to have a mean of 0 and a standard deviation of 1.

After applying 1 and 2 above, I got a score of -0.10xx (far better than the initial -27.xx) and coefficients of 282.974346 and 759.690447 for Group 1 and Group 2 respectively. (The score here is R², so a negative value means the model still fits worse than simply predicting the mean.)

Here is the code I tried, for your reference. Note that it includes the dummy data I randomly inserted (the last four data points in each group):

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# the original six data points plus four dummy values at the end of each array
g1 = np.array([1003, 145, 1344, 66, 171, 962, 100, 200, 300, 400])
g2 = np.array([602, 140, 390, 1955, 289, 90, 80, 170, 245, 380])
th = np.array([1999, 341, 1151, 2605, 568, 864, 1000, 300, 184, 411])

dataset = pd.DataFrame({'Group 1 Frames': g1, 'Group 2 Frames': g2, 'Total Hours': th})

X = dataset[['Group 1 Frames', 'Group 2 Frames']]
y = dataset['Total Hours']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=0)

# bundle scaling and regression so both are fit in one step
pipeline = Pipeline([('norm', StandardScaler()), ('linreg', LinearRegression())])
pipeline.fit(X_train, y_train)

print(pipeline.score(X_test, y_test))

y_pred = pipeline.predict(X_test)

# plots one line per feature column against the target
fig, ax = plt.subplots()
ax.plot(X_test, y_test, label='Actual')
ax.plot(X_test, y_pred, label='Predicted')
ax.legend()
plt.show()

# note: these coefficients are in scaled units because of the StandardScaler step
coeff_df = pd.DataFrame(pipeline['linreg'].coef_, X.columns, columns=['Coefficient'])
print(coeff_df)

I have also plotted the predicted data against the actual test data:

[plot: predicted vs. actual test data]

Kaustubh Lohani
  • I took some time to study scalers and run through your code. I have a question for you: how is the pipeline's linear regression `coef_` different from `regressor.coef_`? My coefficients from `regressor.coef_` are usable and close to the expected result (1.3xx and 1.2xx), whereas the coefficients from the pipeline are much larger. I am thinking these numbers are different because of the scaling you did? – MicBalla Jun 18 '20 at 15:06
  • Both regression coefficients describe the same fit; the difference is the scaling. Pipeline just bundles all the steps together so that you don't have to fit each of them separately. Scaling transformed the data so that each feature has a mean of 0 and a std of 1. Here is an SO question explaining StandardScaler: https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler#40767144 – Kaustubh Lohani Jun 19 '20 at 09:47
  • I didn't have much time over the weekend to read up on this, but it seems like the right road to go down. Thank you for your help @Zeek. I understand the idea of scaling, but I will have to read into it more and get more hands-on to see exactly how the coefficients scale and how to apply them to my example. – MicBalla Jun 22 '20 at 15:29
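Following up on the scaling exchange in this comment thread: the scaled coefficients can be mapped back to the original frame units. A minimal sketch, assuming the fitted `pipeline` from the answer above:

import numpy as np

norm = pipeline['norm']        # the fitted StandardScaler
linreg = pipeline['linreg']    # the fitted LinearRegression

# undo the standardization: divide by the per-feature scale, adjust the intercept
coef_unscaled = linreg.coef_ / norm.scale_
intercept_unscaled = linreg.intercept_ - np.dot(norm.mean_, coef_unscaled)
print(coef_unscaled, intercept_unscaled)   # comparable to regressor.coef_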

Borrowing heavily from the answer by @Zeek,

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import linear_model

# the original six data points, no dummy values added
g1 = np.array([1003, 145, 1344, 66, 171, 962])
g2 = np.array([602, 140, 390, 1955, 289, 90])
th = np.array([1999, 341, 1151, 2605, 568, 864])

dataset = pd.DataFrame({'Group 1 Frames': g1, 'Group 2 Frames': g2, 'Total Hours': th})
X = dataset[['Group 1 Frames', 'Group 2 Frames']]
y = dataset['Total Hours']

# fit on all six points; LinearRegression fits the intercept (constant) by default
reg = linear_model.LinearRegression()
reg.fit(X, y)

# 3D scatter: actual (blue) vs. predicted (red) Total Hours
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter3D(X['Group 1 Frames'], X['Group 2 Frames'], y, c='blue')
ax.scatter3D(X['Group 1 Frames'], X['Group 2 Frames'], reg.predict(X), c='red')

ax.set_xlabel('Group 1 Frames')
ax.set_ylabel('Group 2 Frames')
ax.set_zlabel('Total Hours')

plt.show()

gives:

In [2]: reg.coef_                                                                                                                                         
Out[2]: array([0.65638179, 1.29127836])

In [3]: reg.intercept_                                                                                                                                    
Out[3]: 104.95400059973235

and:

[3D scatter plot: actual (blue) vs. predicted (red) Total Hours]

... which isn't too bad except for the first and third data samples. Depending on your prediction requirements this may or may not be good enough. Perhaps some of your data needs some "massaging"?
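As a quick sanity check, plugging the first data point into the fitted model with the coefficients and intercept printed above:

# first sample: g1 = 1003, g2 = 602, actual Total Hours = 1999
pred = 0.65638179 * 1003 + 1.29127836 * 602 + 104.95400059973235
print(pred)   # ≈ 1540.7, well short of 1999, hence the poor fit on this sample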

rickhg12hs