I am trying to run the following formula with a data frame and a series.

Let X be a data frame with 3 columns (say, a 100x3 matrix) and let y be a vector (a 100x1 matrix). X:

    X0  sqrfeet  bedrooms   
0   1     2104         3  
1   1     1600         3  
2   1     2400         3  
3   1     1416         2  
4   1     3000         4 

y:

0 20000
1 15000
2 24000
3 12000
4 14000

The formula I want to use is: inv(X'*X)*X'*y

This is the normal equation. Here X' denotes the transpose of X and inv denotes the matrix inverse. The code I used is:

import numpy as np

var = np.linalg.inv(X.T.dot(X))  # inv(X'X)
var2 = var.dot(X.T)              # inv(X'X) X'
final = var2.dot(Y)              # inv(X'X) X' Y, where Y is the price vector y above

Is the above correct?

Here X represents real estate data (house size and number of bedrooms) and Y is the corresponding price.

Benjamin W.
sunny
  • I guess what you want to do is an OLS regression, have a look at this: http://stackoverflow.com/questions/19991445/run-an-ols-regression-with-pandas-data-frame – FLab Feb 13 '17 at 16:58
  • I suppose you're doing this to learn, but just in case: [Don't invert that matrix](https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/). – chthonicdaemon Feb 22 '17 at 17:12
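
The advice in the last comment can be sketched as follows: instead of forming the inverse explicitly, solve the linear system (X'X) theta = X'y. This is a minimal sketch using only the five rows shown in the question; the names theta_inv, theta_solve and theta_lstsq are illustrative and do not come from the original post.

```python
import numpy as np

# The five sample rows from the question: X0 (intercept), sqrfeet, bedrooms.
X = np.array([
    [1, 2104, 3],
    [1, 1600, 3],
    [1, 2400, 3],
    [1, 1416, 2],
    [1, 3000, 4],
], dtype=float)
y = np.array([20000, 15000, 24000, 12000, 14000], dtype=float)

# Normal equation with an explicit inverse, as in the question.
theta_inv = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

# Same linear system, solved without forming the inverse.
theta_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Least-squares solver that works on X directly and never forms X'X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

All three should agree up to floating-point error on a well-conditioned problem like this one; the differences tend to matter when the columns of X are nearly collinear.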

1 Answer

It looks like you want to roll your own OLS estimator for homework or personal development, in which case you're on the right track, but here are a couple of things to bear in mind.

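Pandas DataFrame objects have a method, as_matrix(), which returns the DataFrame's values as a NumPy array. (as_matrix() was deprecated in pandas 0.23 and removed in 1.0; on current versions use to_numpy() or the .values attribute instead.) Non-numeric columns will produce an object-dtype array that the linalg routines cannot work with, but your example above should be fine since all the values are numeric. You can then carry out whatever linalg operations you like on these arrays.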
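
A minimal sketch of the conversion step described above, assuming a recent pandas release. It uses to_numpy(), which replaced as_matrix() in current pandas versions; the names X_df and y_sr are hypothetical, and the data is just the five rows shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame and series built from the rows shown in the question.
X_df = pd.DataFrame({
    "X0": [1, 1, 1, 1, 1],
    "sqrfeet": [2104, 1600, 2400, 1416, 3000],
    "bedrooms": [3, 3, 3, 2, 4],
})
y_sr = pd.Series([20000, 15000, 24000, 12000, 14000])

# Convert to plain float ndarrays before doing linear algebra.
X = X_df.to_numpy(dtype=float)  # shape (5, 3)
y = y_sr.to_numpy(dtype=float)  # shape (5,)

# The question's formula, applied to the converted arrays.
theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
```

Passing dtype=float makes the conversion fail loudly if a column cannot be interpreted as numeric, rather than silently producing an object array.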

Something else to bear in mind is the orientation of your regression design matrix (the variable X in this example). Some texts write the design matrix as a d × n matrix, where d is the number of features and n is the sample size, but the formula inv(X'*X)*X'*y as written requires X to be n × d (one row per sample), which is how your DataFrame is already laid out. The Y matrix is an n × 1 matrix. For the matrix multiplications in the normal equation to work, make sure the dimensions align: X'X is d × d and X'Y is d × 1.
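
The alignment requirement can be checked directly from array shapes. This is a minimal sketch with synthetic data: the sizes n = 100 and d = 3 follow the question, while the random values are placeholders and the variable names are illustrative. The formula in the question needs X with one row per sample, so data stored as d × n has to be transposed first.

```python
import numpy as np

n, d = 100, 3  # sample size, number of features (including the intercept column)

rng = np.random.default_rng(0)  # synthetic placeholder data, for shape checking only
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # (n, d)
Y = rng.normal(size=(n, 1))                                     # (n, 1)

XtX = X.T @ X                      # (d, n) @ (n, d) -> (d, d)
XtY = X.T @ Y                      # (d, n) @ (n, 1) -> (d, 1)
theta = np.linalg.solve(XtX, XtY)  # (d, 1)

# If the design matrix is stored as d x n, the roles of X and X' swap.
X_dn = X.T                                                # (d, n)
theta_from_dn = np.linalg.solve(X_dn @ X_dn.T, X_dn @ Y)  # (d, 1)
```

Printing the .shape of each intermediate is usually the quickest way to find a misaligned product.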

If you need to do a lot of fully-featured linear regressions, you might want to consider an established library, such as StatsModels.

R Hill
  • This is helpful. I tried to run my code using the StatsModels sm.OLS(Y, X) and it gave pretty much the same values for the intercept and coefficients 1 and 2 (I call them theta0, theta1 and theta2). I am brushing up my machine learning. – sunny Feb 14 '17 at 03:20