I have gotten the assignment to analyze a dataset of 1.000+ houses, build a multiple regression model to predict prices and then select the three houses which are the cheapest compared to the predicted price. Other than selecting specifically three houses, there is also the constraint of a "budget" of 7.000.000 total for purchasing the three houses.
I have gotten so far as to develop the regression model as well as calculate the predicted prices and the risiduals and added them to the original dataset. I am however completely stumped as to how to write a code to select the three houses, given the budget restraint and optimizing for highest combined risidual.
Here is my code so far:
### modules
import pandas as pd
import statsmodels.api as sm
### Data import
df = pd.DataFrame({"Zip Code" : [94127, 94110, 94112, 94114],
"Days listed" : [38, 40, 40, 40],
"Price" : [633000, 1100000, 440000, 1345000],
"Bedrooms" : [0, 3, 0, 0],
"Loft" : [1, 0, 1, 1],
"Square feet" : [1124, 2396, 625, 3384],
"Lotsize" : [2500, 1750, 2495, 2474],
"Year" : [1924, 1900, 1923, 1907]})
### Creating LM
y = df["Price"] # dependent variable
x = df[["Zip Code", "Days listed", "Bedrooms", "Loft", "Square feet", "Lotsize", "Year"]]
x = sm.add_constant(x) # adds a constant
lm = sm.OLS(y,x).fit() # fitting the model
# predict house prices
prices = sm.add_constant(x)
### Summary
#print(lm.summary())
### Adding predicted values and risidual values to df
df["predicted"] = pd.DataFrame(lm.predict(prices)) # predicted values
df["risidual"] = df["Price"] - df["predicted"] # risidual values
If anyone has an idea, could you explain to me the steps and give a code example? Thank you very much!