1

I have gotten the assignment to analyze a dataset of 1.000+ houses, build a multiple regression model to predict prices and then select the three houses which are the cheapest compared to the predicted price. Other than selecting specifically three houses, there is also the constraint of a "budget" of 7.000.000 total for purchasing the three houses.

I have gotten so far as to develop the regression model as well as calculate the predicted prices and the risiduals and added them to the original dataset. I am however completely stumped as to how to write a code to select the three houses, given the budget restraint and optimizing for highest combined risidual.

Here is my code so far:

### modules
import pandas as pd
import statsmodels.api as sm


### Data import
df = pd.DataFrame({"Zip Code" : [94127, 94110, 94112, 94114],
                   "Days listed" : [38, 40, 40, 40],
                   "Price" : [633000, 1100000, 440000, 1345000],
                   "Bedrooms" : [0, 3, 0, 0],
                   "Loft" : [1, 0, 1, 1],
                   "Square feet" : [1124, 2396, 625, 3384],
                   "Lotsize" : [2500, 1750, 2495, 2474],
                   "Year" : [1924, 1900, 1923, 1907]})

### Creating LM
y = df["Price"] # dependent variable
x = df[["Zip Code", "Days listed", "Bedrooms", "Loft", "Square feet", "Lotsize", "Year"]]

x = sm.add_constant(x) # adds a constant
lm = sm.OLS(y,x).fit() # fitting the model

# predict house prices
prices = sm.add_constant(x)

### Summary
#print(lm.summary())

### Adding predicted values and risidual values to df
df["predicted"] = pd.DataFrame(lm.predict(prices)) # predicted values
df["risidual"] = df["Price"] - df["predicted"] # risidual values

If anyone has an idea, could you explain to me the steps and give a code example? Thank you very much!

Emil
  • 25
  • 4
  • 1
    The budget constraint is a [knapsack problem](https://en.wikipedia.org/wiki/Knapsack_problem). – Swier Aug 05 '22 at 11:53

1 Answers1

1

With the clarification that you are looking for the best combination your problem is more complicated ;) I have tried a "brute-force" approach but at least my laptop takes forever with the full dataset. Find below my thoughts:

Obviously we have to calculate the combinations of many houses, therefore my first approach was to reduce the dataset as far as possible.

  1. If Price+2*min(Price)>budget there will be no combination with two houses that is smaller
  2. If risidual is negative we will not consider the house during optimization

In pandas this will look as this:

budget=7000000

df=df[df['Price']<(budget-2*df['Price'].min())].copy()
df=df[df['risidual']>0].copy()  

This reduces the objects from 1395 to 550.

Unfortunatly, 550 ID are still many combinations (27578100) as calculated with itertools:

import itertools
idx=[a for a in itertools.combinations(df.index,3)] 

You can evaluate these combinations by

result={comb: df.loc[[*comb], 'risidual'].sum() for comb in idx[10000:] if df.loc[[*comb], 'Price'].sum() < budget}

Note: I have limited the evaluation to the first 10000 values due to the calculation duration.

print("Combination: {}\nPrice: {}\nCost: {}".format(max(result),df.loc[[*max(result)], 'Price'].sum(),result[max(result)]  ))

Maybe it is advisable to calculate the combination of just two object first to further reduce the possible combinations. I think you should have a look at the Knapsack problem


I think you are almost there. Given that df["risidual"] has the difference between predicted and real price you have to select the subset that fits your limit e.g.

df_budget=df[df['price']<=budget].copy()

using pandas nlargest() you could retrieve the three biggest differences

df_budget.nlargest(3, 'risidual')

Note: Code was not tested due to missing sample data

Jacob
  • 304
  • 1
  • 6
  • Hi, thanks this solves it for individual houses below the budget. However the budget is supposed to cover a group of three houses. So three lines of data/prices should have the sum below 7.000.000 and highest risidual value total (again sum) compared to other groups of three houses within the budget. - Added the data for code testing – Emil Aug 05 '22 at 10:42
  • Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions – itprorh66 Aug 05 '22 at 13:09