1

(Sorry in advance, I'm new to Stack and this is my first question)

I have two dataframes, one containing house prices for different properties, ppd_df:

     price_paid deed_date     postcode  property_type   norm_price
2    36250      2015-11-16    BA1 1JU   F   
3    48000      2015-11-25    BA2 0HB   S   
4    60000      2017-08-31    BA1 4NB   F
... ... ... ... ... ...
8960 4025000    2015-07-16    BA1 2EU   T

And, a lookup dataframe of price averages for each property type in each month, ave_df:

    D_price S_price T_price F_price price_date  month_end
0   459471  285234  247582  208652  2015-01-01  2015-01-31
1   450617  279424  242798  205163  2015-02-01  2015-02-28
2   444885  275747  239328  202948  2015-03-01  2015-03-31
3   443513  274575  238553  201615  2015-04-01  2015-04-30
.........................................................
10  489997  303307  262281  218513  2015-11-01  2015-11-30
11  479240  297111  256468  213380  2015-12-01  2015-12-31

I want to normalise the prices in ppd_df to make them more comparable by dividing each price_paid by the average houseprice for that certain type of property in the month that it was bought and save this new value as norm_price.

So norm_price for a F property with a deed_date of 2015-02-16 would be: norm_price = price_paid / ( 205163)

I think I need to create a for loop that goes through each row in the ppd_df but am not sure how to go about it. I have tried using .itertuples. .merge and np.searchsorted like in Pandas: select DF rows based on another DF. The answers here also seem similar but I can't get it to work for my problem.

Thanks!

Kim_Turtle
  • 63
  • 1
  • 8

2 Answers2

1

Using @Prish's solution I manage to answer my question in a few more steps:

def norm(row):
   t_col = row['property_type'] + '_price'
   date = row['deed_date']
   d_row = ave_df.index.values[(date >= ave_df['price_date']) & (date <= ave_df['month_end'])]                       
   ave_price = ave_df.loc[d_row, t_col].values[0]
   return ave_price

ppd_df['norm_price'] = ppd_df['price_paid'] / ppd_df.apply(lambda row: norm(row), axis=1)
Kim_Turtle
  • 63
  • 1
  • 8
0

for a quick solution, you can do something like:

ppd_df.loc[ppd_df['property_type']=='T','norm_price'] = ppd_df['price_paid']/ave_df['T_price'] 

similar thing for the rest of the property_type.

EDIT: you can automate it in following way:

property_type = ppd_df['property_type'].unique()

for property in property_type:
    property_col = property + '_price'
    ppd_df.loc[(ppd_df['property_type']==property) & (ppd_df['deed_date'].str[:7]==ave_df['month_end'].str[:7]),'norm_price'] = ppd_df['price_paid']/ave_df[property_col] 

EDIT:
Check the updated answer. Now it also compares Year/month value of deed_date from ppd_df against month_end value of ave_df. 
Prish
  • 605
  • 7
  • 21
  • Hi Prish, thanks however this doesn't take into account looking-up the correct row based on the 'deed_date' falling in the range of the 'price_date' and 'month_end'. – Kim_Turtle Mar 03 '20 at 11:41
  • @Kim_Turtle Could you explain a bit more about date look up? I'm confused about it. i.e. for the first three row with property_code F,S and F, what are the appropriate values from ave_df? EDIT: Is it simply month and year value look up? – Prish Mar 03 '20 at 16:39
  • Sorry, I didn't explain that the 'ave_df' data went further than the rows shown, these can now be seen in the editted question. The price of the first row property in ppd_df should be divided by 218513, the F_price in the month of 2015-11 (so between price_date 2015-11-1 and month_end 2015-11-30) as the deed_date was 2015-11-16 – Kim_Turtle Mar 03 '20 at 20:56
  • so basically if year and month value of ppd_df['deed_date'] matches the value of ave_df['price_date'] then do the require operation ? – Prish Mar 03 '20 at 20:58
  • Yes, the year and month values of ppd_df['deed_date'] need to match ave_df['price_date']. The current edit produces "ValueError: Can only compare identically-labeled Series objects" – Kim_Turtle Mar 03 '20 at 21:16
  • can you upload both the CSVs somewhere with sample data of 10-15 rows? – Prish Mar 03 '20 at 21:17
  • It is most likely due to the mismatched indexes. – Prish Mar 03 '20 at 21:20
  • Sorry Prish, I'm very new to this. Where should I upload csvs? – Kim_Turtle Mar 03 '20 at 21:29
  • Umm, can you try resetting index of both the dataframes before above code? i.e. df = df.reset_index(drop=true) – Prish Mar 03 '20 at 21:33
  • Resetting the index hasn't helped – Kim_Turtle Mar 03 '20 at 21:37
  • I'm pretty sure code is working just fine. I tested it with the dummy data similar to given above. Could you google the error and see what's wrong? or post the full error log including which line is getting error etc. – Prish Mar 03 '20 at 21:47