So I have a dataframe with all the boroughs in London with their average house price from the years 1995-2021.

What I am trying to do is compile a new dataframe that takes the most expensive borough for each year.

The column names for the original df are: [London_Borough, ID, Average_price, Year]

At first I figured I could loop over each year, build a temporary df holding each borough and its price for that year, and then extract the maximum average price from it.

For example:

for i in range(1995, 2022, 1):
    temp = df[df['Year'] == i]
    yr_max = temp['Average_price'].max()

The problem with this is that while I get the highest average price for that year, all I have is the number, without the borough it belongs to.

Is there any way I can extract the entire row, or at least just the borough and the price?

This honestly might just be a simple syntax problem, but I have scoured my notes and online resources and cannot find a way to locate a row given a particular value in one column.

The only solution I could think of is to reset the index of the temporary df, build a list of that year's average prices, loop through the list until an entry matches the max price, and then use that position to locate the row in the temporary df. That is not an acceptable solution: it is overcomplicated, and since the course I am taking is for data science, efficiency is a priority.

2 Answers

This will be a complete change to my original answer. You can use groupby to create a DataFrame containing 'Year' and 'Average_price' and use indexes to merge it with the original DataFrame:

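import pandas as pd

df = pd.DataFrame(
    [[1999, 1252, "Barnet"],
     [1999, 1525, "Enfield"],
     [2001, 1524, "Camden"]],
    columns=['year', 'price', 'london_borough'])
# Maximum price per year, indexed by (year, price)
idx = df.groupby('year').agg({'price': 'max'}).reset_index().set_index(['year', 'price'])
# Give the original DataFrame the same index so the two can be joined on it
df.set_index(['year', 'price'], inplace=True)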

Then merge the two DataFrames on the index, keeping only the rows whose (year, price) pair appears in idx:

df = df.merge(idx, left_index = True, right_index = True, how = 'right')

You can also avoid setting indexes and use column names.
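As a minimal sketch of that column-based variant (the names yearly_max and result are only illustrative, and the toy data is the same as above), the per-year maximum can be computed as a regular DataFrame and merged back on both columns:

```python
import pandas as pd

# Same toy data as above, kept as ordinary columns (no index juggling).
df = pd.DataFrame(
    [
        [1999, 1252, "Barnet"],
        [1999, 1525, "Enfield"],
        [2001, 1524, "Camden"],
    ],
    columns=["year", "price", "london_borough"],
)

# One row per year holding that year's maximum price.
yearly_max = df.groupby("year", as_index=False)["price"].max()

# An inner merge on both columns keeps only the rows whose price
# equals the maximum for their year.
result = df.merge(yearly_max, on=["year", "price"], how="inner")
print(result)
```

If two boroughs share the maximum price in a year, both rows survive the merge.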

Ostap Orishko

If I understand what you want correctly, you can use one of these two approaches:

  1. Approach: keeping your loop (not recommended; see this post). Filter each year's rows down to those whose price equals that year's maximum:
for i in range(1995, 2022, 1):
    temp = df[df['Year'] == i]
    yr_max = temp[temp['Average_price'] == temp['Average_price'].max()]
  2. Approach: use pandas built-in methods. idxmax returns index labels, so select the rows with loc:
df.loc[df.groupby(['Year'])['Average_price'].idxmax()]

for example using the following input:

    Year  Average_price london_borough
0   1999           1320         Barnet
1   1999            810        Enfield
2   1999           2250         Ealing
3   2000           1524         Bexley
4   2000            810         Camden
5   2000           1524          Brent
6   2001           1524         Barnet
7   2001           2540         Barnet
8   2001            810         Ealing
9   2002           1524         Camden
10  2002           3000         Ealing
11  2002           1524          Brent

you'll get the output:

>>> print(df.loc[df.groupby(['Year'])['Average_price'].idxmax()])

    Year  Average_price london_borough
2   1999           2250         Ealing
3   2000           1524         Bexley
7   2001           2540         Barnet
10  2002           3000         Ealing

And if you want to access a specific year you can do:

>>> yr_max = df.loc[df.groupby(['Year'])['Average_price'].idxmax()]
>>> yr_max[yr_max['Year'] == 1999]

   Year  Average_price london_borough
2  1999           2250         Ealing
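One thing worth noting about the example: in 2000, Bexley and Brent are tied at 1524, and idxmax reports only the first of them. If every tied borough should be kept, a possible sketch (not part of the approach above; the names year_max and all_max are illustrative) is to compare each row against its year's maximum using transform:

```python
import pandas as pd

# Part of the example input above; 2000 has two boroughs tied at 1524.
df = pd.DataFrame(
    {
        "Year": [1999, 1999, 1999, 2000, 2000, 2000],
        "Average_price": [1320, 810, 2250, 1524, 810, 1524],
        "london_borough": ["Barnet", "Enfield", "Ealing", "Bexley", "Camden", "Brent"],
    }
)

# Broadcast each year's maximum back onto every row of that year.
year_max = df.groupby("Year")["Average_price"].transform("max")

# Keep every row that matches its year's maximum, ties included.
all_max = df[df["Average_price"] == year_max]
print(all_max)
```

This keeps rows 2, 3 and 5, so both Bexley and Brent appear for 2000, which matches what the loop in the first approach would return for that year.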
99_m4n