1

Here is a dummy example of the DF I'm working with ('ETC' represents several columns):

df = pd.DataFrame(data={'PlotCode':['A','A','A','A','B','B','B','C','C'],
                        'INVYR':[2000,2000,2000,2005,1990,2000,1990,2005,2001],
                        'ETC':['a','b','c','d','e','f','g','h','i']})

picture of df (sorry not enough reputation yet)

And here is what I want to end up with:

df1 = pd.DataFrame(data={'PlotCode':['A','A','A','B','B','C'],
                        'INVYR':[2000,2000,2000,1990,1990,2001],
                        'ETC':['a','b','c','e','g','i']})

picture of df1

NOTE: I want ALL rows with minimum 'INVYR' values for each 'PlotCode', not just one or else I'm assuming I could do something easier with drop_duplicates and sort.

So far, following the answer here Appending pandas dataframes generated in a for loop I've tried this with the following code:

df1 = []

for i in df['PlotCode'].unique():
    j = df[df['PlotCode']==i]
    k = j[j['INVYR']==j['INVYR'].min()]
    df1.append(k)

df1 = pd.concat(df1)

This code works but is very slow, my actual data contains some 40,000 different PlotCodes so this isn't a feasible solution. Does anyone know some smooth filtering way of doing this? I feel like I'm missing something very simple.

Thank you in advance!

Dexoryte
  • 17
  • 6
  • `df.groupby(['PlotCode']).apply(lambda x: x[x.INVYR==x.INVYR.min()]).reset_index(drop=True)` -- I though this would be faster - but it's somehow slower (at least on a very small dataset, maybe it'll be faster on a larger one) – avloss Nov 14 '20 at 10:43

1 Answers1

2

Try not to use for loops when using pandas, they are extremely slow in comparison to the vectorized operations that pandas has.

Solution 1:
Determine the minimum INVYR for every plotcode, using .groupby():

min_invyr_per_plotcode = df.groupby('PlotCode', as_index=False)['INVYR'].min()

And use pd.merge() to do an inner join between your orignal df with this minimum you just found:

result_df = pd.merge(
    df, 
    min_invyr_per_plotcode, 
    how='inner', 
    on=['PlotCode', 'INVYR'],
)

Solution 2:

Again, determine the minimum per group, but now add it as a column to your dataframe. This minimum per group gets added to every row by using .groupby().transform()

df['min_per_group'] = (df
    .groupby('PlotCode')['INVYR']
    .transform('min')
)

Now filter your dataframe where INVYR in a row is equal to the minimum of that group:

df[df['INVYR'] == df['min_per_group']]
Sander van den Oord
  • 10,986
  • 5
  • 51
  • 96
  • Amazing! Knew there had to be a better vectorized way but couldn't quite work it out. Thank you so much! – Dexoryte Nov 14 '20 at 10:59