
I have two huge dataframes that both have the same id field. I want to make a simple summary dataframe showing the maximum of specific columns. I understand iterrows() is frowned upon, so is there a one-liner to do this? I don't understand lambda/apply very well, but maybe that would work here.

Stand-alone example

import pandas as pd

myid = [1, 1, 2, 3, 4, 4, 5]
name = ['A', 'A', 'B', 'C', 'D', 'D', 'E']
x = [15, 12, 3, 3, 1, 4, 8]
df1 = pd.DataFrame(list(zip(myid, name, x)),
                   columns=['myid', 'name', 'x'])
display(df1)

myid = [1, 2, 2, 2, 3, 4, 5, 5]
name = ['A', 'B', 'B', 'B', 'C', 'D', 'E', 'E']
y = [9, 6, 3, 4, 6, 2, 8, 2]
df2 = pd.DataFrame(list(zip(myid, name, y)),
                   columns=['myid', 'name', 'y'])
display(df2)

mylist = df1['myid'].unique()
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...


Desired output

[image: a summary table with one row per myid showing the maximum of x and y, matching the output shown in the answers below]

a11

2 Answers


You can try `concat` + `groupby.max`:

out = (pd.concat((df1, df2), sort=False)
         .groupby(['myid', 'name']).max()
         .add_prefix("Max_").reset_index())

   myid name  Max_x  Max_y
0     1    A   15.0    9.0
1     2    B    3.0    6.0
2     3    C    3.0    6.0
3     4    D    4.0    2.0
4     5    E    8.0    8.0
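
If you only need specific columns, or also want the minimum (asked in the comments below), named aggregation fits on top of the same concat. A minimal sketch using df1/df2 from the question; the Max_x/Min_x output names are just illustrative:

# Named aggregations pick exactly which columns to summarize and can
# compute several statistics at once; the NaNs that concat introduces
# (x is missing from df2, y from df1) are skipped by max/min.
out = (pd.concat((df1, df2), sort=False)
         .groupby(['myid', 'name'])
         .agg(Max_x=('x', 'max'), Min_x=('x', 'min'),
              Max_y=('y', 'max'), Min_y=('y', 'min'))
         .reset_index())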
anky
  • thanks for the start on this. The real dataframes have dozens of columns, so how do I specify which columns I want the max for? I will also need to do it for the minimum. Sorry for leaving this out of the OP, I was trying to keep it simple – a11 Feb 19 '21 at 19:00
  • @a11 that's okay, can you edit your question with an example for your query? Check Rob's answer as well. I might log out, but if you don't get an answer I will be back in the morning IST – anky Feb 19 '21 at 19:05
  • merge()
  • named aggregations
(df1.merge(df2, on=["myid", "name"], how="outer")
    .groupby(["myid", "name"], as_index=False)
    .agg(MAX_X=("x", "max"), MAX_Y=("y", "max")))

   myid name  MAX_X  MAX_Y
0     1    A     15      9
1     2    B      3      6
2     3    C      3      6
3     4    D      4      2
4     5    E      8      8

Updated

  • you have noted that your data frames are huge and this solution is giving you an out-of-memory (OOM) error
  • logically, aggregating first and merging second will use less memory
pd.merge(
    df1.groupby(["myid","name"],as_index=False).agg(MAX_X=("x","max")),
    df2.groupby(["myid","name"],as_index=False).agg(MAX_Y=("y","max")),
    on=["myid","name"]
)
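
With dozens of columns (per the comments), the named aggregations need not be typed out by hand. A sketch, assuming every non-key column should get both a max and a min; the summarize helper is hypothetical, not part of the original answer:

# Hypothetical helper: build MAX_/MIN_ named aggregations for every
# non-key column, aggregate each frame, then merge the small results.
def summarize(df, keys=("myid", "name")):
    value_cols = [c for c in df.columns if c not in keys]
    aggs = {f"MAX_{c.upper()}": (c, "max") for c in value_cols}
    aggs.update({f"MIN_{c.upper()}": (c, "min") for c in value_cols})
    return df.groupby(list(keys), as_index=False).agg(**aggs)

pd.merge(summarize(df1), summarize(df2), on=["myid", "name"])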
Rob Raymond
  • thanks @Rob. This is good for the OP examples, but I tried this with the full dataframes and received the error `Unable to allocate 123. GiB for an array with shape (34, 487520935) and data type int64` – a11 Feb 19 '21 at 19:08
  • not sure if it will help, but you could try an inner or left join depending on your data. I chose outer as it won't drop data, but it is the most expensive – Rob Raymond Feb 19 '21 at 19:26
  • have updated - reduce first, merge second – Rob Raymond Feb 19 '21 at 19:53
  • ha, just posted the same conclusion! I'll drop my answer and accept yours, thank you for your time! – a11 Feb 19 '21 at 19:57
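
To illustrate the join-type suggestion from the comments (which rows survive depends on where the ids live, so treat this as a sketch rather than a drop-in fix):

# Cheaper alternatives to the outer join (fewer intermediate rows, less memory):
inner = df1.merge(df2, on=["myid", "name"], how="inner")  # only ids present in both frames
left = df1.merge(df2, on=["myid", "name"], how="left")    # all ids from df1; NaN where df2 has no match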