
I have two huge dataframes that both have the same id field. I want to make a simple summary dataframe showing the maximum of specific columns. I understand iterrows() is frowned upon, so is there a one-liner to do this? I don't understand lambda/apply very well, but maybe that would work here.

Stand-alone example

import pandas as pd

myid = [1, 1, 2, 3, 4, 4, 5]
name = ['A', 'A', 'B', 'C', 'D', 'D', 'E']
x = [15, 12, 3, 3, 1, 4, 8]
df1 = pd.DataFrame(list(zip(myid, name, x)),
                   columns=['myid', 'name', 'x'])
display(df1)

myid = [1, 2, 2, 2, 3, 4, 5, 5]
name = ['A', 'B', 'B', 'B', 'C', 'D', 'E', 'E']
y = [9, 6, 3, 4, 6, 2, 8, 2]
df2 = pd.DataFrame(list(zip(myid, name, y)),
                   columns=['myid', 'name', 'y'])
display(df2)

mylist = df1['myid'].unique()
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...


Desired output

[image: a summary table with one row per myid showing the maximum of x and y, matching the output shown in the answers below]

a11

2 Answers


You can try `concat` + `groupby.max`:

out = (pd.concat((df1, df2), sort=False)
         .groupby(['myid', 'name']).max()
         .add_prefix("Max_").reset_index())

   myid name  Max_x  Max_y
0     1    A   15.0    9.0
1     2    B    3.0    6.0
2     3    C    3.0    6.0
3     4    D    4.0    2.0
4     5    E    8.0    8.0
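
If you only need specific columns, or also want the minimum (asked in the comments below), named aggregation fits on top of the same concat. A minimal sketch using df1/df2 from the question; the Max_x/Min_x output names are just illustrative:

# Named aggregations pick exactly which columns to summarize and can
# compute several statistics at once; the NaNs that concat introduces
# (x is missing from df2, y from df1) are skipped by max/min.
out = (pd.concat((df1, df2), sort=False)
         .groupby(['myid', 'name'])
         .agg(Max_x=('x', 'max'), Min_x=('x', 'min'),
              Max_y=('y', 'max'), Min_y=('y', 'min'))
         .reset_index())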
anky
  • thanks for the start on this. The real dataframes have dozens of columns, so how do I specify which columns I want the max for? I will also need to do it for the minimum. Sorry for leaving this out of the OP, I was trying to keep it simple – a11 Feb 19 '21 at 19:00
  • @a11 that's okay, can you edit your question with an example for your query? Check Rob's answer as well. I might log out, but if you don't get an answer I will be back in the morning IST – anky Feb 19 '21 at 19:05
  • merge()
  • named aggregations
(df1.merge(df2, on=["myid", "name"], how="outer")
    .groupby(["myid", "name"], as_index=False)
    .agg(MAX_X=("x", "max"), MAX_Y=("y", "max")))

   myid name  MAX_X  MAX_Y
0     1    A     15      9
1     2    B      3      6
2     3    C      3      6
3     4    D      4      2
4     5    E      8      8

Updated

  • you have noted that your data frames are huge and this solution is giving you an out-of-memory (OOM) error
  • logically, aggregating first and merging second will use less memory
pd.merge(
    df1.groupby(["myid","name"],as_index=False).agg(MAX_X=("x","max")),
    df2.groupby(["myid","name"],as_index=False).agg(MAX_Y=("y","max")),
    on=["myid","name"]
)
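
With dozens of columns (per the comments), the named aggregations need not be typed out by hand. A sketch, assuming every non-key column should get both a max and a min; the summarize helper is hypothetical, not part of the original answer:

# Hypothetical helper: build MAX_/MIN_ named aggregations for every
# non-key column, aggregate each frame, then merge the small results.
def summarize(df, keys=("myid", "name")):
    value_cols = [c for c in df.columns if c not in keys]
    aggs = {f"MAX_{c.upper()}": (c, "max") for c in value_cols}
    aggs.update({f"MIN_{c.upper()}": (c, "min") for c in value_cols})
    return df.groupby(list(keys), as_index=False).agg(**aggs)

pd.merge(summarize(df1), summarize(df2), on=["myid", "name"])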
Rob Raymond
  • thanks @Rob. This is good for the OP examples, but I tried this with the full dataframes and received the error `Unable to allocate 123. GiB for an array with shape (34, 487520935) and data type int64` – a11 Feb 19 '21 at 19:08
  • not sure if it will help, but you could try an inner or left join depending on your data. I chose outer as it won't drop data, but it is the most expensive – Rob Raymond Feb 19 '21 at 19:26
  • have updated - reduce first, merge second – Rob Raymond Feb 19 '21 at 19:53
  • ha, just posted the same conclusion! I'll drop my answer and accept yours, thank you for your time! – a11 Feb 19 '21 at 19:57
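
To illustrate the join-type suggestion from the comments (which rows survive depends on where the ids live, so treat this as a sketch rather than a drop-in fix):

# Cheaper alternatives to the outer join (fewer intermediate rows, less memory):
inner = df1.merge(df2, on=["myid", "name"], how="inner")  # only ids present in both frames
left = df1.merge(df2, on=["myid", "name"], how="left")    # all ids from df1; NaN where df2 has no match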