Filtering and grouping rows in one DataFrame, by another DataFrame

Question

I have two DFs. I want to iterate through rows in DF1 and filter all the rows in DF2 with same id and get column"B" value in new columns of DF1.

data = {'id': [1,2,3]}
df1 = pd.DataFrame(data)

data = {'id': [1, 1, 3,3,3], 'B': ['ab', 'bc','ad','ds','sd']}
df2 = pd.DataFrame(data)

DF1 - id (15k rows) DF2 - id, col1 (50M rows)

Desired output

data = {'id': [1,2,3],'B':['[ab,bc]','[]','[ad,ds,sd]']}
pd.DataFrame(data)

def func(df1):
    temp3=df2.merge(pd.DataFrame(data=[df1.values]*len(df1),columns=df1.index),how='right',on= 
    ['id'])
    temp1 = temp3.B.values
    return temp1
df1['B']=df1.apply(func,axis=1))

I am using merge for filtering and applying lambda function on df1. The code is taking 1 hour to execute on large data frame. How to make this run faster ?

id=2 has an extra item in its list in the output, not sure that makes sense. — cs95, Jul 07 '20 at 21:20
`df1` has the same IDs as `df2`. Is that a coincidence? If not, why do you need `df1` at all? If it is, could you add some other IDs to either dataframe? — DYZ, Jul 07 '20 at 21:20
Also, do you really want to have list-looking _strings_ as the output? — DYZ, Jul 07 '20 at 21:24
@DYZ: edited the output. I dont need strings as the output. Just array of all the associated B column values with the id — krishna agrawal, Jul 07 '20 at 21:34

score 3 · Answer 1 · answered Jul 07 '20 at 21:24

3

Are you looking for a simple filter and grouped listification?

df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)

   id             B
0   1      [ab, bc]
1   2      [ca, as]
2   3  [ad, ds, sd]

Note that grouping as lists is considered suboptimal in terms of performance.

answered Jul 07 '20 at 21:24

cs95

379,657
97
704
746

when you say sub-optimal what you are comparing it agianst? strings? – Umar.H Jul 07 '20 at 21:34
This is great solution. I hope it works pretty fast with 50M rows. Although, all the id's from df1 will not be in df2. So, maybe i need to create an additional merge after this step. – krishna agrawal Jul 07 '20 at 21:45
2

@Datanovice as compared to dtypes that are vectorizable (essentially anything that is not string or object). – cs95 Jul 07 '20 at 22:15
@krishnaagrawal the `[df2['id'].isin(df1['id'])]` step in my code accounts for that and filters out unmatched Ids beforehand, or am I mistaken? – cs95 Jul 07 '20 at 22:16
this output will only include the id's which are common in both the df's. so, i am merging this result with df1 and replacing null values with the '[]' to get the complete list. df1 = df1.merge(result,how='left', on=['id']) df1.loc[df['B'].isnull(),['B']] = df1.loc[df1['B'].isnull(),'B'].apply(lambda x: []) – krishna agrawal Jul 07 '20 at 22:29

Filtering and grouping rows in one DataFrame, by another DataFrame

1 Answers1