2

I have two DFs. I want to iterate through rows in DF1 and filter all the rows in DF2 with same id and get column"B" value in new columns of DF1.

data = {'id': [1,2,3]}
df1 = pd.DataFrame(data)

data = {'id': [1, 1, 3,3,3], 'B': ['ab', 'bc','ad','ds','sd']}
df2 = pd.DataFrame(data)

DF1 - id (15k rows) DF2 - id, col1 (50M rows)

Desired output

data = {'id': [1,2,3],'B':['[ab,bc]','[]','[ad,ds,sd]']}
pd.DataFrame(data)

def func(df1):
    temp3=df2.merge(pd.DataFrame(data=[df1.values]*len(df1),columns=df1.index),how='right',on= 
    ['id'])
    temp1 = temp3.B.values
    return temp1
df1['B']=df1.apply(func,axis=1))

I am using merge for filtering and applying lambda function on df1. The code is taking 1 hour to execute on large data frame. How to make this run faster ?

cs95
  • 379,657
  • 97
  • 704
  • 746

1 Answers1

3

Are you looking for a simple filter and grouped listification?

df2[df2['id'].isin(df1['id'])].groupby('id', as_index=False)[['B']].agg(list)

   id             B
0   1      [ab, bc]
1   2      [ca, as]
2   3  [ad, ds, sd]

Note that grouping as lists is considered suboptimal in terms of performance.

cs95
  • 379,657
  • 97
  • 704
  • 746
  • when you say sub-optimal what you are comparing it agianst? strings? – Umar.H Jul 07 '20 at 21:34
  • This is great solution. I hope it works pretty fast with 50M rows. Although, all the id's from df1 will not be in df2. So, maybe i need to create an additional merge after this step. – krishna agrawal Jul 07 '20 at 21:45
  • 2
    @Datanovice as compared to dtypes that are vectorizable (essentially anything that is not string or object). – cs95 Jul 07 '20 at 22:15
  • @krishnaagrawal the `[df2['id'].isin(df1['id'])]` step in my code accounts for that and filters out unmatched Ids beforehand, or am I mistaken? – cs95 Jul 07 '20 at 22:16
  • this output will only include the id's which are common in both the df's. so, i am merging this result with df1 and replacing null values with the '[]' to get the complete list. df1 = df1.merge(result,how='left', on=['id']) df1.loc[df['B'].isnull(),['B']] = df1.loc[df1['B'].isnull(),'B'].apply(lambda x: []) – krishna agrawal Jul 07 '20 at 22:29