
I have df1 with 200k records and df2 with 9k records. I want to check whether the email IDs are the same; if they are, compare the unique IDs, and if the unique IDs differ, replace df1's unique ID with df2's unique ID.

I tried this code, but it is taking too much time. How can I reduce its runtime?

```python
for index1, row1 in df1.iterrows():  # outer loop was cut off in the paste
    for index2, row2 in df2.iterrows():
        if row1['email'] == row2['Email']:
            if int(row1['unique_ID']) != int(row2['unique_ID']):
                print("yes")
                row1['unique_ID'] = row2['unique_ID']
                df1.loc[index1] = row1
```
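For reference, the same replace-by-email update can usually be done without loops by building an email-to-ID lookup from df2 and mapping it onto df1. This is only a sketch with made-up toy data (the real columns are `email`/`Email`/`unique_ID` per the code above), and it assumes each Email appears at most once in df2:

```python
import pandas as pd

# Toy frames standing in for the asker's real data.
df1 = pd.DataFrame({'email': ['a@x.com', 'b@x.com'], 'unique_ID': [1, 2]})
df2 = pd.DataFrame({'Email': ['a@x.com'], 'unique_ID': [99]})

# Lookup: Email -> unique_ID from df2 (assumes Email values are unique in df2).
id_by_email = df2.set_index('Email')['unique_ID']

# Map df2's IDs onto df1 by email; keep df1's own ID where there is no match.
df1['unique_ID'] = df1['email'].map(id_by_email).fillna(df1['unique_ID']).astype(int)
```

Replacing on every email match gives the same end result as replacing only when the IDs differ, and avoids the quadratic double loop entirely.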


James Z
Ad_sh
    I would help a lot if you posted a sample of your data, see this [page](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for some help. – joao Jul 03 '21 at 11:32
  • Yes, that is helpful. df1 is BigQuery table data having columns ['id', 'unique ID', 'name', 'lastname', 'firstname', 'email'], and df2 is a CSV file with the same columns. I am not sure whether I can share sample data, but I hope you understand the format. – Ad_sh Jul 03 '21 at 14:35
    If you cannot share sample data, you will get a lot less help. The first thing that people here want to do is copy/paste your question to run your code, that makes it very easy for them to understand your issue, but without data it's impossible. Your dataframes have 6 columns, that's not much, and 5 or 6 rows of data would be enough. The link I sent you has examples of simple ways to post data so it can be copied/pasted. – joao Jul 03 '21 at 14:59
  • @joao is right. Try to anonymize the data. It makes it easier for people here to help you. – Maicon Mauricio Jul 03 '21 at 16:37

1 Answer


`iterrows` is very slow, as you have seen. Use merge, groupby, and filtering to find the relevant rows and change the IDs for all of them at once. You can group by email and count how many unique IDs there are per email. Here's a toy example:

import pandas as pd

row1list = ['stack', '10']
row2list = ['overflow', '20']
row3list = ['overflow', '30']
df1 = pd.DataFrame([row1list, row2list, row3list], columns=['email', 'unique_ID'])

row1list = ['stack', '10']
row2list = ['overflow', '40']
df2 = pd.DataFrame([row1list, row2list], columns=['email', 'unique_ID'])

df_conflicting_ids = df1.groupby('email', as_index=False).agg({
    'unique_ID': lambda x: len(pd.Series.unique(x))})

df_conflicting_ids = df_conflicting_ids.rename(columns={'unique_ID':'unique_ID_count'})
df_conflicting_ids = df_conflicting_ids[df_conflicting_ids['unique_ID_count'] > 1]

print(df_conflicting_ids)
#       email  unique_ID_count
# 0  overflow                2

del df_conflicting_ids['unique_ID_count']  # don't need column anymore

df_conflicting_ids = df_conflicting_ids.merge(df2, on='email', how='left')
df_conflicting_ids = df_conflicting_ids.rename(columns={'unique_ID':'master_unique_ID'})

df1 = df1.merge(df_conflicting_ids, on='email', how='left')
df1.loc[df1['master_unique_ID'].notnull(), 'unique_ID'] = df1['master_unique_ID']
print(df1)
#       email unique_ID master_unique_ID
# 0     stack        10              NaN
# 1  overflow        40               40
# 2  overflow        40               40

del df1['master_unique_ID']  # don't need column anymore

I'm not sure whether you want to drop duplicates after you overwrite the unique_IDs. Also, you may want to store unique_ID as an integer, since you are comparing the values after converting them to integer.
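If so, both suggestions above can be sketched like this (toy data continuing the example, not the asker's real frames):

```python
import pandas as pd

# Toy frame resembling df1 after the overwrite step above.
df1 = pd.DataFrame({'email': ['overflow', 'overflow', 'stack'],
                    'unique_ID': ['40', '40', '10']})

# Store unique_ID as integer instead of string.
df1['unique_ID'] = df1['unique_ID'].astype(int)

# Drop rows that became duplicates after the overwrite.
df1 = df1.drop_duplicates().reset_index(drop=True)
```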

Hammurabi