I'm currently working on a project where I'm dealing with somewhat large data sets. I have a dataframe transactions and another, users.
Iterating through the transactions dataframe is no problem. I used timeit and it takes just under a minute to do so. My second dataframe, users, has 1,000 rows. Both of these dataframes have an email column. Essentially, what I'm trying to do is get the userId from the row in users whose email matches the email in each transactions row. My current approach looks like this:
for row in transactions.itertuples():
    userId = users[users['email'] == getattr(row, 'email')]['userId'].values[0]
This simple lookup works, but it's too slow for my use case. I left it running for over an hour and it still hadn't finished. Is there a faster way to do this lookup (ideally getting the runtime down to minutes instead of hours)?
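To make the setup concrete, here is a minimal sketch of the same lookup on toy data. The rows and the amount column are invented for illustration; only the email and userId column names come from the actual dataframes, and the sketch assumes every transaction email appears in users.

```python
import pandas as pd

# Toy stand-ins for the real dataframes; all values here are made up.
users = pd.DataFrame({
    "userId": [101, 102, 103],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
transactions = pd.DataFrame({
    "email": ["b@example.com", "a@example.com", "b@example.com", "c@example.com"],
    "amount": [9.99, 5.00, 12.50, 3.25],  # hypothetical column, for illustration only
})

user_ids = []
for row in transactions.itertuples():
    # Each iteration builds a boolean mask over the whole users dataframe,
    # so the total work grows with len(transactions) * len(users).
    user_id = users[users["email"] == row.email]["userId"].values[0]
    user_ids.append(user_id)

print(user_ids)
```

The result I'm after is one userId per transactions row, in the same order as the rows.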
Appreciate any help in advance!