
I'm currently working on a project where I'm dealing with somewhat large data sets. I have one dataframe, transactions, and another, users.

Iterating through the transactions dataframe on its own is no problem; I used timeit and it takes just under a minute. My second dataframe, users, has 1,000 rows. Both dataframes have an email column. Essentially, for each row in transactions, I'm trying to get the userId from the row in users with the matching email. My current approach looks like this:

    # Row-by-row lookup: filter users on the transaction's email and take the first matching userId
    for row in transactions.itertuples():
        userId = users[users['email'] == getattr(row, 'email')]['userId'].values[0]

This simple lookup works; however, it's too slow for my use case. I left it running for over an hour and it still hadn't finished. I'm wondering if there's a faster way to do this lookup (ideally getting the runtime down to minutes instead of hours)?

Appreciate any help in advance!

Harrison
  • Have you checked out the Pandas merge function? You could use an inner join on email. Merge information at [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101). – DarrylG Feb 06 '21 at 19:29 (a sketch of this approach follows the comments)
  • @DarrylG This is EXACTLY what I was looking for! Just tested it and it creates a new dataframe with the transactions including the matching userIds. So all I have to do is drop the columns that I don't want and I'm set. Thanks so much! – Harrison Feb 06 '21 at 19:46
  • @Harrison--glad I could help. Yep, I've used it several times in the way you mentioned. It's very fast. – DarrylG Feb 06 '21 at 19:49
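
For completeness, here is a minimal sketch of the merge-based lookup suggested in the comments above. The sample data below is made up purely for illustration; only the transactions and users DataFrames and the email and userId columns come from the question.

    import pandas as pd

    # Hypothetical sample data; in the question, transactions and users are the
    # real (much larger) DataFrames, and email / userId are the real columns.
    transactions = pd.DataFrame({
        "email": ["a@example.com", "b@example.com", "a@example.com"],
        "amount": [10.0, 25.0, 7.5],
    })
    users = pd.DataFrame({
        "email": ["a@example.com", "b@example.com"],
        "userId": [101, 202],
    })

    # A single vectorized inner join on email attaches the matching userId to
    # every transaction row, replacing the per-row boolean filtering in the loop.
    merged = transactions.merge(users[["email", "userId"]], on="email", how="inner")

    # Drop columns that are no longer needed, as mentioned in the comments.
    result = merged.drop(columns=["email"])
    print(result)

Note that an inner join keeps only transactions whose email appears in users; using how="left" instead would keep unmatched transactions with a NaN userId.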

0 Answers