
I have two dataframes:

One is a decently large dataset of a million or so records. This df contains user information including a "list_id" that they're associated with.

The next dataframe is simply a single column of all of the "list" IDs that we want to keep in dataframe #1. There are about 1000 of these IDs in this dataframe (#2).

The end goal is to drop every row in dataframe #1 whose list ID is not contained in dataframe #2.

What I've tried so far is to iterate over the rows in dataframe #1 in a for loop, checking whether each row's ID appears in dataframe #2. If the loop doesn't find the ID in dataframe #2, it drops the row. Here's what I've tried:

df1:

user_id list_id
1234 234235
4567 754654
9012 345445

df2:

list_ids
234235
754654
123456

for index, row in df1.iterrows():
    if row['list_id'] not in df2['list_ids']:
        # Remove user if not associated with a kept list
        df1.drop(index=index, inplace=True)
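A minimal sketch of what I think is happening (using the example df2 above, assuming the columns are plain integers): `in` on a pandas Series tests membership in the *index labels*, not the values, so the check above may never match.

```python
import pandas as pd

df2 = pd.DataFrame({"list_ids": [234235, 754654, 123456]})

# `in` on a Series checks the index (0, 1, 2 here), not the data,
# so a real list_id comes back False even though it is in the column.
print(234235 in df2["list_ids"])           # False: 234235 is not an index label
print(234235 in df2["list_ids"].values)    # True: .values checks the actual data
```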

The biggest problem with this is how large the dataset is. This took about 8 hours to run, only to find that it had actually dropped every single row in df1. I feel like this is not the best approach to this problem and there's got to be a more "pandas-like" way to solve it. If any more information is needed, I'd be happy to add it!
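For reference, here is a sketch of the vectorized filter I'm after, using `Series.isin` on the example frames above (assuming the ID columns have matching dtypes). It builds a boolean mask in one pass instead of looping and dropping row by row:

```python
import pandas as pd

df1 = pd.DataFrame({"user_id": [1234, 4567, 9012],
                    "list_id": [234235, 754654, 345445]})
df2 = pd.DataFrame({"list_ids": [234235, 754654, 123456]})

# Keep only rows whose list_id appears in df2's list_ids column
filtered = df1[df1["list_id"].isin(df2["list_ids"])]
print(filtered)  # user 9012 (list 345445) is dropped
```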
