I have two dataframes:
One is a decently large dataset of a million or so records. This df contains user information including a "list_id" that they're associated with.
The next dataframe is simply a single column of all of the "list" IDs that we want to keep in dataframe #1. There are about 1000 of these IDs in this dataframe (#2).
The end goal is to drop every row in dataframe #1 whose list_id is not contained in dataframe #2.
What I've tried so far is to iterate over the rows in dataframe #1 with a for loop, checking each row's ID against the IDs in dataframe #2. If the ID isn't found in dataframe #2, the row is dropped. Here's what I've tried:
df1:
user_id | list_id |
---|---|
1234 | 234235 |
4567 | 754654 |
9012 | 345445 |
df2:
list_ids |
---|
234235 |
754654 |
123456 |
```python
for index, row in df1.iterrows():
    if row['list_id'] not in df2['list_ids']:
        # Remove user if not associated to a list
        df1.drop(index=index, inplace=True)
```
The biggest problem with this is how large the dataset is. This took about 8 hours to run, only to find that it had actually dropped every single row in df1. I feel like this is not the best approach to this problem and there's got to be a more "pandas-like" way to solve it. If any more information is needed, I'd be happy to add it!
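For reference, here is a vectorized sketch of the filtering, using the toy data and column names from the tables above. Two things worth noting: `x in series` tests the Series' *index labels*, not its values, which would explain why the loop dropped every row; and `Series.isin` builds a boolean mask over all rows in one pass, avoiding the row-by-row `drop` calls entirely.

```python
import pandas as pd

# Toy versions of the two dataframes from the tables above
df1 = pd.DataFrame({"user_id": [1234, 4567, 9012],
                    "list_id": [234235, 754654, 345445]})
df2 = pd.DataFrame({"list_ids": [234235, 754654, 123456]})

# `in` on a Series checks the index labels (0, 1, 2 here), not the values,
# so every list_id looks "missing" and every row gets dropped.
assert 234235 not in df2["list_ids"]

# Vectorized filter: keep only rows whose list_id appears in df2
kept = df1[df1["list_id"].isin(df2["list_ids"])]
print(kept)
#    user_id  list_id
# 0     1234   234235
# 1     4567   754654
```

A boolean-mask filter like this runs in roughly linear time over the million rows (isin hashes the ~1000 IDs), so it should finish in seconds rather than hours; repeatedly calling `drop(inplace=True)` inside `iterrows` rebuilds the frame on each deletion, which is where the 8 hours went.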