I have two dataframes:
One is a decently large dataset of a million or so records. This df contains user information including a "list_id" that they're associated with.
The next dataframe is simply a single column of all of the "list" IDs that we want to keep in dataframe #1. There are about 1000 of these IDs in this dataframe (#2).
The end goal is to drop every row in dataframe #1 whose list_id is not contained in dataframe #2.
What I've tried so far is to iterate over the rows in dataframe #1 with a for loop, checking each row's ID against the IDs in dataframe #2. If the ID isn't found in dataframe #2, the row is dropped. Here's what I've tried:
df1:
user_id | list_id |
---|---|
1234 | 234235 |
4567 | 754654 |
9012 | 345445 |
df2:
list_ids |
---|
234235 |
754654 |
123456 |
```python
for index, row in df1.iterrows():
    if row['list_id'] not in df2['list_ids']:
        # Remove user if not associated to a list
        df1.drop(index=index, inplace=True)
```
The biggest problem with this is how large the dataset is. This took about 8 hours to run, only to find that it had actually dropped every single row in df1. I feel like this is not the best approach to this problem and there's got to be a more "pandas-like" way to solve it. If any more information is needed, I'd be happy to add it!
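For reference, here is a vectorized sketch of the filtering, using the toy data and column names from the tables above. Two things worth noting: `x in series` tests the Series' *index labels*, not its values, which would explain why the loop dropped every row; and `Series.isin` builds a boolean mask over all rows in one pass, avoiding the row-by-row `drop` calls entirely.

```python
import pandas as pd

# Toy versions of the two dataframes from the tables above
df1 = pd.DataFrame({"user_id": [1234, 4567, 9012],
                    "list_id": [234235, 754654, 345445]})
df2 = pd.DataFrame({"list_ids": [234235, 754654, 123456]})

# `in` on a Series checks the index labels (0, 1, 2 here), not the values,
# so every list_id looks "missing" and every row gets dropped.
assert 234235 not in df2["list_ids"]

# Vectorized filter: keep only rows whose list_id appears in df2
kept = df1[df1["list_id"].isin(df2["list_ids"])]
print(kept)
#    user_id  list_id
# 0     1234   234235
# 1     4567   754654
```

A boolean-mask filter like this runs in roughly linear time over the million rows (isin hashes the ~1000 IDs), so it should finish in seconds rather than hours; repeatedly calling `drop(inplace=True)` inside `iterrows` rebuilds the frame on each deletion, which is where the 8 hours went.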