0

I have a pandas DataFrame with more than 1.1 million rows.

My code needs to achieve the following:

given a list of ids, lets say:

ids = [1, 5, 8, 46, 55, 57, 143, 1003, 1564, ..., ]

and my huge pandas DataFrame which contains a column with id values:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8, 9, ... , 1000000],
                   "some_value": ["a", "b", "a", "c", "c", "a", "a", "d", "f", ... , "a"],
                   "another_value": ["x", "y", "x", "z", "q", "x", "x", "x", "z", ... , "y"]
                  })

(The DataFrame is ordered along the id column, should that help)

I want to add a Boolean column selected which contains True if the id value is present in ids, otherwise False. The resulting DataFrame should have the column selected like this:

[True, False, False, False, True, False, False, True, False, ... ]

Currently I implemented it this way:

df["selected"] = False

for i in segment_ids:
    df.loc[df["id"] == i, "selected"] = True

It works like a charm, but the code takes about 20 minutes to run for my 1.1M rows DataFrame, which is very inconvenient.

How to achieve my goal in the least time-consuming way? Ideally I would like it to run in only a few minutes, but I don't know if this is possible.

Peter
  • 722
  • 6
  • 24
  • 3
    you are looking for [`series.isin`](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html): `df["selected"] = df['id'].isin(ids)` – anky Apr 17 '21 at 14:46
  • @anky is that the fastest way possible? – Peter Apr 17 '21 at 14:49
  • 1
    Yes for a dataframe, this is usually the way to go. you can try for yourself and see – anky Apr 17 '21 at 14:50
  • 1
    Related : [How to return list of booleans to see if elements of one list in another list](https://stackoverflow.com/questions/14430454/how-to-return-list-of-booleans-to-see-if-elements-of-one-list-in-another-list) and [How to filter Pandas dataframe using 'in' and 'not in' like in SQL](https://stackoverflow.com/questions/19960077/how-to-filter-pandas-dataframe-using-in-and-not-in-like-in-sql) – anky Apr 17 '21 at 14:56
  • @anky I works amazingly fast, thank you! Should you want to write it in an answer, I will accept it to give your some reps – Peter Apr 17 '21 at 15:01
  • Peter, thank you for the consideration :) However, I believe this has been asked before. It would be great if you consider answering the question and accepting it so as to close it. – anky Apr 17 '21 at 15:03

1 Answers1

0

The solution is to use

df["selected"] = df['id'].isin(ids)
Peter
  • 722
  • 6
  • 24