I have a pandas DataFrame with more than 1.1 million rows.
My code needs to achieve the following:
given a list of ids, let's say:
ids = [1, 5, 8, 46, 55, 57, 143, 1003, 1564, ..., ]
and my huge pandas DataFrame which contains a column with id values:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8, 9, ... , 1000000],
                   "some_value": ["a", "b", "a", "c", "c", "a", "a", "d", "f", ... , "a"],
                   "another_value": ["x", "y", "x", "z", "q", "x", "x", "x", "z", ... , "y"]
                   })
(The DataFrame is sorted by the id column, in case that helps.)
I want to add a Boolean column selected which contains True if the id value is present in ids, otherwise False. The resulting DataFrame should have the column selected like this:
[True, False, False, False, True, False, False, True, False, ... ]
Currently I implemented it this way:
df["selected"] = False
for i in ids:
    # each iteration compares the whole id column against a single value
    df.loc[df["id"] == i, "selected"] = True
It works like a charm, but the code takes about 20 minutes to run on my 1.1M-row DataFrame, which is very inconvenient.
How can I achieve my goal in the least time-consuming way? Ideally I would like it to run in only a few minutes, but I don't know if this is possible.
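For reference, here is a small, self-contained version of the setup together with one possible vectorized variant. It is only a sketch: the nine-row frame and the shortened ids list are stand-ins I made up for the real data, and it assumes that a single membership test with pandas' Series.isin is acceptable in place of the loop. I have not timed it on the full 1.1M rows.

```python
import pandas as pd

# Small stand-in for the real data (the actual frame has about 1.1M rows).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "some_value": ["a", "b", "a", "c", "c", "a", "a", "d", "f"],
    "another_value": ["x", "y", "x", "z", "q", "x", "x", "x", "z"],
})

# Hypothetical, shortened sample of the id list.
ids = [1, 5, 8, 46, 55]

# One vectorized membership test over the whole column,
# instead of one full-column comparison per id.
df["selected"] = df["id"].isin(ids)

print(df["selected"].tolist())
# [True, False, False, False, True, False, False, True, False]
```

The loop scans the entire id column once for every entry in ids, so its cost grows with len(ids) times the number of rows, whereas isin does the membership check in a single pass.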