I have a df consisting of many millions of rows. I need to run a recursive procedure that basically runs the lookup below repeatedly until the search exhausts itself (no more matches are found).
# df index is set to the search column -- this helps a lot; sorting the index actually hurts performance (surprisingly?)
df = df.set_index('search_col')

# the search function; pull some cols of interest
df.loc[df.index.isin(ids_to_search), ['val1', 'val2']].to_numpy()
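For what it's worth, the sorting claim can be checked with a quick timing sketch like this (it assumes `df` and `ids_to_search` exist as defined above):

```python
from timeit import timeit

# time the same lookup against the index as-is vs. a sorted copy
lookup = lambda d: d.loc[d.index.isin(ids_to_search), ['val1', 'val2']].to_numpy()

df_sorted = df.sort_index()
print('unsorted:', timeit(lambda: lookup(df), number=100))
print('sorted:  ', timeit(lambda: lookup(df_sorted), number=100))
```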
Recursion happens because I need to find all the children IDs associated with one ultimate parent ID. The process is as follows (a sketch of this loop follows the list):

1. Load a single parent ID
2. Search for its children IDs
3. Use the children IDs from step 2 as the new parent IDs
4. Search for their children IDs
5. Repeat steps 3-4 until no more children IDs are found
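For concreteness, here is a minimal sketch of that loop, written iteratively rather than recursively. It assumes `df` is indexed on the parent ID (`search_col`) and that `val1` is the column holding the children IDs -- which column actually carries the children is an assumption here:

```python
import pandas as pd

def collect_children(df: pd.DataFrame, root_id: str) -> set:
    """Gather every descendant ID of one ultimate parent ID."""
    found = set()
    frontier = {root_id}                    # step 1: load a single parent ID
    while frontier:                         # step 5: repeat until exhausted
        # steps 2 and 4: one vectorised lookup per generation of IDs
        children = df.loc[df.index.isin(frontier), 'val1'].to_numpy()
        frontier = set(children) - found    # step 3: children become the new parents
        found |= frontier
    return found
```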
The above is not bad, but with thousands of things to check, n levels deep with the recursion, it's a slow process at the end of the day.
`ids_to_search` is a list of random 32-character strings, sometimes involving dozens or hundreds of strings to check.
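Illustratively, the list has this shape (purely synthetic data; `token_hex(16)` produces a 32-character hex string):

```python
import secrets

# synthetic stand-in for ids_to_search: dozens to hundreds of 32-char strings
ids_to_search = [secrets.token_hex(16) for _ in range(200)]
```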
What other tricks might I try to employ?
Edit: Other Attempts
Other attempts that I have made, which did not perform better, are:
- Using `modin`, leveraging the Dask engine
- Swifter + `modin`, leveraging the Dask engine
- Swapping pandas `isin` for numpy's `np.in1d` (and converting the dataframe fully to numpy, too), ultimately aiming to use JIT/Numba, but I could not get it to work
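For reference, the numpy swap from the last bullet looked roughly like this sketch (`np.isin` is the modern spelling of `np.in1d`; column names follow the snippet at the top):

```python
import numpy as np

# pull everything out of pandas once, up front...
keys = df.index.to_numpy()                # the search column values
vals = df[['val1', 'val2']].to_numpy()    # the columns of interest

# ...so that each search is a pure-numpy membership test
mask = np.isin(keys, list(ids_to_search))
result = vals[mask]
```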