I have a df consisting of many millions of rows. I need to run a recursive procedure that basically runs the lookup below repeatedly until the search exhausts itself (no more matches are found).
# df index is set to the search column -- this helps a lot; sorting the index actually hurts performance (surprisingly?)
df = df.set_index('search_col')

# the search function; pull some cols of interest
df.loc[df.index.isin(ids_to_search), ['val1', 'val2']].to_numpy()
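For what it's worth, the sorting claim can be checked with a quick timing sketch like this (it assumes `df` and `ids_to_search` exist as defined above):

```python
from timeit import timeit

# time the same lookup against the index as-is vs. a sorted copy
lookup = lambda d: d.loc[d.index.isin(ids_to_search), ['val1', 'val2']].to_numpy()

df_sorted = df.sort_index()
print('unsorted:', timeit(lambda: lookup(df), number=100))
print('sorted:  ', timeit(lambda: lookup(df_sorted), number=100))
```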
Recursion happens because I need to find all the children IDs associated with one ultimate parent ID. The process is as follows (a sketch of this loop follows the list):

1. Load a single parent ID
2. Search for its children IDs
3. Use the children IDs from step 2 as the new parent IDs
4. Search for their children IDs
5. Repeat steps 3-4 until no more children IDs are found
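For concreteness, here is a minimal sketch of that loop, written iteratively rather than recursively. It assumes `df` is indexed on the parent ID (`search_col`) and that `val1` is the column holding the children IDs -- which column actually carries the children is an assumption here:

```python
import pandas as pd

def collect_children(df: pd.DataFrame, root_id: str) -> set:
    """Gather every descendant ID of one ultimate parent ID."""
    found = set()
    frontier = {root_id}                    # step 1: load a single parent ID
    while frontier:                         # step 5: repeat until exhausted
        # steps 2 and 4: one vectorised lookup per generation of IDs
        children = df.loc[df.index.isin(frontier), 'val1'].to_numpy()
        frontier = set(children) - found    # step 3: children become the new parents
        found |= frontier
    return found
```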
The above is not bad, but with thousands of things to check, n levels deep with the recursion, it's a slow process at the end of the day.
`ids_to_search` is a list of random 32-character strings, sometimes involving dozens or hundreds of strings to check.
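Illustratively, the list has this shape (purely synthetic data; `token_hex(16)` produces a 32-character hex string):

```python
import secrets

# synthetic stand-in for ids_to_search: dozens to hundreds of 32-char strings
ids_to_search = [secrets.token_hex(16) for _ in range(200)]
```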
What other tricks might I try to employ?
Edit: Other Attempts
Other attempts that I have made, which did not perform better, are:
- Using `modin`, leveraging the Dask engine
- Swifter + `modin`, leveraging the Dask engine
- Swapping pandas `isin` for numpy's `np.in1d` (and converting the dataframe fully to numpy, too), ultimately aiming to use JIT/Numba, but I could not get it to work
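For reference, the numpy swap from the last bullet looked roughly like this sketch (`np.isin` is the modern spelling of `np.in1d`; column names follow the snippet at the top):

```python
import numpy as np

# pull everything out of pandas once, up front...
keys = df.index.to_numpy()                # the search column values
vals = df[['val1', 'val2']].to_numpy()    # the columns of interest

# ...so that each search is a pure-numpy membership test
mask = np.isin(keys, list(ids_to_search))
result = vals[mask]
```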