1

I have a df consisting of many millions of rows. I need to run a recursive procedure which basically runs this repeatedly until a condition exhausts itself.

# df index is set to the search column -- this helps a lot, sorting actually hurts performance (surprisingly?)
df = df.set_index('search_col')

# the search function; pull some cols of interest
df[df.index.isin(ids_to_search)][['val1', 'val2']].to_numpy()

Recursion happens because I need to find all the children IDs associated with one ultimate parent ID. The process is as follows:

  1. Load single parent ID
  2. Search for its children IDs
  3. Use step 2 children IDs as new parent IDs
  4. Search for its children IDs
  5. Repeat 3+ until no more children IDs are found

The above is not bad, but with thousands of things to check, n times with recursion, its a slow process at the end of the day.

ids_to_search consists of length 32 random strings in a list, sometimes involving dozens or hundreds of strings to check.

What other tricks might I try to employ?

Edit: Other Attempts

Other attempts that I have done, which did not perform better are:

  1. Using modin, leveraging the Dask engine
  2. Swifter + modin, leveraging the Dask engine
  3. Swapping pandas isin (and the dataframe to fully numpy, too) with numpy's np.in1d, ultimately to use JIT/Numba but I could not get it to work
John Stud
  • 1,506
  • 23
  • 46
  • 2
    It might help if you do an initial sort of the rows and use the [searchsorted](https://pandas.pydata.org/docs/reference/api/pandas.Series.searchsorted.html) method – Ben Grossmann Jan 07 '22 at 19:16
  • It would help if you could show a sample of the data you are dealing with? For example, what is the contents of the 'search_col' does it contain a single entry or multiple entries? – itprorh66 Jan 07 '22 at 19:31
  • 2
    Try the answer to this post: https://stackoverflow.com/questions/23945493/a-faster-alternative-to-pandas-isin-function. I believe the join method is quickest based on the analysis. – flow_me_over Jan 07 '22 at 21:20
  • 1
    Join and merge both perform worse than the above, for me. It is also 7 years old. – John Stud Jan 08 '22 at 20:38
  • 1
    `I need to run a recursive procedure which basically runs this repeatedly until a condition exhausts itself.` It might be useful to explain why this runs recursively, and what it's intended to do, from a high level perspective. It may be possible to re-use a past search, or search fewer times, but we can't make that suggestion without knowing *why* the search is being done. – Nick ODell Jan 08 '22 at 20:48

0 Answers0