0

I have two pandas DataFrames that look like this:

x = pd.DataFrame({'A': {0: 'other info', 1: 'other info', 2: 'other info', 3: 'other info', 4: 'other info'},
 'B': {0: '1', 1: '1', 2: '2', 3: '2', 4: '3'},
 'C': {0: 100, 1: 160, 2: 120, 3: 130, 4: 170}})

y = pd.DataFrame({'B': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
 'start': {0: 90, 1: 100, 2: 125, 3: 150, 4: 160},
 'end': {0: 110, 1: 115, 2: 135, 3: 160, 4: 200},
 'info': {0: 'info', 1: 'info', 2: 'info', 3: 'info', 4: 'info'}})

Both DataFrames are sorted. x is sorted first on B then on C. y is sorted on B then on start. Both DataFrames are about 20 million rows long.

The goal is to find data within x that has an entry within y that has a matching column B and column C that is >= start and <= end.

merged_result = pd.DataFrame({
     'A': {0: 'other info', 1: 'other info'},
     'B': {0: '1', 1: '1'},
     'C': {0: 100, 1: 100}
     'start': {0: 90, 1: 105},
     'end': {0: 110, 1: 115},
     'info': {0: 'info', 1: 'info'}})

I am currently doing this with the following code, but it takes about 0.3 seconds per row which takes far too long to run through the dataset. Is there a much faster way to do the same thing?

import time
start_time = time.time()

for index, row in x.head().iterrows():
    merged = y[(y.start <= row.POS) &
                     (y.end >= row.POS) &
                     (y.CHROM == row.CHROM)].merge(pd.DataFrame(row).T, on='B', how='left')

    out = out.append(merged).drop_duplicates(subset=['B',
                                                     'C',
                                                    ])
print("--- %s seconds ---" % (time.time() - start_time))
The Nightman
  • 5,609
  • 13
  • 41
  • 74
  • It's funny how "how to speedup" questions are correlated with the use of `iterrows`. You need to `merge_asof` on one of the bounds, then check the other bound. Or use an`IntervalIndex` – mozway Aug 02 '22 at 21:05
  • @mozway i wasn't away of these functions, they seem very helpful. But I'm not sure I see how to account for multiple instances of a row in `x` being found in `y` if i want to retain all of those instances. – The Nightman Aug 02 '22 at 21:18
  • `merge_asof` is a left merge, so you can have duplicates in `x`, this doesn't matter. Try the various solutions in the duplicated question, [this one](https://stackoverflow.com/a/46528074/16343464) for example. – mozway Aug 02 '22 at 21:22

0 Answers0