I have two pandas DataFrames that look like this:
x = pd.DataFrame({'A': {0: 'other info', 1: 'other info', 2: 'other info', 3: 'other info', 4: 'other info'},
'B': {0: '1', 1: '1', 2: '2', 3: '2', 4: '3'},
'C': {0: 100, 1: 160, 2: 120, 3: 130, 4: 170}})
y = pd.DataFrame({'B': {0: '1', 1: '1', 2: '1', 3: '1', 4: '1'},
'start': {0: 90, 1: 100, 2: 125, 3: 150, 4: 160},
'end': {0: 110, 1: 115, 2: 135, 3: 160, 4: 200},
'info': {0: 'info', 1: 'info', 2: 'info', 3: 'info', 4: 'info'}})
Both DataFrames are sorted. x
is sorted first on B
then on C
. y is sorted on B
then on start
. Both DataFrames are about 20 million rows long.
The goal is to find data within x
that has an entry within y
that has a matching column B
and column C
that is >= start
and <= end
.
merged_result = pd.DataFrame({
'A': {0: 'other info', 1: 'other info'},
'B': {0: '1', 1: '1'},
'C': {0: 100, 1: 100}
'start': {0: 90, 1: 105},
'end': {0: 110, 1: 115},
'info': {0: 'info', 1: 'info'}})
I am currently doing this with the following code, but it takes about 0.3 seconds per row which takes far too long to run through the dataset. Is there a much faster way to do the same thing?
import time
start_time = time.time()
for index, row in x.head().iterrows():
merged = y[(y.start <= row.POS) &
(y.end >= row.POS) &
(y.CHROM == row.CHROM)].merge(pd.DataFrame(row).T, on='B', how='left')
out = out.append(merged).drop_duplicates(subset=['B',
'C',
])
print("--- %s seconds ---" % (time.time() - start_time))