3

I'm doing a merge in R of my tables using the foverlaps function. But I need to reproduce the same output using python. I did a search and I found the merge function on pandas library. But even using this function, I can't reproduce the same output.

First the outputs in R:

This is the first table (intervals):

   V1 V2 intid
1:  1  5     1
2:  4  9     2
3:  6 12     3
4: 11 17     4
5: 18 20     5

This is the second table (decomp):

   V1 V2 subid
1:  1  4     A
2:  4  5     B
3:  5  6     C
4:  6  9     D
5:  9 11     E
6: 11 12     F
7: 12 17     G
8: 17 18     H
9: 18 20     I

The code in R that makes the merge:

relations <- foverlaps(decomp, intervals, type='within', nomatch=0)

The output (relations):

    V1 V2 intid i.V1 i.V2 subid
 1:  1  5     1    1    4     A
 2:  1  5     1    4    5     B
 3:  4  9     2    4    5     B
 4:  4  9     2    5    6     C
 5:  4  9     2    6    9     D
 6:  6 12     3    6    9     D
 7:  6 12     3    9   11     E
 8:  6 12     3   11   12     F
 9: 11 17     4   11   12     F
10: 11 17     4   12   17     G
11: 18 20     5   18   20     I

Now the outputs I have in python:

This is the first table (df_of_pairs):

   V1  V2  intid
0   1   5      1
1   4   9      2
2   6  12      3
3  11  17      4
4  18  20      5

This is the second table (df_of_adjacent):

   V1  V2 subid
0   1   4     A
1   4   5     B
2   5   6     C
3   6   9     D
4   9  11     E
5  11  12     F
6  12  17     G
7  17  18     H
8  18  20     I

Now is the problem, I did not reproduce the same output in python when I used the pandas merge. I tried it in several ways and I did not succeed with any, here's one of the ways I've used it:

df = df_of_pairs.merge(df_of_adjacent, left_on=['V1'], right_on=['V2'] )

The output (df):

   V1_x  V2_x  intid  V1_y  V2_y subid
0     4     9      2     1     4     A
1     6    12      3     5     6     C
2    11    17      4     9    11     E
3    18    20      5    17    18     H

This question is very similar to R foverlaps equivalent in Python, but in that case it has different columns.

Pablo Pavan
  • 144
  • 5

1 Answers1

2

I can't easily get your exact desired output, but here's a partial solution using IntervalIndex.

s1 = pd.IntervalIndex.from_arrays(df1['V1'], df1['V2'])  # default: closed='right'
s2 = pd.IntervalIndex.from_arrays(df2['V1'], df2['V2'])
df_of_adjacent.set_index(s2, inplace=True)
df_of_adjacent.loc[s1]
          V1  V2 subid
(1, 4]     1   4     A
(4, 5]     4   5     B
(4, 5]     4   5     B
(5, 6]     5   6     C
(6, 9]     6   9     D
(6, 9]     6   9     D
(9, 11]    9  11     E
(11, 12]  11  12     F
(11, 12]  11  12     F
(12, 17]  12  17     G
(18, 20]  18  20     I
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37
  • Thanks for your helping! I'll try to use this function. Do you know of other functions that might solve this problem? Maybe another library to python? – Pablo Pavan Apr 18 '18 at 20:33
  • hello @peter, I can't get the same output you got! My result is the same as the original dataframe! What could be happening? – Pablo Pavan Apr 19 '18 at 17:53
  • @PabloPavan, I'm not aware of any other python packages that duplicate the results of R's `foverlaps`, but these two threads look relevant: https://stackoverflow.com/questions/40247095/r-foverlaps-equivalent-in-python/40252872#40252872 https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others/42796283 As for why you're getting the same results as your original DataFrame, you must be missing `inplace=True` or otherwise not saving the results of the third line in my answer. – Peter Leimbigler Apr 20 '18 at 00:50