3

I have 2 dataframes I would like to merge. Both continue for 300 seconds (in the start column). They are below.

df_1:

   color  start  stop
0   blue   2.72  2.85
1  green   2.86  3.09
2   blue   3.10  3.47
3  green   3.48  4.69
4   blue   4.70  5.97
5  green   5.98  7.07  

df_2:

  confidence start  
0      .11  2.79
1      .78  2.99  
2      .65  3.04  
3      .22  3.43  
4      .54  3.61  
5      .99  3.99  
6      .52  4.24 
7      .63  4.31 
8      .71  4.67 
9      .82  4.85 
10     .81  5.09 
11     .33  5.26 
12     .31  5.69 
13     .44  5.99 
14     .55  6.22 
15     .81  6.43 
16     .31  6.93         
17     .32  7.01 

… and so on

I want to merge the aggregated mean value for df_2['confidence'] when df_2['start'] value is in between the df_1['start'] and df_1['stop'] value.

Ideally, it looks like this:

      color start  stop  confidence
0      blue 2.72  2.85    .11
1     green 2.86  3.09    .72
2      blue 3.10  3.47    .22
3     green 3.48  4.69    .68
4      blue 4.70  5.97    .57
5     green 5.98  7.07    .49 

Thanks!

cs95
  • 379,657
  • 97
  • 704
  • 746
connor449
  • 1,549
  • 2
  • 18
  • 49
  • Gotcha, thanks for the tip. Both are useful, but accepting yours as it is more intuitive to me. Thank you! – connor449 Jun 26 '19 at 17:23

2 Answers2

4

You can use the IntervalIndex to build an interval tree, then get the positions of df2['start'] using IntervalIndex.get_indexer, then finally group and find the mean:

idx = pd.IntervalIndex.from_arrays(df['start'], df['stop']) 

df.join(
    df2.groupby(idx.get_indexer(df2['start']))['confidence'].mean(), how='left')

   color  start  stop  confidence
0   blue   2.72  2.85      0.1100
1  green   2.86  3.09      0.7150
2   blue   3.10  3.47      0.2200
3  green   3.48  4.69      0.6780
4   blue   4.70  5.97      0.5675
5  green   5.98  7.07      0.4860
cs95
  • 379,657
  • 97
  • 704
  • 746
3

IIUC, you can use pd.cut and groupby, then merge:

# bins for cut
bins=[df1.start[0] ] + df1.stop.to_list()

# label the start in df2 by cuts:
s = pd.cut(df2.start, bins=bins, labels=df1.start)

# group df2 by the cuts:
new_df = df2.groupby(s).confidence.mean()

# merge
df1.merge(new_df, left_on='start', right_index=True)

gives you:

   color start  stop  confidence
0   blue  2.72  2.85    0.110000
1  green  2.85  3.09    0.715000
2   blue  3.09  3.47    0.220000
3  green  8.43  8.69    0.577857
4   blue  8.69  8.97         NaN
5  green  8.97  9.07         NaN

With the edited df1 (which makes sense as it matches the output):

   color start  stop  confidence
0   blue  2.72  2.85      0.1100
1  green  2.86  3.09      0.7150
2   blue   3.1  3.47      0.2200
3  green  3.48  4.69      0.6780
4   blue   4.7  5.97      0.5675
5  green  5.98  7.07      0.4860
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74