Merging two pandas dataframes with complex conditions

Question

I would like to merge two dataframes. Let's consider the following two dfs:

df1:

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5
id1, 2017-04-27 01:36:05, cotton,      3.5
id1, 2017-04-27 01:36:55, cotton,      3.5
id1, 2017-04-27 01:37:20, cotton,      3.5
id2, 2017-04-27 02:35:35, cotton blue, 5.0
id2, 2017-04-27 02:36:00, cotton blue, 5.0
id2, 2017-04-27 02:36:35, cotton blue, 5.0
id2, 2017-04-27 02:37:20, cotton blue, 5.0

df2:

id_B,  ts_B,                 value
id1,   2017-03-27 01:25:40,  100
id1,   2017-03-27 01:25:50,  200
id1,   2017-03-27 01:25:50,  230
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350
id1,   2017-04-27 01:36:10,  400
id1,   2017-04-27 01:36:20,  500
id1,   2017-04-27 01:36:30,  600
id1,   2017-04-27 01:36:40,  700
id1,   2017-04-27 01:36:50,  800
id1,   2017-04-27 01:37:00,  900
id1,   2017-04-27 01:37:10, 1000
id2,   2017-04-27 02:35:40,  1000
id2,   2017-04-27 02:35:50,  2000
id2,   2017-04-27 02:36:00,  4500
id2,   2017-04-27 02:36:10,  3000
id2,   2017-04-27 02:36:20,  6000
id2,   2017-04-27 02:36:30,  5000
id2,   2017-04-27 02:36:40,  5022
id2,   2017-04-27 02:36:50,  5040
id2,   2017-04-27 02:37:00,  3200
id2,   2017-04-27 02:37:10,  9000

df1 should be merged with df2 under the following condition: taking the time interval between two consecutive rows of df1, I want to merge each df1 row with the average value of all the rows in df2 that fall within that interval. For example,

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5

should be merged with

id_B,  ts_B,                 value
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350

to obtain

id_A,           ts_A,    course,     weight  avgValue
id1, 2017-04-27 01:35:30, cotton,      3.5  263.3
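
As a sanity check on the expected number, the example above can be reproduced by filtering df2 directly. This is only a minimal sketch: it assumes the interval is half-open (start inclusive, end exclusive), which the question does not state, and the names `start`, `end`, and `in_window` are illustrative.

```python
import pandas as pd

# Interval between the first two id1 rows of df1 (values taken from the question).
start = pd.Timestamp("2017-04-27 01:35:30")
end = pd.Timestamp("2017-04-27 01:36:05")

# A few id1 rows of df2 around that interval.
df2 = pd.DataFrame({
    "ts_B": pd.to_datetime([
        "2017-04-27 01:35:40",
        "2017-04-27 01:35:50",
        "2017-04-27 01:36:00",
        "2017-04-27 01:36:10",
    ]),
    "value": [240, 200, 350, 400],
})

# Assumption: keep rows with start <= ts_B < end.
in_window = df2[(df2["ts_B"] >= start) & (df2["ts_B"] < end)]

# (240 + 200 + 350) / 3
avg_value = in_window["value"].mean()
print(round(avg_value, 1))
```

The row at 01:36:10 falls outside the interval, so only three values enter the average.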

I tried to see the problem from another perspective - which would include the missing rows of df2 into df1 - by using merge_asof but I do not get the right result:

pd.merge_asof(df2_sorted, df1, left_on='ts_B', right_on='ts_A', left_by='id_B', right_by='id_A', direction='backward')

jezrael · Accepted Answer · 2017-07-21T12:15:13.853

I think you need merge_asof. To tell the rows of df1 apart, first use reset_index so that every row of df1 gets a unique counter value:

df1 = df1.reset_index(drop=True)
print (df1.index)
RangeIndex(start=0, stop=8, step=1)

df = pd.merge_asof(df2_sorted, 
                   df1.reset_index(), 
                   left_on='ts_B', 
                   right_on='ts_A', 
                   left_by='id_B', 
                   right_by='id_A')

Then group by the output columns (don't forget the index column) and aggregate with mean:

# Parentheses are needed so the method chain can span several lines.
df = (df.groupby(['id_A', 'ts_A', 'course', 'weight', 'index'], as_index=False)['value']
        .mean()
        .drop('index', axis=1))
print (df)
  id_A                ts_A       course  weight        value
0  id1 2017-04-27 01:35:30       cotton     3.5   263.333333
1  id1 2017-04-27 01:36:05       cotton     3.5   600.000000
2  id1 2017-04-27 01:36:55       cotton     3.5   950.000000
3  id2 2017-04-27 02:35:35  cotton blue     5.0  1500.000000
4  id2 2017-04-27 02:36:00  cotton blue     5.0  4625.000000
5  id2 2017-04-27 02:36:35  cotton blue     5.0  5565.500000
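
For reference, the two steps above can be put together into one self-contained script. This is a sketch rather than a definitive implementation: it uses only the id1 rows from the question, assumes both timestamp columns are parsed as datetimes and sorted, and the rename to `avgValue` is added here to match the column name in the question.

```python
import pandas as pd

# id1 rows of df1 from the question.
df1 = pd.DataFrame({
    "id_A": ["id1"] * 4,
    "ts_A": pd.to_datetime([
        "2017-04-27 01:35:30", "2017-04-27 01:36:05",
        "2017-04-27 01:36:55", "2017-04-27 01:37:20",
    ]),
    "course": ["cotton"] * 4,
    "weight": [3.5] * 4,
})

# One March row (earlier than every df1 row) plus the April id1 rows of df2.
df2 = pd.DataFrame({
    "id_B": ["id1"] * 11,
    "ts_B": pd.to_datetime([
        "2017-03-27 01:25:40",
        "2017-04-27 01:35:40", "2017-04-27 01:35:50", "2017-04-27 01:36:00",
        "2017-04-27 01:36:10", "2017-04-27 01:36:20", "2017-04-27 01:36:30",
        "2017-04-27 01:36:40", "2017-04-27 01:36:50", "2017-04-27 01:37:00",
        "2017-04-27 01:37:10",
    ]),
    "value": [100, 240, 200, 350, 400, 500, 600, 700, 800, 900, 1000],
})

# merge_asof requires both frames to be sorted by their time key.
df1 = df1.sort_values("ts_A").reset_index(drop=True)
df2_sorted = df2.sort_values("ts_B")

# Attach to each df2 row the latest df1 row (same id) with ts_A <= ts_B.
# reset_index() adds an 'index' column that identifies each df1 row.
merged = pd.merge_asof(
    df2_sorted,
    df1.reset_index(),
    left_on="ts_B",
    right_on="ts_A",
    left_by="id_B",
    right_by="id_A",
)

# Average 'value' per df1 row; df2 rows without a match have NaN keys
# and are dropped by groupby.
result = (
    merged.groupby(["id_A", "ts_A", "course", "weight", "index"], as_index=False)["value"]
    .mean()
    .drop(columns="index")
    .rename(columns={"value": "avgValue"})
)
print(result)
```

The last df1 row (01:37:20) does not appear in the result because no df2 row follows it, and the March row is dropped because no df1 row precedes it.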

edited Jul 21 '17 at 12:15

answered Jul 21 '17 at 12:06

Many thanks. I am applying it to my case and will be back in a few minutes. – Carlo Allocca Jul 21 '17 at 12:29
I got the following error when executing `df = df.groupby(schema2, as_index=False)['value'].mean().drop('index', axis=1)`: `raise DataError('No numeric types to aggregate')` `pandas.core.base.DataError: No numeric types to aggregate` – Carlo Allocca Jul 21 '17 at 12:49
I think you need `df2['value'] = df2['value'].astype(float)` if the values are floats, or `df2['value'] = df2['value'].astype(int)` if they are ints, as a first step. – jezrael Jul 21 '17 at 12:54
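
An error like this usually suggests that the `value` column is not numeric, for example because it was read in as strings. The sketch below shows one way to convert it before aggregating. The sample frame is hypothetical, and `pd.to_numeric` is used here as an alternative to `astype`, not as the method from the comment above.

```python
import pandas as pd

# Hypothetical df2 whose 'value' column holds strings (object dtype).
df2 = pd.DataFrame({
    "id_B": ["id1", "id1", "id1"],
    "value": ["100", "200", "230"],
})

# Convert to a numeric dtype; errors="coerce" turns unparseable entries into NaN.
df2["value"] = pd.to_numeric(df2["value"], errors="coerce")

# The column can now be aggregated.
mean_value = df2.groupby("id_B")["value"].mean()
print(mean_value)
```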
