Averaging data while merging

Question

I have two dataframes as shown below:

 result1

     time         browncarbon          blackcarbon
 180.7452   0.506824055392119   0.4693240205237933
 180.748    0.5040641475588111  0.4671092323195378
 180.7508   0.49911820575405846 0.46344714546409305
 180.7535   0.4957944583911674  0.46030629341216533
 180.7563   0.4888745617073804  0.45557451231658985
 180.7591   0.4864626914800723  0.45633142113414893
 180.7619   0.48328511735148877 0.4548510376145042
 180.7646   0.484728828747634   0.4572818652186026
 180.7674   0.4840750981022636  0.45772491443336777
 180.7702   0.4843291425046101  0.4588332952196751

 422 rows x 3 columns

 result2

    start        end      toc 
 180.7452   180.7466    192.0
 180.7438   180.7452    194.0
 180.7424   180.7438    199.0
  180.741   180.7424    208.0
 180.7396   180.741     229.0
 180.7383   180.7396    245.0
 180.7369   180.7383    252.0
 180.7355   180.7369    245.0
 180.7341   180.7355    238.0
 180.7327   180.7341    245.0

 1364 rows x 3 columns

The multiple start and end rows that get encapsulated into one of the time rows should also correspond to one toc row, which should be the average of the multiple toc rows. How do I do that? There is a related answer on Stack Overflow. The link is: Merging two pandas dataframes with complex conditions

result3

result1['rank'] = np.arange(length1)
result3=pd.merge_asof(result1.sort_values('time'),result2,left_on='time',right_on='start')
result3.sort_values('rank').drop(['rank','start','end'], axis=1)

    time          browncarbon          blackcarbon    toc
180.7452    0.506824055392119   0.4693240205237933
 180.748    0.5040641475588111  0.4671092323195378
180.7508    0.49911820575405846 0.46344714546409305
180.7535    0.4957944583911674  0.46030629341216533
180.7563    0.4888745617073804  0.45557451231658985
180.7591    0.4864626914800723  0.45633142113414893
180.7619    0.48328511735148877 0.4548510376145042
180.7646    0.484728828747634   0.4572818652186026
180.7674    0.4840750981022636  0.45772491443336777
180.7702    0.4843291425046101  0.4588332952196751

422 rows X 4 columns

after `merge_as_of`, do not drop `start`. Then you can groupby `start` and `sum`. — Quang Hoang, Jul 17 '19 at 20:26

score 0 · Answer 1 · answered Jul 18 '19 at 13:52

Use cross join by all combination of rows, then filter by boolean indexing with Series.between and aggregate mean, last DataFrame.join to original:

df = result1.assign(a=1).merge(result2.assign(a=1), on='a', how='outer')

s=df[df['time'].between(df['start'],df['end'])].groupby(result1.columns.tolist())['toc'].mean()
df = result1.join(s, result1.columns.tolist())
print (df)
       time  browncarbon  blackcarbon    toc
0  180.7452     0.506824     0.469324  193.0
1  180.7480     0.504064     0.467109    NaN
2  180.7508     0.499118     0.463447    NaN
3  180.7535     0.495794     0.460306    NaN
4  180.7563     0.488875     0.455575    NaN
5  180.7591     0.486463     0.456331    NaN
6  180.7619     0.483285     0.454851    NaN
7  180.7646     0.484729     0.457282    NaN
8  180.7674     0.484075     0.457725    NaN
9  180.7702     0.484329     0.458833    NaN

I am extremely grateful to you fo the code. However, there are 1364 toc values, just like start and end values. Just like 1364 start and end values get merged with 422 time values, there should be 422 toc values, which should be the result of averaging between groups of toc values between the first and the last toc value. Further, when I ran the code, it is giving me NaN in the toc column, every time there is an NaN in the blackcarbon column, although they have no correlation. — Sujai Banerji, Jul 19 '19 at 00:13
@SujaiBanerji - I think you need match average between start and end with time, but it seems not. Can you add expected output to question from sample data for better explanation? — jezrael, Jul 19 '19 at 05:15

score 0 · Answer 2 · answered Jul 19 '19 at 22:00

jezrael's answer above is good, but I would add that grouping by columns that may have NaN values will remove these records. I would only group by time and then put together the resulting series into a new dataframe:

df_aux = result1.assign(a=1).merge(result2.assign(a=1), on='a', how='outer')
series_aux = df[df['time'].between(df['start'],df['end'])].groupby('time')['toc'].mean()

This returns a Pandas Series that you can then combine with any data from result1 that you want to keep.

Averaging data while merging

2 Answers2