
I have a DataFrame of measurements, containing the measured values and the times at which they were taken.

import datetime

import numpy as np
import pandas

time = [datetime.datetime(2011, 1, 1, np.random.randint(0, 23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pandas.DataFrame({'time': time, 'value': np.random.random(10)})

for example:

                 time     value
0 2011-01-01 21:56:00  0.115025
1 2011-01-01 04:40:00  0.678882
2 2011-01-01 02:18:00  0.507168
3 2011-01-01 22:40:00  0.938408
4 2011-01-01 12:53:00  0.193573
5 2011-01-01 19:37:00  0.464744
6 2011-01-01 16:06:00  0.794495
7 2011-01-01 18:32:00  0.482684
8 2011-01-01 13:26:00  0.381747
9 2011-01-01 01:50:00  0.035798

the data-taking is organized in periods and I have another DataFrame for it:

start = pandas.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pandas.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'

for example:

                     start                stop
run                                           
721158 2011-01-01 00:00:00 2011-01-01 00:50:00
340902 2011-01-01 01:00:00 2011-01-01 01:50:00
211578 2011-01-01 02:00:00 2011-01-01 02:50:00
120232 2011-01-01 03:00:00 2011-01-01 03:50:00
122199 2011-01-01 04:00:00 2011-01-01 04:50:00

Now I want to merge the two tables, obtaining:

                 time     value   run
0 2011-01-01 21:56:00  0.115025   NaN
1 2011-01-01 04:40:00  0.678882   122199  
2 2011-01-01 02:18:00  0.507168   211578 
3 2011-01-01 22:40:00  0.938408   NaN
...

Time periods (runs) have a start and a stop, with stop >= start. Different runs never overlap. Even though it is not true in my example, you can assume that the runs are ordered (by run) and that if run1 < run2 then start1 < start2 (or you can simply sort the table by start). You can also assume that df_meas is sorted by time.

How can I do that? Is there something built in? What is the most efficient way?

Ruggero Turra

4 Answers


You can first reshape df_runs with stack, so that start and stop end up in a single time column. Then groupby by run, resample by minute and ffill to fill the NaN values. Finally, merge with df_meas:

Note: this code works in the latest pandas version, 0.18.1; see the docs.

import pandas as pd
import numpy as np
import datetime

#for testing
np.random.seed(1)
time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59)) for _ in range(10)]
df_meas = pd.DataFrame({'time': time, 'value': np.random.random(10)})

start = pd.date_range('1/1/2011', periods=5, freq='H')
stop = start + np.timedelta64(50, 'm')
df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.random.randint(0, 1000000, 5))
df_runs.index.name = 'run'

df = (df_runs.stack().reset_index(level=1, drop=True).reset_index(name='time'))
print (df)
      run                time
0   99335 2011-01-01 00:00:00
1   99335 2011-01-01 00:50:00
2  823615 2011-01-01 01:00:00
3  823615 2011-01-01 01:50:00
4  117565 2011-01-01 02:00:00
5  117565 2011-01-01 02:50:00
6  790038 2011-01-01 03:00:00
7  790038 2011-01-01 03:50:00
8  369977 2011-01-01 04:00:00
9  369977 2011-01-01 04:50:00

df1 = (df.set_index('time')
         .groupby('run')
         .resample('Min')
         .ffill()
         .reset_index(level=0, drop=True)
         .reset_index())

print (df1)
                   time     run
0   2011-01-01 00:00:00   99335
1   2011-01-01 00:01:00   99335
2   2011-01-01 00:02:00   99335
3   2011-01-01 00:03:00   99335
4   2011-01-01 00:04:00   99335
5   2011-01-01 00:05:00   99335
6   2011-01-01 00:06:00   99335
7   2011-01-01 00:07:00   99335
8   2011-01-01 00:08:00   99335
9   2011-01-01 00:09:00   99335
...
...
print (pd.merge(df_meas, df1, on='time', how='left'))
                 time     value       run
0 2011-01-01 05:44:00  0.524548       NaN
1 2011-01-01 12:09:00  0.443453       NaN
2 2011-01-01 09:12:00  0.229577       NaN
3 2011-01-01 05:16:00  0.534414       NaN
4 2011-01-01 00:17:00  0.913962   99335.0
5 2011-01-01 01:13:00  0.457205  823615.0
6 2011-01-01 07:46:00  0.430699       NaN
7 2011-01-01 06:26:00  0.939128       NaN
8 2011-01-01 18:21:00  0.778389       NaN
9 2011-01-01 05:19:00  0.715971       NaN

The solution of IanS is very nice, and I have tried to improve it with pd.lreshape:

df_runs['run1'] = -1 
df_runs = df_runs.reset_index()

run_times = (pd.lreshape(df_runs, {'Run':['run', 'run1'], 
                                   'Time':['start', 'stop']})
               .sort_values('Time')
               .set_index('Time'))

print (run_times['Run'].asof(df_meas['time']))

time
2011-01-01 05:44:00        -1
2011-01-01 12:09:00        -1
2011-01-01 09:12:00        -1
2011-01-01 05:16:00        -1
2011-01-01 00:17:00     99335
2011-01-01 01:13:00    823615
2011-01-01 07:46:00        -1
2011-01-01 06:26:00        -1
2011-01-01 18:21:00        -1
2011-01-01 05:19:00        -1
Name: Run, dtype: int64
jezrael

Edit: As suggested in a comment, there is no need to sort the times. Rather, use stack instead of unstack.

First step: transform the times dataframe

Since the start and end times are nicely ordered, I set them as index. I also add a column with the run id for starts, and NaN for stops. I do this in many lines (hopefully each one self-explanatory), but you could certainly condense the code:

run_times = df_runs.stack().to_frame(name='times')
run_times.reset_index(inplace=True)
run_times['actual_run'] = np.where(run_times['level_1'] == 'start', run_times['run'], np.nan)
run_times.drop(['level_1', 'run'], axis=1, inplace=True)
run_times.set_index('times', drop=True, inplace=True)

Result:

In[101] : run_times
Out[101]: 
                     actual_run
times                          
2011-01-01 00:00:00      110343
2011-01-01 00:50:00         NaN
2011-01-01 01:00:00      839451
2011-01-01 01:50:00         NaN
2011-01-01 02:00:00      742879
2011-01-01 02:50:00         NaN
2011-01-01 03:00:00      275509
2011-01-01 03:50:00         NaN
2011-01-01 04:00:00      788777
2011-01-01 04:50:00         NaN

Second step: lookup the values

You can now look this up in the original dataframe with the asof method:

In[131] : run_times['actual_run'].fillna(-1).asof(df_meas['time'])
Out[131]: 
2011-01-01 21:56:00        -1
2011-01-01 04:40:00    122199
2011-01-01 02:18:00    211578
2011-01-01 22:40:00        -1
2011-01-01 12:53:00        -1

Note that I had to use -1 instead of NaN because asof returns the last valid value.
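If NaN is ultimately wanted in the output, the sentinel can be mapped back afterwards. A small sketch of that step (the run ids and times here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical lookup series in the shape built above:
# run ids at the start times, -1 sentinels at the stop times.
run_times = pd.Series([110343.0, -1.0, 839451.0, -1.0],
                      index=pd.to_datetime(['2011-01-01 00:00', '2011-01-01 00:50',
                                            '2011-01-01 01:00', '2011-01-01 01:50']))
times = pd.Series(pd.to_datetime(['2011-01-01 00:17', '2011-01-01 00:55']))

# asof gives the last valid value; turn the -1 sentinel back into NaN
runs = run_times.asof(times).replace(-1, np.nan)
```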

IanS

Edited

If you want to benefit from the tables being sorted, it is sometimes (or usually) better to leave it to pandas (or numpy). For example, when merging two sorted arrays there is not much you can do by hand, as this answer suggests, and pandas uses low-level functions to do it automatically.

I measured the time used by asof (as in A.asof(I)) and it did not look as if it benefited from I being sorted. But I don't see an easy way to beat it, if it is possible at all.

In my tests, asof was even faster than .loc when the index (A.index) already contained I. The only object I know of that can take advantage of a sorted index is pd.Index, and indeed A.reindex(idx) for idx = pd.Index(I) was much faster (to use it, A.index has to be unique). Unfortunately, the time needed to build the right data frame or series outweighed the benefits.
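As an aside, the "by hand" lookup on sorted tables can be sketched with np.searchsorted (names and values here are illustrative, not from the benchmark):

```python
import numpy as np

# sorted run table: starts, stops and run ids
starts = np.array(['2011-01-01T00:00', '2011-01-01T01:00'], dtype='datetime64[m]')
stops = starts + np.timedelta64(50, 'm')
runs = np.array([721158, 340902])

times = np.array(['2011-01-01T00:17', '2011-01-01T00:55'], dtype='datetime64[m]')

# index of the last run that starts at or before each measurement time
idx = np.searchsorted(starts, times, side='right') - 1
# a time belongs to that run only if it also falls at or before the stop
inside = (idx >= 0) & (times <= stops[idx.clip(0)])
result = np.where(inside, runs[idx.clip(0)], -1)  # run id, or -1 outside any run
```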


The answers by @IanS and @jezrael are very fast. In fact, most of the time (almost 40%) in jezrael's second function is spent in lreshape; sort_values and asof take up to 15%.

Certainly, it's possible to optimize further. My results are quite good, so I put them here.

I use the following setup to generate sorted data frames for testing:

def setup(intervals, periods):
    time = [datetime.datetime(2011, 1, 1, np.random.randint(0,23), np.random.randint(1, 59)) for _ in range(intervals)]
    df_meas = pd.DataFrame({'time': time, 'value': np.random.random(intervals)})
    df_meas = df_meas.sort_values(by='time')
    df_meas.index = range(df_meas.shape[0])

    start = pd.date_range('1/1/2011', periods=periods, freq='H')
    stop = start + np.timedelta64(50, 'm')
    df_runs = pd.DataFrame({'start': start, 'stop': stop}, index=np.unique(np.random.randint(0, 1000000, periods)))
    df_runs.index.name = 'run'

    return df_meas, df_runs

The function benefits from the use of asof and from a few tricks that avoid unnecessary formatting.

def run(df_meas, df_runs):
    run_times = pd.Series(np.concatenate([df_runs.index, [-1] * df_runs.shape[0]]),
                          index=df_runs.values.flatten(order='F'))
    run_times.sort_index(inplace=True)
    return run_times.asof(df_meas['time'])
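A minimal usage sketch of the function above, on tiny hand-made frames (the run ids are taken from the question's example, the times are illustrative):

```python
import numpy as np
import pandas as pd

def run(df_meas, df_runs):
    # run ids at start times, -1 sentinels at stop times, sorted by time
    run_times = pd.Series(np.concatenate([df_runs.index, [-1] * df_runs.shape[0]]),
                          index=df_runs.values.flatten(order='F'))
    run_times.sort_index(inplace=True)
    return run_times.asof(df_meas['time'])

df_runs = pd.DataFrame({'start': pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00']),
                        'stop':  pd.to_datetime(['2011-01-01 00:50', '2011-01-01 01:50'])},
                       index=[721158, 340902])
df_runs.index.name = 'run'
df_meas = pd.DataFrame({'time': pd.to_datetime(['2011-01-01 00:17', '2011-01-01 00:55'])})

res = run(df_meas, df_runs)  # 00:17 falls inside the first run; 00:55 is in a gap
```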

I tested it with intervals=100 and periods=20. The results measured with timeit:

# @jezrael's second function: 
100 loops, best of 3: 3.43 ms per loop
# @IanS's function:
100 loops, best of 3: 3.92 ms per loop
# my function:
1000 loops, best of 3: 752 µs per loop
ptrj

The merge() function can be used to merge two dataframes horizontally:

merge(x, y, by ="name")  # merge df x and y using the "name" column

So you may have to rename the "start" column of the first dataframe to "time" and give it a try...
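In pandas terms, that suggestion would amount to an exact-match merge after renaming. A sketch (note that this only assigns a run to measurements taken exactly at a start time, so it does not solve the interval lookup the question asks for):

```python
import numpy as np
import pandas as pd

df_runs = pd.DataFrame({'start': pd.to_datetime(['2011-01-01 00:00', '2011-01-01 01:00']),
                        'stop':  pd.to_datetime(['2011-01-01 00:50', '2011-01-01 01:50'])},
                       index=[721158, 340902])
df_runs.index.name = 'run'
df_meas = pd.DataFrame({'time': pd.to_datetime(['2011-01-01 00:00', '2011-01-01 00:17']),
                        'value': [0.1, 0.2]})

# rename "start" to "time" and merge on exact equality of the timestamps
runs = df_runs.reset_index().rename(columns={'start': 'time'})
out = pd.merge(df_meas, runs[['time', 'run']], on='time', how='left')
# only the measurement at exactly 00:00 gets a run id; 00:17 stays NaN
```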