Efficiently generating labels for timeseries data based on future timeseries data

Question

I have two years worth of sensor data in a pandas dataframe. The index is a time series. Looks like this:


                     temp1     overtemp  time_to_overtemp
datetime                                                    
2019-01-02 09:31:00  305.96
2019-01-02 09:32:00  305.98
2019-01-02 09:33:00  305.70
2019-01-02 09:34:00  305.30
2019-01-02 09:35:00  306.88

What I want to do is loop over the time series to populate the "overtemp" and "time_to_overtemp columns. "overtemp" needs to be assigned a 1 if the temperature data anytime in the next two weeks increases by more than 2%. "time_to_overtemp" needs to show the time of the next >2% reading, if it exists in the next two weeks. If the temperature says within 2% for the next two weeks, both columns should be assigned a 0.

For example 2019-01-02 09:31:00 should look at the next two weeks worth of temperature data and put a 0 in both columns because all data in that time period is within 2% of the value. The overtemp value for 2020-01-02 09:35:00 should be a 1 because the value increased by 5% a week later. The time_to_overtemp value should indicate 7 days, 2 hours, 38 minutes because thats when the overtemp occured.

I am successfully doing some more math stuff using iterrows:

for datetime, row in df.iterrows():

but its taking forever. And I am not getting how to do the time iterations and calculations at all yet.

I have done other labeling with:

df['overtemp'] = np.select([df['temp1']<305, df['temp1']>305], [1,0])

I guess this vectorizes the process? It sure works a lot faster than iterating. But I can't figure out how to implement the datetime+two week portion.

caiolopes · Answer 1 · 2020-08-05T23:34:11.203

Here is a suggestion on how to approach your problem by using apply and indexing.

But it is not a very fast solution indeed, maybe pandas have a better function to do this.

Or you can parallelize the apply function to make it faster.

df = pd.DataFrame(pd.date_range(start='2020-01-01', end='2020-03-01', freq='H'), columns=['datetime'])
df['temp'] =  [np.random.uniform(low=300, high=310) for _ in range(df.shape[0])]

def get_overtemp(row):
    now = row.datetime
    next_14_days = now + timedelta(days=14)
    temp = row.temp
    
    filtered = df[
        (df['datetime'] > now) & 
        (df['datetime'] <= next_14_days) & 
        (df['temp'] > temp * 1.02)
    ]
    
    overtemp = len(filtered) > 0
    time_to_overtemp = None
    if overtemp:
        time_to_overtemp = filtered['datetime'].values[0] - now
        
    return pd.Series([overtemp, time_to_overtemp])

df[['overtemp', 'time_to_overtemp']] = df.apply(get_overtemp, axis=1)
df.head(20)

Result:

|    | datetime            |    temp | overtemp   | time_to_overtemp   |
|---:|:--------------------|--------:|:-----------|:-------------------|
|  0 | 2020-01-01 00:00:00 | 309.502 | False      | NaT                |
|  1 | 2020-01-01 01:00:00 | 303.816 | True       | 7 days 00:00:00    |
|  2 | 2020-01-01 02:00:00 | 307.297 | False      | NaT                |
|  3 | 2020-01-01 03:00:00 | 306.252 | False      | NaT                |
|  4 | 2020-01-01 04:00:00 | 303.458 | True       | 0 days 07:00:00    |
|  5 | 2020-01-01 05:00:00 | 304.27  | False      | NaT                |
|  6 | 2020-01-01 06:00:00 | 300.98  | True       | 0 days 05:00:00    |
|  7 | 2020-01-01 07:00:00 | 306.652 | False      | NaT                |
|  8 | 2020-01-01 08:00:00 | 304.107 | False      | NaT                |
|  9 | 2020-01-01 09:00:00 | 300.077 | True       | 0 days 02:00:00    |

See: https://github.com/nalepae/pandarallel

I see what youre doing, but I guess Im a bit more confused now. Youre not using a datetime index, so Im not sure how your pd.Timedelta() is functioning correctly. Thanks for taking the time to help though. Ill try to expand on this to get the datetime index working with it. — Fr3nch, Aug 06 '20 at 00:59
I did not using the datetime as the dataframe index. To do the same as my toy example, just run `df = df.reset_index()`. The timedelta I used is `from datetime import timedelta` — caiolopes, Aug 06 '20 at 01:00

score 0 · Accepted Answer · answered Aug 06 '20 at 00:49

Think you could do it with pandas rolling by applying funcs, here is my go, hope this is what you need:

# create random data
ind = pd.date_range(start = pd.to_datetime("2019-01-02 09:31:00"), periods=28*24*60, freq='min')
v = [random.randint(30000, 32000)/100 for x in ind]
df = pd.DataFrame(v, index=ind, columns = ['temp1'])

# define funcs
def overtemp(S):
    l = S[S>=S[-1]*1.02]
    if len(l)>0:
        return l[-1]
    else:
        return 0

def overtemp_seconds(S):
    l = np.argwhere(S.values>=S.values[-1]*1.02)
    if len(l)>0:
        i = l[-1][0]
        delta = S.index[i] - S.index[-1]
        return delta.seconds
    else:
        return 0

# apply funcs to time window (reversed cause rolling gives you time window till offset)
over_temp = df[::-1].rolling('14D').apply(overtemp)[::-1]['temp1']
seconds_to_over_temp = df[::-1].rolling('14D').apply(overtemp_seconds)[::-1]['temp1']

# add to orig df
df['over_temp'] = over_temp
df['seconds_to_over_temp'] = seconds_to_over_temp

Thanks for the help on this one! That works! Doesn't work fast, haha, but I dont think Im going to get a fast solution to this problem. Ill work on parallelizing this to at least cut the time down somewhat. Another quick question, do you have any advise about learning more pandas? Ive been blowing through all the resources I can think of for the last 6 months and feel like I should be able to handle tasks like this on my own. Obviously I can't. — Fr3nch, Aug 06 '20 at 20:24
don' t have any specific ideas for resources, sorry. Can tell you that I first heard of rolling from one of my team members, when I read your question I gussed there should be a way of applying a user defined func over the rolling window and looked for it in the pandas manual, took some time but thought me many things. Think that to learn about specific advanced pandas methods one should read blogs, so q&a's etc. — Ezer K, Aug 06 '20 at 20:55

Efficiently generating labels for timeseries data based on future timeseries data

2 Answers2