Pandas, check if timestamp value exists in resampled 30 min time bin of datetimeindex

Question

I have created a resampled data frame (DF1) in pandas with a datetimeindex. I have a separate dataframe (DF2) with a datetimeindex and time column. If an instance of time from DF2 falls within the 30 min bins of datetimeindex in DF1. I want to mark each instance of time in DF2 with the appropriate speed from the 30 min bin in DF1.

DF1

                   boat_id      speed
time                                      
2015-01-13 09:00:00   28.000000   0.000000
2015-01-13 09:30:00   28.000000   0.723503
2015-01-13 10:00:00   28.000000   2.239399

DF2

                      id  boat_id                 time  state     
time                                                                          
2015-01-18 16:09:03   319437       28  2015-01-18 16:09:03      2    
2015-01-18 16:18:43   319451       28  2015-01-18 16:18:43      0    
2015-03-01 09:39:51   507108       31  2015-03-01 09:39:51      1    
2015-03-01 09:40:58   507109       31  2015-03-01 09:40:58      0

Desired Result

                      id  boat_id                 time      state   speed
time                                                                          
2015-01-18 16:09:03   319437       28  2015-01-18 16:09:03      2 nan   
2015-01-18 16:18:43   319451       28  2015-01-18 16:18:43      0 nan   
2015-03-01 09:39:51   507108       31  2015-03-01 09:39:51      1 2.239399   
2015-03-01 09:40:58   507109       31  2015-03-01 09:40:58      0 2.239399

I created this script to try and do this but I think it's failing because datetimeindex of DF1 is immutable and so my timedelta request doesn't create a start point for the chunk. One thought I had was if it would be possible to copy the datetimeindex of DF1 into a new column where the objects are mutable but I haven't managed it yet so am not 100% sure of the logic. I'm happy to tinker but at the moment I've been stalled for a while so was hoping someone else might have a few ideas.

for row in DF1.iterrows():
    for dfrow in DF2.iterrows():
        if dfrow[0] > row[0] - dt.timedelta(minutes=30) and dfrow[0] < row[0]:
            df['test'] =  row[1]

Why don't you try [`between_time()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.between_time.html)? — Kartik, Nov 05 '15 at 20:31
@Kartik, Thanks for the suggestion, I tried `for row in DF1.iterrows(): for dfrow in DF2.iterrows(): if dfrow['time'] == DF1[pd.DataFrame.between_time(row[0] - dt.timedelta(minutes=15), row[0])]: df['test'] = row[1]` but got the result `TypeError: tuple indices must be integers, not str` If I use an int instead of string I get either `IndexError: tuple index out of range` or `TypeError: unbound method between_time() must be called with DataFrame instance as first argument (got Timestamp instance instead)` Am I missing something? — hselbie, Nov 05 '15 at 20:51
I wonder if it has something to do with the headers in the dataframes? — hselbie, Nov 05 '15 at 21:01

score 1 · Accepted Answer · edited May 23 '17 at 11:44

Performance of iterating is very low. Better is use vectorized solution. I use twice function merge. Docs.

Input:

print df1
                     boat_id     speed
time                                  
2015-03-01 09:00:00       28  0.000000
2015-03-01 09:30:00       28  0.723503
2015-03-01 10:00:00       28  2.239399

print df2
                         id  boat_id                time  state
time                                                           
2015-01-18 16:09:03  319437       28 2015-01-18 16:09:03      2
2015-01-18 16:18:43  319451       28 2015-01-18 16:18:43      0
2015-03-01 09:39:51  507108       31 2015-03-01 09:39:51      1
2015-03-01 09:40:58  507109       31 2015-03-01 09:40:58      0

I reset index of both dataframes and create helper column i filled by 1.

df1 = df1.reset_index()
df2 = df2.reset_index(drop=True)
df1['i'] =  df2['i'] = 1
print df1
                 time  boat_id     speed  i
0 2015-03-01 09:00:00       28  0.000000  1
1 2015-03-01 09:30:00       28  0.723503  1
2 2015-03-01 10:00:00       28  2.239399  1
print df2
       id  boat_id                time  state  i
0  319437       28 2015-01-18 16:09:03      2  1
1  319451       28 2015-01-18 16:18:43      0  1
2  507108       31 2015-03-01 09:39:51      1  1
3  507109       31 2015-03-01 09:40:58      0  1

Then I merged both dataframes by helper column i.

df = df2.merge(df1, on='i', how='left')
df = df.rename(columns={'time_y':'Bin_time', 'time_x':'time'})
print df
        id  boat_id_x                time  state  i            Bin_time  \
0   319437         28 2015-01-18 16:09:03      2  1 2015-03-01 09:00:00   
1   319437         28 2015-01-18 16:09:03      2  1 2015-03-01 09:30:00   
2   319437         28 2015-01-18 16:09:03      2  1 2015-03-01 10:00:00   
3   319451         28 2015-01-18 16:18:43      0  1 2015-03-01 09:00:00   
4   319451         28 2015-01-18 16:18:43      0  1 2015-03-01 09:30:00   
5   319451         28 2015-01-18 16:18:43      0  1 2015-03-01 10:00:00   
6   507108         31 2015-03-01 09:39:51      1  1 2015-03-01 09:00:00   
7   507108         31 2015-03-01 09:39:51      1  1 2015-03-01 09:30:00   
8   507108         31 2015-03-01 09:39:51      1  1 2015-03-01 10:00:00   
9   507109         31 2015-03-01 09:40:58      0  1 2015-03-01 09:00:00   
10  507109         31 2015-03-01 09:40:58      0  1 2015-03-01 09:30:00   
11  507109         31 2015-03-01 09:40:58      0  1 2015-03-01 10:00:00   

    boat_id_y     speed  
0          28  0.000000  
1          28  0.723503  
2          28  2.239399  
3          28  0.000000  
4          28  0.723503  
5          28  2.239399  
6          28  0.000000  
7          28  0.723503  
8          28  2.239399  
9          28  0.000000  
10         28  0.723503  
11         28  2.239399

Output is filtered by bin time:

df = df[((df.time >= (df.Bin_time - dt.timedelta(minutes=30))) & (df.time <= df.Bin_time ))]
df = df.drop(['Bin_time', 'id', 'boat_id_x', 'boat_id_y','state', 'i' ], axis=1 )
print df
                  time     speed
8  2015-03-01 09:39:51  2.239399
11 2015-03-01 09:40:58  2.239399

And df is merged by column time with dataframe df2.

df = df2.merge(df, on='time', how='left').reset_index(drop=True)
df = df.drop([ 'i' ], axis=1 )
print df
       id  boat_id                time  state     speed
0  319437       28 2015-01-18 16:09:03      2       NaN
1  319451       28 2015-01-18 16:18:43      0       NaN
2  507108       31 2015-03-01 09:39:51      1  2.239399
3  507109       31 2015-03-01 09:40:58      0  2.239399

Comparing vectorized and indexing approach you can found in similar answer here.

thanks for the solution, i'm just trying to make it work, when you drop the index the time column reverts to a pandas object rather than `datetime64`, so `timedelta` doesn't work. But I understand your logic and am hopeful for the solution. — hselbie, Dec 02 '15 at 19:04
One issue i've found is using the merge function on `i` takes a very long time and lot's of memory. Using `df.info` I can see the sizes are ~50mb=df2 and ~1mb=df1. Can you suggest an alternate joining method that would create the same join? I've tried `concat` and `append`. — hselbie, Dec 02 '15 at 23:48
https://www.reddit.com/r/Python/comments/33q3gy/pandas_slow_merges/ This link shows that when merging on a single value the DF size balloons and takes forever. I changed the above code to merge on the ID field and it is a million times faster... roughly. — hselbie, Dec 03 '15 at 00:55

score 0 · Answer 2 · edited May 23 '17 at 11:59

0

Adapt this:

for i in range(1, len(DF1.index)):
    DF2.between_time(DF1.index[i-1], DF1.index[i], include_start=True,
                     include_end=True).loc[:,'speed'] = DF1.loc[DF1.index[i],'speed']

With help from: how to use dataframe between_time() function

edited May 23 '17 at 11:59

Community

1
1

answered Nov 05 '15 at 21:29

Kartik

8,347
39
73

thanks @kartik, I tried and received this error message. `ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types` i've tried using `between_time` to select indexes, I was wondering if it's specifically for time rather than dates. – hselbie Nov 05 '15 at 23:03
No. I think removing the colon after the comma in iloc will do the trick. It is probably complaining that it didn't find any integers on either side of the colon. Try `.iloc[i-1,]` and `.iloc[i,]` instead of `.iloc[i-1,:]` and `.iloc[i,:]`... – Kartik Nov 05 '15 at 23:18
again thank you. `for i in range(1, len(fi.index)): df.between_time(fi.iloc[i-1,].index, fi.iloc[i,].index, include_start=True, include_end=True)['speed'] = fi.iloc[i,'speed']` same error message – hselbie Nov 05 '15 at 23:23
I figured it out... Sorry for not thinking it through at the beginning. `.loc[]` will return a Series when selecting only one row, and that will have index as the name. But there is a simpler way rather than using `.iloc`... I'm editing my answer accordingly. Please take a look and let me know if it doesn't work. – Kartik Nov 05 '15 at 23:42
Also, this is actually embarrassing. I passed `'speed'` to `.iloc[]`, and that was the cause of the error. Sorry. – Kartik Nov 05 '15 at 23:47
Still no go, /anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy app.launch_new_instance() – hselbie Nov 05 '15 at 23:54
That's a warning. You should have a result... In any case, I have changed the code to avoid the warning. – Kartik Nov 05 '15 at 23:58
Still not appending the speeds to DF2, I've looked at both data frames and exported both to CSV to make sure I haven't missed something. I'm starting to have some success using this code ` for i in range(1, len(fi.index)): if df.index[i] > fi.index[i-1] and df.index[i] < fi.index[1]: df['speed'] = fi['speed'] ` A simpler form of what you've been trying to do without using between_time. – hselbie Nov 06 '15 at 00:44
Maybe `between_time` is extracting a view of DF1, which is not propagating the changes to back to DF1... The solution will then be extracting the index, from `between_time` and using that to append into DF1 directly. – Kartik Nov 06 '15 at 01:58

Pandas, check if timestamp value exists in resampled 30 min time bin of datetimeindex

2 Answers2

Linked