3

I have created a resampled data frame (DF1) in pandas with a datetimeindex. I have a separate dataframe (DF2) with a datetimeindex and time column. If an instance of time from DF2 falls within the 30 min bins of datetimeindex in DF1. I want to mark each instance of time in DF2 with the appropriate speed from the 30 min bin in DF1.

DF1

                   boat_id      speed
time                                      
2015-01-13 09:00:00   28.000000   0.000000
2015-01-13 09:30:00   28.000000   0.723503
2015-01-13 10:00:00   28.000000   2.239399

DF2

                      id  boat_id                 time  state     
time                                                                          
2015-01-18 16:09:03   319437       28  2015-01-18 16:09:03      2    
2015-01-18 16:18:43   319451       28  2015-01-18 16:18:43      0    
2015-03-01 09:39:51   507108       31  2015-03-01 09:39:51      1    
2015-03-01 09:40:58   507109       31  2015-03-01 09:40:58      0 

Desired Result

                      id  boat_id                 time      state   speed
time                                                                          
2015-01-18 16:09:03   319437       28  2015-01-18 16:09:03      2 nan   
2015-01-18 16:18:43   319451       28  2015-01-18 16:18:43      0 nan   
2015-03-01 09:39:51   507108       31  2015-03-01 09:39:51      1 2.239399   
2015-03-01 09:40:58   507109       31  2015-03-01 09:40:58      0 2.239399

I created this script to try and do this but I think it's failing because datetimeindex of DF1 is immutable and so my timedelta request doesn't create a start point for the chunk. One thought I had was if it would be possible to copy the datetimeindex of DF1 into a new column where the objects are mutable but I haven't managed it yet so am not 100% sure of the logic. I'm happy to tinker but at the moment I've been stalled for a while so was hoping someone else might have a few ideas.

for row in DF1.iterrows():
    for dfrow in DF2.iterrows():
        if dfrow[0] > row[0] - dt.timedelta(minutes=30) and dfrow[0] < row[0]:
            df['test'] =  row[1]
Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
hselbie
  • 1,749
  • 9
  • 24
  • 40
  • Why don't you try [`between_time()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.between_time.html)? – Kartik Nov 05 '15 at 20:31
  • @Kartik, Thanks for the suggestion, I tried `for row in DF1.iterrows(): for dfrow in DF2.iterrows(): if dfrow['time'] == DF1[pd.DataFrame.between_time(row[0] - dt.timedelta(minutes=15), row[0])]: df['test'] = row[1]` but got the result `TypeError: tuple indices must be integers, not str` If I use an int instead of string I get either `IndexError: tuple index out of range` or `TypeError: unbound method between_time() must be called with DataFrame instance as first argument (got Timestamp instance instead)` Am I missing something? – hselbie Nov 05 '15 at 20:51
  • I wonder if it has something to do with the headers in the dataframes? – hselbie Nov 05 '15 at 21:01
  • Dude! That is wrong! Wait, let me answer... – Kartik Nov 05 '15 at 21:18

2 Answers2

1

Performance of iterating is very low. Better is use vectorized solution. I use twice function merge. Docs.

Input:

print df1
                     boat_id     speed
time                                  
2015-03-01 09:00:00       28  0.000000
2015-03-01 09:30:00       28  0.723503
2015-03-01 10:00:00       28  2.239399

print df2
                         id  boat_id                time  state
time                                                           
2015-01-18 16:09:03  319437       28 2015-01-18 16:09:03      2
2015-01-18 16:18:43  319451       28 2015-01-18 16:18:43      0
2015-03-01 09:39:51  507108       31 2015-03-01 09:39:51      1
2015-03-01 09:40:58  507109       31 2015-03-01 09:40:58      0

I reset index of both dataframes and create helper column i filled by 1.

df1 = df1.reset_index()
df2 = df2.reset_index(drop=True)
df1['i'] =  df2['i'] = 1
print df1
                 time  boat_id     speed  i
0 2015-03-01 09:00:00       28  0.000000  1
1 2015-03-01 09:30:00       28  0.723503  1
2 2015-03-01 10:00:00       28  2.239399  1
print df2
       id  boat_id                time  state  i
0  319437       28 2015-01-18 16:09:03      2  1
1  319451       28 2015-01-18 16:18:43      0  1
2  507108       31 2015-03-01 09:39:51      1  1
3  507109       31 2015-03-01 09:40:58      0  1

Then I merged both dataframes by helper column i.

df = df2.merge(df1, on='i', how='left')
df = df.rename(columns={'time_y':'Bin_time', 'time_x':'time'})
print df
        id  boat_id_x                time  state  i            Bin_time  \
0   319437         28 2015-01-18 16:09:03      2  1 2015-03-01 09:00:00   
1   319437         28 2015-01-18 16:09:03      2  1 2015-03-01 09:30:00   
2   319437         28 2015-01-18 16:09:03      2  1 2015-03-01 10:00:00   
3   319451         28 2015-01-18 16:18:43      0  1 2015-03-01 09:00:00   
4   319451         28 2015-01-18 16:18:43      0  1 2015-03-01 09:30:00   
5   319451         28 2015-01-18 16:18:43      0  1 2015-03-01 10:00:00   
6   507108         31 2015-03-01 09:39:51      1  1 2015-03-01 09:00:00   
7   507108         31 2015-03-01 09:39:51      1  1 2015-03-01 09:30:00   
8   507108         31 2015-03-01 09:39:51      1  1 2015-03-01 10:00:00   
9   507109         31 2015-03-01 09:40:58      0  1 2015-03-01 09:00:00   
10  507109         31 2015-03-01 09:40:58      0  1 2015-03-01 09:30:00   
11  507109         31 2015-03-01 09:40:58      0  1 2015-03-01 10:00:00   

    boat_id_y     speed  
0          28  0.000000  
1          28  0.723503  
2          28  2.239399  
3          28  0.000000  
4          28  0.723503  
5          28  2.239399  
6          28  0.000000  
7          28  0.723503  
8          28  2.239399  
9          28  0.000000  
10         28  0.723503  
11         28  2.239399  

Output is filtered by bin time:

df = df[((df.time >= (df.Bin_time - dt.timedelta(minutes=30))) & (df.time <= df.Bin_time ))]
df = df.drop(['Bin_time', 'id', 'boat_id_x', 'boat_id_y','state', 'i' ], axis=1 )
print df
                  time     speed
8  2015-03-01 09:39:51  2.239399
11 2015-03-01 09:40:58  2.239399

And df is merged by column time with dataframe df2.

df = df2.merge(df, on='time', how='left').reset_index(drop=True)
df = df.drop([ 'i' ], axis=1 )
print df
       id  boat_id                time  state     speed
0  319437       28 2015-01-18 16:09:03      2       NaN
1  319451       28 2015-01-18 16:18:43      0       NaN
2  507108       31 2015-03-01 09:39:51      1  2.239399
3  507109       31 2015-03-01 09:40:58      0  2.239399

Comparing vectorized and indexing approach you can found in similar answer here.

Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • thanks for the solution, i'm just trying to make it work, when you drop the index the time column reverts to a pandas object rather than `datetime64`, so `timedelta` doesn't work. But I understand your logic and am hopeful for the solution. – hselbie Dec 02 '15 at 19:04
  • Brilliant, it was the `timedelta` issue I outlined above. – hselbie Dec 02 '15 at 19:13
  • One issue i've found is using the merge function on `i` takes a very long time and lot's of memory. Using `df.info` I can see the sizes are ~50mb=df2 and ~1mb=df1. Can you suggest an alternate joining method that would create the same join? I've tried `concat` and `append`. – hselbie Dec 02 '15 at 23:48
  • https://www.reddit.com/r/Python/comments/33q3gy/pandas_slow_merges/ This link shows that when merging on a single value the DF size balloons and takes forever. I changed the above code to merge on the ID field and it is a million times faster... roughly. – hselbie Dec 03 '15 at 00:55
0

Adapt this:

for i in range(1, len(DF1.index)):
    DF2.between_time(DF1.index[i-1], DF1.index[i], include_start=True,
                     include_end=True).loc[:,'speed'] = DF1.loc[DF1.index[i],'speed']

With help from: how to use dataframe between_time() function

Community
  • 1
  • 1
Kartik
  • 8,347
  • 39
  • 73
  • thanks @kartik, I tried and received this error message. `ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types` i've tried using `between_time` to select indexes, I was wondering if it's specifically for time rather than dates. – hselbie Nov 05 '15 at 23:03
  • No. I think removing the colon after the comma in iloc will do the trick. It is probably complaining that it didn't find any integers on either side of the colon. Try `.iloc[i-1,]` and `.iloc[i,]` instead of `.iloc[i-1,:]` and `.iloc[i,:]`... – Kartik Nov 05 '15 at 23:18
  • again thank you. `for i in range(1, len(fi.index)): df.between_time(fi.iloc[i-1,].index, fi.iloc[i,].index, include_start=True, include_end=True)['speed'] = fi.iloc[i,'speed']` same error message – hselbie Nov 05 '15 at 23:23
  • I figured it out... Sorry for not thinking it through at the beginning. `.loc[]` will return a Series when selecting only one row, and that will have index as the name. But there is a simpler way rather than using `.iloc`... I'm editing my answer accordingly. Please take a look and let me know if it doesn't work. – Kartik Nov 05 '15 at 23:42
  • Also, this is actually embarrassing. I passed `'speed'` to `.iloc[]`, and that was the cause of the error. Sorry. – Kartik Nov 05 '15 at 23:47
  • Still no go, /anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy app.launch_new_instance() – hselbie Nov 05 '15 at 23:54
  • That's a warning. You should have a result... In any case, I have changed the code to avoid the warning. – Kartik Nov 05 '15 at 23:58
  • Still not appending the speeds to DF2, I've looked at both data frames and exported both to CSV to make sure I haven't missed something. I'm starting to have some success using this code ` for i in range(1, len(fi.index)): if df.index[i] > fi.index[i-1] and df.index[i] < fi.index[1]: df['speed'] = fi['speed'] ` A simpler form of what you've been trying to do without using between_time. – hselbie Nov 06 '15 at 00:44
  • Maybe `between_time` is extracting a view of DF1, which is not propagating the changes to back to DF1... The solution will then be extracting the index, from `between_time` and using that to append into DF1 directly. – Kartik Nov 06 '15 at 01:58