
I have a DataFrame created by merging my original DataFrame with a resampled version of it. The resampled version contains the columns Bin_time and ave_knots, and the two were merged on the shared field ID to create this DF.

    id      trip_id knots       times           Bin_time        ave_knots
0   5045493 10789   6.4924256   5/6/15 17:48    5/6/15 17:30    3.376727771
1   5045494 10789   12.537768   5/6/15 17:53    5/6/15 18:00    2.592866578
2   5045495 10789   9.136048    5/6/15 18:03    5/6/15 18:30    1.109755927
3   5045496 10789   6.4341104   5/6/15 18:04    5/6/15 19:00    1.447413169
4   5045497 10789   10.7688736  5/6/15 18:04    5/6/15 19:30    1.900643556
5   5045498 10789   10.41250293 5/6/15 18:09        
6   5045499 10789   11.22891573 5/6/15 18:41        
7   5045500 10789   12.04532853 5/6/15 18:46        
8   5045501 10789   12.86174133 5/6/15 19:08        

What I'm trying to do is: wherever a timestamp in times falls into one of the resampled Bin_time bins, append that Bin_time and ave_knots to the end of the row. This is a pared-down example; ultimately it will have to work on 3-5 million rows.

I have tried a few different methods, e.g. `df['test'] = np.where((df.times > (df.Bin_time - dt.timedelta(minutes=30))) & (df.times < df.Bin_time))`, and merging on a single-value helper column as suggested here. The problem with that solution is that it created an enormous DF, by merging every row of the resampled DF with every row of the original DF, which crashed my computer when I ran the script on the larger DFs. I have got it to work by writing the DFs out to CSVs, but that is slow and inefficient, so I'm trying to keep everything in Pandas, which I believe should offer a faster solution.

Desired Result

    id      trip_id knots       times           Bin_time        ave_knots
0   5045493 10789   6.4924256   5/6/15 17:48    5/6/15 17:30    3.376727771
1   5045494 10789   12.537768   5/6/15 17:53    5/6/15 17:30    3.376727771
2   5045495 10789   9.136048    5/6/15 18:03    5/6/15 18:00    2.592866578
3   5045496 10789   6.4341104   5/6/15 18:04    5/6/15 18:00    2.592866578
4   5045497 10789   10.7688736  5/6/15 18:04    5/6/15 18:00    2.592866578
5   5045498 10789   10.41250293 5/6/15 18:09    5/6/15 18:00    2.592866578 
6   5045499 10789   11.22891573 5/6/15 18:41    5/6/15 18:30    1.109755927 
7   5045500 10789   12.04532853 5/6/15 18:46    5/6/15 18:30    1.109755927 
8   5045501 10789   12.86174133 5/6/15 19:08    5/6/15 19:00    1.447413169
hselbie
  • In the top dataframe, why did you merge `Bin_time` and `ave_knots` with the other columns? From your desired output, it looks like they have no direct relation (i.e. the bin time of `5/6/15 19:00` doesn't actually belong at index 3 / id 5045496), so wouldn't it be better to keep those columns in a separate DataFrame? Then you could reindex it to the appropriate "times" with ffill. – Ajean Dec 03 '15 at 18:46
  • @Ajean, I merged because I thought it might make a difference to have everything in one dataframe and then I could do some kind of conditional selection. This is my first experience with pandas so am learning as I go. Regarding reindexing to appropriate `times`, with ffill, can you expand a little? – hselbie Dec 03 '15 at 18:52

1 Answer


I would start by keeping the resampled output separate from the original DataFrame. I've copied your example into the following code in a way that hopefully mimics your actual data (note that the date columns must be parsed as actual datetime objects, or this won't work).

import pandas as pd
from io import StringIO  # on Python 2, use: from StringIO import StringIO

df = pd.read_table(StringIO("""
id      trip_id knots       times       
5045493 10789   6.4924256   5/6/15-17:48
5045494 10789   12.537768   5/6/15-17:53
5045495 10789   9.136048    5/6/15-18:03
5045496 10789   6.4341104   5/6/15-18:04
5045497 10789   10.7688736  5/6/15-18:04
5045498 10789   10.41250293 5/6/15-18:09
5045499 10789   11.22891573 5/6/15-18:41
5045500 10789   12.04532853 5/6/15-18:46
5045501 10789   12.86174133 5/6/15-19:08"""), sep=r'\s+', parse_dates=[3])

df2 = pd.read_table(StringIO("""
Bin_time        ave_knots
5/6/15-17:30    3.376727771
5/6/15-18:00    2.592866578
5/6/15-18:30    1.109755927
5/6/15-19:00    1.447413169
5/6/15-19:30    1.900643556"""), sep=r'\s+', parse_dates=[0])

# Index df2 by bin time, keeping the column as well so it survives the concat.
df2 = df2.set_index('Bin_time', drop=False)

# Sample df2 at every timestamp in df; 'ffill' maps each time to the most
# recent earlier bin. reset_index restores the plain 0..n-1 integer index.
df2 = df2.reindex(df['times'], method='ffill').reset_index(drop=True)

# Both frames now share the same integer index, so concat aligns row-for-row.
df = pd.concat([df, df2[['Bin_time', 'ave_knots']]], axis=1)

This works by setting the index of df2 to the Bin_time column, so that reindexing with "forward fill" automatically puts each row into the right bin. You can read the description in the pandas documentation on filling missing data while reindexing.
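To see the forward-fill lookup in isolation, here is a minimal sketch with made-up values (not taken from the question's data): a Series indexed by bin start times, reindexed at two arbitrary timestamps.

```python
import pandas as pd

# Hypothetical data: one value per half-hour bin, indexed by bin start time.
bins = pd.Series([1.0, 2.0, 3.0],
                 index=pd.to_datetime(['2015-05-06 17:30',
                                       '2015-05-06 18:00',
                                       '2015-05-06 18:30']))

# Reindex at two arbitrary timestamps; 'ffill' propagates the value of the
# most recent earlier bin into each new position.
times = pd.to_datetime(['2015-05-06 17:48', '2015-05-06 18:05'])
print(bins.reindex(times, method='ffill').tolist())  # [1.0, 2.0]
```

17:48 picks up the 17:30 bin's value and 18:05 picks up the 18:00 bin's value, which is exactly the "which bin does this timestamp belong to" lookup.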

You then have to use reset_index to get back the straightforward integer index that df has, so that pd.concat aligns the rows correctly.

I don't know if this is the most memory-efficient approach, but it should work pretty cleanly.
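As an aside (this postdates the answer): newer pandas versions ship `pd.merge_asof` (added in 0.19), which performs this backward-looking join directly, without building an intermediate index. A minimal sketch with hypothetical values, not the question's data:

```python
import pandas as pd

# merge_asof matches each left row to the most recent earlier key in the
# right frame (direction='backward' is the default). Both frames must be
# sorted on their join keys.
left = pd.DataFrame({'times': pd.to_datetime(['2015-05-06 17:48',
                                              '2015-05-06 18:05'])})
right = pd.DataFrame({'Bin_time': pd.to_datetime(['2015-05-06 17:30',
                                                  '2015-05-06 18:00']),
                      'ave_knots': [3.38, 2.59]})

out = pd.merge_asof(left, right, left_on='times', right_on='Bin_time')
print(out['ave_knots'].tolist())  # [3.38, 2.59]
```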

The output df looks like this:

        id  trip_id      knots               times            Bin_time   ave_knots 
0  5045493    10789   6.492426 2015-05-06 17:48:00 2015-05-06 17:30:00   3.376728  
1  5045494    10789  12.537768 2015-05-06 17:53:00 2015-05-06 17:30:00   3.376728  
2  5045495    10789   9.136048 2015-05-06 18:03:00 2015-05-06 18:00:00   2.592867  
3  5045496    10789   6.434110 2015-05-06 18:04:00 2015-05-06 18:00:00   2.592867  
4  5045497    10789  10.768874 2015-05-06 18:04:00 2015-05-06 18:00:00   2.592867  
5  5045498    10789  10.412503 2015-05-06 18:09:00 2015-05-06 18:00:00   2.592867  
6  5045499    10789  11.228916 2015-05-06 18:41:00 2015-05-06 18:30:00   1.109756  
7  5045500    10789  12.045329 2015-05-06 18:46:00 2015-05-06 18:30:00   1.109756  
8  5045501    10789  12.861741 2015-05-06 19:08:00 2015-05-06 19:00:00   1.447413  
Ajean
  • THANK YOU for this, you have no idea how long I've been bashing my head against this problem. If it's not too much trouble, can you explain how `reindex` with the ffill method works? – hselbie Dec 03 '15 at 23:52
  • Also on the larger files, my processing time has gone from multiple hours on the largest ones to 54 secs. Pandas rocks. – hselbie Dec 04 '15 at 00:00
  • It's a bit hard to explain in a comment, so I'd suggest spending some quality time with the docs. At a high level, `reindex` just samples the current columns at new index positions (very handy for date/time functionality). If the new index doesn't have an exact match in the old index, it makes a null value there. The 'method' parameter determines how those get filled - 'ffill' will take valid values in the old index and propagate them forward for any null values that occur after them, until they hit the next valid value. – Ajean Dec 04 '15 at 00:44