2

Currently working with an interesting transport smart card dataset. Each line in the current data represent a trip (e.g. bus trip from A to B). Any trips within 60 min needs to be grouped into journey.

The current table:

  CustomerID  SegmentID Origin Dest StartTime  EndTime  Fare    Type
0       A001        101      A    B    7:30am   7:45am   1.5     Bus
1       A001        102      B    C    7:50am   8:30am   3.5   Train
2       A001        103      C    B   17:10pm  18:00pm   3.5   Train
3       A001        104      B    A   18:10pm  18:30pm   1.5     Bus
4       A002        105      K    Y   11:30am  12:30pm   3.0   Train
5       A003        106      P    O   10:23am  11:13am   4.0  Ferrie

and covert into something like:

  CustomerID  JourneyID Origin Dest Start Time End Time  Fare        Type  NumTrips
0       A001          1      A    C     7:30am   8:30am     5  Intermodal         2
1       A001          2      C    A    17:10pm  18:30pm     5  Intermodal         2
2       A002          6      K    Y    11:30am  12:30pm     3       Train         1
3       A003          8      P    O    10:23am  11:13am     4      Ferrie         1

I'm new to Python and Pandas and have no idea how to start, so any guidance would be appreciated.

JohnE
  • 29,156
  • 8
  • 79
  • 109
JR Wink
  • 107
  • 7
  • Are you sure you want to define journey as any set of trips falling within an hour, and not the whole day tour starting / ending at home? – ako Feb 20 '16 at 06:30
  • yes ... trips falling within an hour – JR Wink Feb 20 '16 at 09:45
  • Check out these answers [pandas-combining-rows-based-on-dates](http://stackoverflow.com/questions/33264676/pandas-combining-rows-based-on-dates) and [how-to-combine-records-based-on-date](http://stackoverflow.com/questions/35188840/how-to-combine-records-based-on-date-using-python-connected-components/35193894#35193894) – RootTwo Feb 20 '16 at 18:42

1 Answers1

0

Here's a fairly complete answer. You didn't fully specify the concept of a single journey so I took a guess. You could adjust mask below to better suit your own definition.

# get rid of am/pm and convert to proper datetime
# converts to year 1900 b/c it's not specified, doesn't matter here
df['StTime'] = pd.to_datetime( df.StartTime.str[:-2], format='%H:%M' )
df['EndTime'] = pd.to_datetime( df.EndTime.str[:-2], format='%H:%M' )

# some of the later processing is easier if you use duration
# instead of arrival time
df['Duration'] = df.EndTime-df.StTime

# get rid of some nuisance variables for clarity
df = df[['CustomerID','Origin','Dest','StTime','Duration','Fare','Type']]

First, we need to figure out a way group the rows. As this is not well specified in the question, I'll group by Customer ID where Start Times are within 1 hr. Note that for tri-modal trips this actually implies that start times of the first and third trips could differ by more than one hour as long as first+second and second+third are each individaully under 1 hour. This seems like a natural way to do it, but for your actual use case you'd have to adjust this for your desired definition. There are quite a few ways you could proceed here.

mask1 = df.StTime - df.StTime.shift(1) <= pd.Timedelta('01:00:00')
mask2 = (df.CustomerID == df.CustomerID.shift(1))
mask = ( mask1 & mask2 )    

Now we can use the mask with cumsum to generate a tripID:

df['JourneyID'] = 1
df.ix[mask,'JourneyID'] = 0
df['JourneyID'] = df['JourneyID'].cumsum()
df['NumTrips'] = 1

df[['CustomerID','StTime','Fare','JourneyID']]

  CustomerID              StTime  Fare  JourneyID
0       A001 1900-01-01 07:30:00   1.5          1
1       A001 1900-01-01 07:50:00   3.5          1
2       A001 1900-01-01 17:10:00   3.5          2
3       A001 1900-01-01 18:10:00   1.5          2
4       A002 1900-01-01 11:30:00   3.0          3
5       A003 1900-01-01 10:23:00   4.0          4

Now, for each column just aggregate appropriately:

df2 = df.groupby('JourneyID').agg({ 'Origin' : sum, 'CustomerID' : min,
                                    'Dest'   : sum, 'StTime'     : min,
                                    'Fare'   : sum, 'Duration'   : sum,
                                    'Type'   : sum, 'NumTrips'   : sum })

                      StTime Dest Origin Fare Duration     Type CustomerID NumTrips
JourneyID                                                                               
1        1900-01-01 07:30:00   BC     AB    5 00:55:00 BusTrain       A001        2
2        1900-01-01 17:10:00   BA     CB    5 01:10:00 TrainBus       A001        2
3        1900-01-01 11:30:00    Y      K    3 01:00:00    Train       A002        1
4        1900-01-01 10:23:00    O      P    4 00:50:00   Ferrie       A003        1

Note that Duration includes only travel time and not the time in-between trips (e.g. if start time of second trip is later than the end time of first trip).

JohnE
  • 29,156
  • 8
  • 79
  • 109