2

Could you kindly help me to write the following concept in python pandas, I have the following datatype:

id=["Train A","Train A","Train A","Train B","Train B","Train B"]
start = ["A","B","C","D","E","F"]
end = ["G","H","I","J","K","L"]
arrival_time = ["0"," 2016-05-19 13:50:00","2016-05-19 21:25:00","0","2016-05-24 18:30:00","2016-05-26 12:15:00"]
departure_time = ["2016-05-19 08:25:00","2016-05-19 16:00:00","2016-05-20 07:25:00","2016-05-24 12:50:00","2016-05-25 23:00:00","2016-05-26 19:45:00"]
capacity = ["2","2","3","3","2","3"]

To obtain the following data:

id         arrival_time         departure_time         start  end  capacity

Train A          0                  2016-05-19 08:25:00   A     G    2
Train A   2016-05-19 13:50:00       2016-05-19 16:00:00   B     H    2
Train A   2016-05-19 21:25:00       2016-05-20 07:25:00   C     I    3
Train B          0                  2016-05-24 12:50:00   D     J    3
Train B   2016-05-24 18:30:00       2016-05-25 20:00:00   E     K    2
Train B   2016-05-26 12:15:00       2016-05-26 19:45:00   F     L    3

I would like to add a column called source and sink and if the time difference between arrival and departure is less than 3 hours, the source is the starting of the trip and the sink is only when the trip breaks (ie when time_difference is more than 3 hours,

time difference   source     sink
     -              A         H
     02:10:00       A         H
     10:00:00       C         I
     -              D         K
     01:30:00       D         K
     19:30:00       F         L
  • Does your if function need only information from the same row that it will ultimately update? If so, then using the "apply" function on the dataframe would work as per the answer here: http://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns But I suspect the answer you're looking for requires some cross-row comparison, is that right? – Thomas Kimber May 10 '17 at 15:08
  • apply is needlessly slow here. `numpy.where` would be better. `df2 = df.assign(source_or_sink=numpy.where(, df['source'], df['sink'])` – Paul H May 10 '17 at 15:09
  • thank you , i will take a look at that ! yes, if statement need only information of time difference and not necessary to create a column. and true, it has to merge two row's data if if statement is satisfied –  May 10 '17 at 15:10
  • @PaulH Thanks ! i will try that probably ! –  May 10 '17 at 15:12

1 Answers1

2
df = df.assign(timediff=(df.departure_time - df.arrival_time))

df = df.assign(source = np.where(df.timediff.dt.seconds / 3600 < 3, df.shift(1).start, df.start))

df = df.assign(sink = np.where(df.timediff.dt.seconds.shift(1) / 3600 > 3, df.shift(-1).end, df.end))

print(df)

Output:

        id        arrival_time      departure_time start end  capacity sink  \
0  Train A                 NaT 2016-05-19 08:25:00     A   G         2    G   
1  Train A 2016-05-19 13:50:00 2016-05-19 16:00:00     B   H         2    H   
2  Train A 2016-05-19 21:25:00 2016-05-20 07:25:00     C   I         3    I   
3  Train B                 NaT 2016-05-24 12:50:00     D   J         3    K   
4  Train B 2016-05-24 18:30:00 2016-05-25 20:00:00     E   K         2    K   
5  Train B 2016-05-26 12:15:00 2016-05-26 19:45:00     F   L         3    L   

         timediff source  
0             NaT      A  
1 0 days 02:10:00      A  
2 0 days 10:00:00      C  
3             NaT      D  
4 1 days 01:30:00      D  
5 0 days 07:30:00      F
Scott Boston
  • 147,308
  • 15
  • 139
  • 187