How to use pandas to add new column using if statement?

Question

Could you kindly help me to write the following concept in python pandas, I have the following datatype:

id=["Train A","Train A","Train A","Train B","Train B","Train B"]
start = ["A","B","C","D","E","F"]
end = ["G","H","I","J","K","L"]
arrival_time = ["0"," 2016-05-19 13:50:00","2016-05-19 21:25:00","0","2016-05-24 18:30:00","2016-05-26 12:15:00"]
departure_time = ["2016-05-19 08:25:00","2016-05-19 16:00:00","2016-05-20 07:25:00","2016-05-24 12:50:00","2016-05-25 23:00:00","2016-05-26 19:45:00"]
capacity = ["2","2","3","3","2","3"]

To obtain the following data:

id         arrival_time         departure_time         start  end  capacity

Train A          0                  2016-05-19 08:25:00   A     G    2
Train A   2016-05-19 13:50:00       2016-05-19 16:00:00   B     H    2
Train A   2016-05-19 21:25:00       2016-05-20 07:25:00   C     I    3
Train B          0                  2016-05-24 12:50:00   D     J    3
Train B   2016-05-24 18:30:00       2016-05-25 20:00:00   E     K    2
Train B   2016-05-26 12:15:00       2016-05-26 19:45:00   F     L    3

I would like to add a column called source and sink and if the time difference between arrival and departure is less than 3 hours, the source is the starting of the trip and the sink is only when the trip breaks (ie when time_difference is more than 3 hours,

time difference   source     sink
     -              A         H
     02:10:00       A         H
     10:00:00       C         I
     -              D         K
     01:30:00       D         K
     19:30:00       F         L

Does your if function need only information from the same row that it will ultimately update? If so, then using the "apply" function on the dataframe would work as per the answer here: http://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns But I suspect the answer you're looking for requires some cross-row comparison, is that right? — Thomas Kimber, May 10 '17 at 15:08
apply is needlessly slow here. `numpy.where` would be better. `df2 = df.assign(source_or_sink=numpy.where(, df['source'], df['sink'])` — Paul H, May 10 '17 at 15:09
thank you , i will take a look at that ! yes, if statement need only information of time difference and not necessary to create a column. and true, it has to merge two row's data if if statement is satisfied — , May 10 '17 at 15:10

Scott Boston · Accepted Answer · 2017-05-10T15:31:30.437

df = df.assign(timediff=(df.departure_time - df.arrival_time))

df = df.assign(source = np.where(df.timediff.dt.seconds / 3600 < 3, df.shift(1).start, df.start))

df = df.assign(sink = np.where(df.timediff.dt.seconds.shift(1) / 3600 > 3, df.shift(-1).end, df.end))

print(df)

Output:

        id        arrival_time      departure_time start end  capacity sink  \
0  Train A                 NaT 2016-05-19 08:25:00     A   G         2    G   
1  Train A 2016-05-19 13:50:00 2016-05-19 16:00:00     B   H         2    H   
2  Train A 2016-05-19 21:25:00 2016-05-20 07:25:00     C   I         3    I   
3  Train B                 NaT 2016-05-24 12:50:00     D   J         3    K   
4  Train B 2016-05-24 18:30:00 2016-05-25 20:00:00     E   K         2    K   
5  Train B 2016-05-26 12:15:00 2016-05-26 19:45:00     F   L         3    L   

         timediff source  
0             NaT      A  
1 0 days 02:10:00      A  
2 0 days 10:00:00      C  
3             NaT      D  
4 1 days 01:30:00      D  
5 0 days 07:30:00      F

I'm sure you meant `df.timediff` – zipa May 10 '17 at 15:28 — zipa, May 10 '17 at 15:28

How to use pandas to add new column using if statement?

1 Answers1

Linked