1
import pandas as pd

stoptimes_df = pd.DataFrame({
    'trip_id': ['1', '1', '1', '2', '2', '2'], 
    'arrival_time': ["12:10:00", "12:20:00", "12:30:00", "27:32:00", "27:39:00", "27:45:00"],
    'departure_time': ["12:10:00", "12:20:00", "12:30:00", "27:32:00", "27:39:00", "27:45:00"],
    'stop_id': ['de:08437:48835:0:2', 'de:08426:6306', 'de:08426:6307', 'de:08116:6703', 'de:08116:3821', 'de:08415:28256:0:1']})

I have this dataframe given, which shows different bus lines (trip_id) and the different stops, and I would like to insert a new column which contains the difference between the arrival time of the following line and the departure time of the line before. Unfortunately I am not able to do this because when I change the datatype to datetime.time() I can not calculate which the times. This is only possible if I use the datatype datetime.datetime(), but then I have also a date in the columns "arrival_time" and "departure time" written, like "1900-01-01 12:10:00", which I do not want. I have a similar problem when I use timedelta. So the point is I want to keep only the times without a date in the two given columns and in the new column there should be the time difference in minutes or seconds. For example in the last line it should say in the new column 6 (min) or 300 (sec). Does someone know how to do this?

What I did so far in code:

def convert_to_datetime(time):
    hours, minutes, seconds = map(int, time.split(':'))
    hours = hours % 24  # change time format to 0-24 hours
    time = str(hours) + ':' + str(minutes) + ':' + str(seconds)
    time = datetime.strptime(time,"%H:%M:%S").time()
    # time_delta = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return time

stoptimes_df['arrival_time'] = stoptimes_df['arrival_time'].apply(convert_to_datetime)
stoptimes_df['departure_time'] = stoptimes_df['departure_time'].apply(convert_to_datetime)
stoptimes_df

# tried first with only one column to calculate
stoptimes_df['time_btw_stops'] = stoptimes_df.groupby('trip_id')['arrival_time'].diff()
stoptimes_df

This leads to the following error:

TypeError: unsupported operand type(s) for -: 'datetime.time' and 'datetime.time'

wjandrea
  • 28,235
  • 9
  • 60
  • 81
leolumpy
  • 63
  • 6
  • 1
    [Don't post pictures of code/data](//meta.stackoverflow.com/q/285551/4518341). Instead, copy the text itself and use the formatting tools like [code formatting](/editing-help#code). You'll also need to make a [mre], meaning some example input and desired output and the code you tried that didn't work. You might want to also include the changed times (27h to 03h). For specifics, see [How to make good reproducible pandas examples](/q/20109391/4518341). If you want more tips, check out [ask]. – wjandrea Jul 14 '23 at 17:48

2 Answers2

2

Since you are missing date and your time go over 24-periods which I cannot use pd.Timestamp, I would consider using a simple equation.

df = pd.DataFrame([['12:10:00'], 
                   ['12:20:00'], 
                   ['12:23:00'], 
                   ['12:29:00'], 
                   ['27:11:00']
                  ], columns=['arrival_time'])

# Convert the time string into seconds.
df['sec'] = df.arrival_time.str[0:2].astype('int') * 60 * 60  # hour
            + df.arrival_time.str[3:5].astype('int') * 60  # minute
            + df.arrival_time.str[6:8].astype('int'). # second

# Then you can do diff to calculate time diff.

# seconds diff
df.sec.diff()

# minutes diff
df.sec.diff() // 60 
Emma
  • 8,518
  • 1
  • 18
  • 35
1

Timedelta seems to be the right tool for the job here. If you don't want to see the "0 days" in the output, use styling.

Calculation

# Select only the relevant columns.
df = stoptimes_df[['trip_id', 'arrival_time']].copy()

df['arrival_time'] = pd.to_timedelta(df['arrival_time'])

df['time_btw_stops'] = df.groupby('trip_id')['arrival_time'].diff()
df
  trip_id    arrival_time  time_btw_stops
0       1 0 days 12:10:00             NaT
1       1 0 days 12:20:00 0 days 00:10:00
2       1 0 days 12:30:00 0 days 00:10:00
3       2 1 days 03:32:00             NaT
4       2 1 days 03:39:00 0 days 00:07:00
5       2 1 days 03:45:00 0 days 00:06:00

Formatting

# I put some assertions as a sanity check
def format_timedelta_hms(td):
    if pd.isna(td):
        return '-'
    # assert td.days == 0, td.days
    comp = td.components
    # assert all(getattr(comp, attr) == 0 for attr in ['milliseconds', 'microseconds', 'nanoseconds']), comp
    return f"{comp.hours:>02}:{comp.minutes:>02}:{comp.seconds:>02}"

formatted = df[['time_btw_stops']].style.format(
    formatter={'time_btw_stops': format_timedelta_hms})

print(formatted.to_string())
 time_btw_stops
0 -
1 00:10:00
2 00:10:00
3 -
4 00:07:00
5 00:06:00

Docs:

wjandrea
  • 28,235
  • 9
  • 60
  • 81