8

I have a dataframe df and its first column is timedelta64

df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686 entries, 0 to 685
Data columns (total 6 columns):
0    686 non-null timedelta64[ns]
1    686 non-null object
2    686 non-null object
3    686 non-null object
4    686 non-null object
5    686 non-null object

If I print(df[0][2]), for example, it will give me 0 days 05:01:11. However, I don't want the 0 days field. I only want 05:01:11 to be printed. Could someone teach me how to do this? Thanks so much!

FObersteiner
Chenrui Su

7 Answers

11

It is possible by:

df['duration1'] = df['duration'].astype(str).str[-18:-10]

But this solution is not general: if the input is 3 days 05:01:11, it removes the 3 days too.

So it works correctly only for timedeltas shorter than one day.

A more general solution is to create a custom format:

import numpy as np
import pandas as pd

N = 10
np.random.seed(11230)
rng = pd.date_range('2017-04-03 15:30:00', periods=N, freq='13.5H')
df = pd.DataFrame({'duration': np.abs(np.random.choice(rng, size=N) -
                                 np.random.choice(rng, size=N)) })

df['duration1'] = df['duration'].astype(str).str[-18:-10]

def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds)) 

df['duration2'] = df['duration'].apply(f)
print (df)

         duration duration1  duration2
0 2 days 06:00:00  06:00:00   54:00:00
1 2 days 19:30:00  19:30:00   67:30:00
2 1 days 03:00:00  03:00:00   27:00:00
3 0 days 00:00:00  00:00:00    0:00:00
4 4 days 12:00:00  12:00:00  108:00:00
5 1 days 03:00:00  03:00:00   27:00:00
6 0 days 13:30:00  13:30:00   13:30:00
7 1 days 16:30:00  16:30:00   40:30:00
8 0 days 00:00:00  00:00:00    0:00:00
9 1 days 16:30:00  16:30:00   40:30:00
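The function f above assumes every value is a valid, non-negative timedelta. A minimal sketch of a variant that also tolerates missing values (NaT) and negative durations could look like this; the name format_total_hms and the sample data are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

def format_total_hms(td):
    """Format a Timedelta as H:MM:SS, folding whole days into the hours."""
    if pd.isnull(td):
        return ''  # assumption: render missing values as an empty string
    total = int(td.total_seconds())
    sign = '-' if total < 0 else ''
    hours, remainder = divmod(abs(total), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f'{sign}{hours}:{minutes:02d}:{seconds:02d}'

# Hypothetical sample data: under a day, over a day, and missing.
durations = pd.Series([pd.Timedelta('0 days 05:01:11'),
                       pd.Timedelta('3 days 05:01:11'),
                       pd.NaT])
formatted = durations.apply(format_total_hms)
print(formatted.tolist())
```

Taking the absolute value before divmod keeps the minute and second parts from going negative, which plain divmod on a negative total would not.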
jezrael
7

Here's a short and robust version using apply():

df['timediff_string'] = df['timediff'].apply(
    lambda x: f'{x.components.hours:02d}:{x.components.minutes:02d}:{x.components.seconds:02d}'
              if not pd.isnull(x) else ''
)

This leverages the components attribute of pandas Timedelta objects and also handles empty values (NaT).

If the timediff column does not contain pandas Timedelta objects, you can convert it:

df['timediff'] = pd.to_timedelta(df['timediff'])
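One caveat worth checking against your data: components.hours holds only the hour part within the current day, so the lambda above drops whole days. The sketch below, using a made-up sample value, shows the difference and one possible way to fold days into the hour field.

```python
import pandas as pd

td = pd.Timedelta('3 days 05:01:11')  # hypothetical sample value
c = td.components

# Hour, minute, second parts only: the 3 days are dropped.
hms_only = f'{c.hours:02d}:{c.minutes:02d}:{c.seconds:02d}'

# Fold the days into the hour field to keep the full duration.
total_hours = c.days * 24 + c.hours
with_days = f'{total_hours:02d}:{c.minutes:02d}:{c.seconds:02d}'

print(hms_only, with_days)
```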
Simon G.
3

datetime.timedelta already formats the way you'd like. The crux of this issue is that Pandas internally converts to numpy.timedelta64.

import pandas as pd
from datetime import timedelta

time_1 = timedelta(days=3, seconds=3400)
time_2 = timedelta(days=0, seconds=3400)
print(time_1)
print(time_2)

times = pd.Series([time_1, time_2])

# Times are converted to Numpy timedeltas.
print(times)

# Convert to string after converting to datetime.timedelta.
times = times.apply(
    lambda numpy_td: str(timedelta(seconds=numpy_td.total_seconds())))

print(times)

So, convert to a datetime.timedelta and then str (to prevent conversion back to numpy.timedelta) before printing.

3 days, 0:56:40
0:56:40

0   3 days 00:56:40
1   0 days 00:56:40
dtype: timedelta64[ns]

0    3 days, 0:56:40
1            0:56:40
dtype: object

I came here looking for answers to the same question, so I felt I should add further clarification. : )

jayreed1
2

You can convert it into a Python timedelta, then to str and finally back to a Series:

pd.Series(df["duration"].dt.to_pytimedelta().astype(str), name="start_time")
bryant1410
  • I tried this, but I get AttributeError: 'Timedelta' object has no attribute 'dt'. Is this answer up to date? – niid Feb 01 '22 at 08:28
  • Seems like your `df["duration"]` object is not a `pd.Series` but a `Timedelta`? If it's a `pd.Series`, it should have `dt`: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.to_pytimedelta.html – bryant1410 Feb 01 '22 at 15:00
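To illustrate the distinction raised in the comments, here is a rough sketch assuming a small made-up frame: the .dt accessor exists only on a Series, while a single pd.Timedelta offers to_pytimedelta() directly. Passing index= is an addition here so that the result keeps the original row labels.

```python
import pandas as pd

# Hypothetical frame with a non-default index.
df = pd.DataFrame(
    {'duration': pd.to_timedelta(['0 days 05:01:11', '3 days 05:01:11'])},
    index=['a', 'b'],
)

# Series: go through .dt, then stringify each datetime.timedelta.
as_text = pd.Series(
    [str(td) for td in df['duration'].dt.to_pytimedelta()],
    index=df.index,
    name='start_time',
)

# Scalar: a single pd.Timedelta has no .dt, but it has to_pytimedelta().
single = str(df['duration'].iloc[0].to_pytimedelta())

print(as_text)
print(single)
```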
1

Given OP is ok with an object column (a little verbose):

def splitter(td):
    # Convert to a string, split on spaces, and keep the last piece (hh:mm:ss).
    return str(td).split(' ')[-1]


df['split'] = df['timediff'].apply(splitter)

Basically we're taking the timedelta column, transforming the contents to a string, then splitting the string (creates a list) and taking the last item of that list, which would be the hh:mm:ss component.

Note that specifying ' ' for what to split by is redundant here.

Alternative one liner:

df['split2'] = df['timediff'].astype('str').str.split().str[-1]

which is very similar, but not very pretty IMHO. Also, the output includes milliseconds, which is not the case in the first solution. I'm not sure what the reason for that is (please comment if you do). If your data is big it might be worthwhile to time these different approaches.
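Since the string form of a timedelta column may or may not carry a fractional-seconds part (this appears to depend on the data and the pandas version), one hedged option is to cut everything after the first dot. The column name split3 and the sample values below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; the second value has fractional seconds.
df = pd.DataFrame({'timediff': pd.to_timedelta(
    ['0 days 05:01:11', '3 days 05:01:11.500'])})

text = df['timediff'].astype(str)

# Last whitespace-separated token is the clock part; drop any '.ffffff' tail.
df['split3'] = text.str.split().str[-1].str.split('.').str[0]

print(df)
```

Like the two versions above, this discards the day count, so it suits durations under one day.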

Optimesh
1

If you can be sure that your timedelta is less than a day, this might work. To do this in as few lines as possible, I convert the timedelta to a datetime by adding the Unix epoch (0) and then use the resulting datetime column's dt.strftime to format it.

df['duration1'] = (df['duration'] + pd.to_datetime(0)).dt.strftime('%M:%S')
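The format string '%M:%S' prints only minutes and seconds. To get the 05:01:11 style asked for in the question, the same trick can be used with '%H:%M:%S'. A small sketch, assuming made-up durations under 24 hours and a hypothetical column name duration_hms:

```python
import pandas as pd

# Hypothetical durations, all shorter than one day.
df = pd.DataFrame({'duration': pd.to_timedelta(
    ['0 days 05:01:11', '0 days 13:30:00'])})

epoch = pd.Timestamp(0)  # 1970-01-01 00:00:00

# timedelta + timestamp gives a datetime column, which supports strftime.
df['duration_hms'] = (df['duration'] + epoch).dt.strftime('%H:%M:%S')

print(df)
```

For durations of a day or more, the hour field wraps around and the day count is lost.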
datu-puti
0

If you want to remove all zero-valued components (not only days), you can do it like this:


def pd_td_fmt(td):
    import pandas as pd
    abbr = {'days': 'd', 'hours': 'h', 'minutes': 'min', 'seconds': 's', 'milliseconds': 'ms', 'microseconds': 'us',
        'nanoseconds': 'ns'}

    fmt = lambda td:"".join(f"{v}{abbr[k]}" for k, v in td.components._asdict().items() if v != 0)
    if isinstance(td, pd.Timedelta):
        return fmt(td)
    elif isinstance(td,pd.TimedeltaIndex):
        return td.map(fmt)
    else:
        raise ValueError
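A short usage sketch of the same idea, rewritten as a self-contained helper so it can be run on its own; the name compact and the sample values are assumptions for illustration.

```python
import pandas as pd

ABBR = {'days': 'd', 'hours': 'h', 'minutes': 'min', 'seconds': 's',
        'milliseconds': 'ms', 'microseconds': 'us', 'nanoseconds': 'ns'}

def compact(td):
    """Join the non-zero components of a single Timedelta."""
    parts = td.components._asdict()
    return ''.join(f'{v}{ABBR[k]}' for k, v in parts.items() if v != 0)

one = compact(pd.Timedelta('0 days 05:01:11'))
many = pd.to_timedelta(['0 days 05:01:11', '3 days 00:00:30']).map(compact)
zero = compact(pd.Timedelta(0))

print(one, list(many), repr(zero))
```

Note that a zero timedelta has no non-zero components, so it comes out as an empty string.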
LudvigH