2

I have an xlsx that includes a column of durations that are in the format HH:MM:SS. I think it would be best to treat this column as a timedelta data type but I cannot get read_excel to do that.

To replicate this issue copy the below into an xlsx:

    Duration
0   00:14:30
1   01:05:00
2   00:05:02
3   26:53:29
4   36:28:27
5   85:29:34

Then read the xlsx using similar to the below:

df = pd.read_excel('../data/test.xlsx', engine='openpyxl', index_col=0)

You should end up with a dataframe like this, with a mix of time and datetime objects:

    Duration
0   00:14:30
1   01:05:00
2   00:05:02
3   1900-01-01 02:53:29
4   1900-01-01 12:28:27
5   1900-01-03 13:29:34

I have tried the read_excel options such as dtype and converters. None seem to have any effect.

The only way I have found to get this data in a timedelta format is to convert everything to a datetime by adding the time values to 1900-01-01 00:00:00 using the below.

def clean_durations(s):
        ss = s.copy()
        for i,value in s.items():
            try:
                ss[i] = datetime.combine(date(1900, 1, 1), value)
            except TypeError as e:
                ss[i] = value + timedelta(days=1) # add an extra day for durations interpreted as datetime
        
        ss = pd.to_datetime(ss) - pd.to_datetime('1900-01-01 00:00:00')
        return ss

df = pd.read_excel('../data/test.xlsx', engine='openpyxl', index_col=0, parse_dates=True)

df['Duration'] = clean_durations(df['Duration'])

Resulting in the desired:

    Duration
0   0 days 00:14:30
1   0 days 01:05:00
2   0 days 00:05:02
3   1 days 02:53:29
4   1 days 12:28:27
5   3 days 13:29:34

This feels like I am building something that should already be part of pandas.

Can anyone provide a way to achieve the same thing directly with pandas read_excel or similar?

BenCaldwell
  • 323
  • 1
  • 15

1 Answers1

3

What matters is how the data is stored and saved as in your excel file. I would recommend storing duration information as a "Text" data type in excel.

df = pd.read_excel(file)
t = df['Duration'].str.split(':')
df['Duration'] = pd.to_timedelta((t.str[0].astype(int) * 3600
                                + t.str[1].astype(int) * 60
                                + t.str[2].astype(int)), unit='sec')
df

enter image description here

You can also store it as "Time", but it's risky as the moment you click into a cell... wonderful excel changes the underlying value to a float, but if you keep in "Text" format, then you don't have to worry about that.

enter image description here

David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • I'll mark this as the answer as it would fix the problem if I could easily change the way the xlsx is created. It doesn't seem trivial to convert the type in excel after the fact though. Changing the type changes the text in the cell! I will have to keep using my cleaning function. – BenCaldwell Jan 05 '21 at 04:37
  • 1
    @BenCaldwell I also included what I use to convert to the desired output. – David Erickson Jan 05 '21 at 04:38
  • @BenCaldwell True. Then, you have to ask your self what is your source to get the data into excel? When the Excel file originally gets generated, that's when you would want to ensure it gets saved as "Text" as that is the source of the issue. – David Erickson Jan 05 '21 at 04:40