I have an xlsx that includes a column of durations that are in the format HH:MM:SS
. I think it would be best to treat this column as a timedelta
data type but I cannot get read_excel
to do that.
To replicate this issue copy the below into an xlsx:
Duration
0 00:14:30
1 01:05:00
2 00:05:02
3 26:53:29
4 36:28:27
5 85:29:34
Then read the xlsx using similar to the below:
df = pd.read_excel('../data/test.xlsx', engine='openpyxl', index_col=0)
You should end up with a dataframe like this, with a mix of time and datetime objects:
Duration
0 00:14:30
1 01:05:00
2 00:05:02
3 1900-01-01 02:53:29
4 1900-01-01 12:28:27
5 1900-01-03 13:29:34
I have tried the read_excel
options such as dtype
and converters
. None seem to have any effect.
The only way I have found to get this data in a timedelta format is to convert everything to a datetime by adding the time values to 1900-01-01 00:00:00
using the below.
def clean_durations(s):
ss = s.copy()
for i,value in s.items():
try:
ss[i] = datetime.combine(date(1900, 1, 1), value)
except TypeError as e:
ss[i] = value + timedelta(days=1) # add an extra day for durations interpreted as datetime
ss = pd.to_datetime(ss) - pd.to_datetime('1900-01-01 00:00:00')
return ss
df = pd.read_excel('../data/test.xlsx', engine='openpyxl', index_col=0, parse_dates=True)
df['Duration'] = clean_durations(df['Duration'])
Resulting in the desired:
Duration
0 0 days 00:14:30
1 0 days 01:05:00
2 0 days 00:05:02
3 1 days 02:53:29
4 1 days 12:28:27
5 3 days 13:29:34
This feels like I am building something that should already be part of pandas.
Can anyone provide a way to achieve the same thing directly with pandas read_excel or similar?