Pandas problem with a column with mixed time and date time

Question

I have a column that came from Excel, that is supposed to contain durations (in hours) - example: 02:00:00 -
It works well if all this durations are less than 24:00 but if one is more than that, it appears in pandas as 1900-01-03 08:00:00 (so datetime) as a result the datatype is dtype('O').

df = pd.DataFrame({'duration':[datetime.time(2, 0), datetime.time(2, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0),
       datetime.datetime(1900, 1, 3, 8, 0), datetime.time(1, 0),
       datetime.time(1, 0)]})

# Output
    duration
0   02:00:00
1   02:00:00
2   1900-01-03 08:00:00
3   1900-01-03 08:00:00
4   1900-01-03 08:00:00
5   1900-01-03 08:00:00
6   1900-01-03 08:00:00
7   1900-01-03 08:00:00
8   01:00:00
9   01:00:00

But if I try to convert to either time or datetime I always get an error.

TypeError: <class 'datetime.time'> is not convertible to datetime

Today if I don't fix this, all the duration greater than 24:00 are gone.

Could you edit the question to include the code which produces that type error? — Kraigolas, Jan 26 '22 at 03:32
If you try to get it to a datetime, maybe try to extract hour from it. https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.hour.html — Emily Kuo, Jan 26 '22 at 03:33
It would be very helpful if you provide your sample data, code your tried, and expected output. — Park, Jan 26 '22 at 03:36
just for the record, here's three related posts on SO, [#1](https://stackoverflow.com/q/35579042/10197418), [#2](https://stackoverflow.com/q/63067926/10197418), [#3](https://stackoverflow.com/q/34156830/10197418) - although the questions are not exactly the same, I think they go back to the same root. Imho, Python should not make the same mistake as Excel and think too much for the user ?! - just leave date/times as string if not specified otherwise. — FObersteiner, Jan 26 '22 at 11:02

FObersteiner · Accepted Answer · 2022-01-27T07:57:42.417

Your problem lies in the engine that reads the Excel file. It converts cells that have a certain format (e.g. [h]:mm:ss or hh:mm:ss) to datetime.datetime or datetime.time objects. Those then get transferred into the pandas DataFrame, so it's not actually a pandas problem.

Before you start hacking the excel reader engine, it might be easier to tackle the issue in Excel. Here's a small sample file;

You can download it here. duration is auto-formatted by Excel, duration_text is what you get if you set the column format to 'text' before you enter the values, duration_to_text is what you get if you change the format to text after Excel auto-formatted the values (first column).

Now you have everything you need after import with pandas:

df = pd.read_excel('path_to_file')

df
              duration duration_text  duration_to_text
0             12:30:00      12:30:00          0.520833
1  1900-01-01 00:30:00      24:30:00          1.020833

# now you can parse to timedelta:
pd.to_timedelta(df['duration_text'], errors='coerce')
0   0 days 12:30:00
1   1 days 00:30:00
Name: duration_text, dtype: timedelta64[ns]

# or
pd.to_timedelta(df['duration_to_text'], unit='d', errors='coerce') 
0   0 days 12:29:59.999971200                     # note the precision issue ;-)
1   1 days 00:29:59.999971200
Name: duration_to_text, dtype: timedelta64[ns]

Another viable option could be to save the Excel file as a csv and import that to a pandas DataFrame. The sample xlsx used above would then look like this for example.

If you have no other option than to re-process in pandas, an option could be to treat datetime.time objects and datetime.datetime objects specifically, e.g.

import datetime

# where you have datetime (incorrect from excel)
m = [isinstance(i, datetime.datetime) for i in df.duration]

# convert to timedelta where it's possible
df['timedelta'] = pd.to_timedelta(df['duration'].astype(str), errors='coerce')

# where you have datetime, some special treatment is needed...
df.loc[m, 'timedelta'] = df.loc[m, 'duration'].apply(lambda t: pd.Timestamp(str(t)) - pd.Timestamp('1899-12-31'))

df['timedelta'] 
0   0 days 12:30:00
1   1 days 00:30:00
Name: timedelta, dtype: timedelta64[ns]

Yes that's true that the problem comes from the way the excel data is interpreted. Unfortunately, I can't make changes in the excel. There must be a way in Pandas to deal with this 2 different format in the same column, no ? — didier severac, Jan 26 '22 at 09:14
@didierseverac yes you *can* deal with it, but it's painful imho ;-) I'll add an option... — FObersteiner, Jan 26 '22 at 09:23
the loc gives me this error "ValueError: Must have equal len keys and value when setting with an iterable" — didier severac, Jan 27 '22 at 03:25
@didierseverac I've updated my answer to correctly handle NaT (can't use generator expression to derive `m`). — FObersteiner, Jan 27 '22 at 07:58
@didierseverac glad your issue is solved! I still wonder if this problem can be grabbed by the root, the excel reader engine ^^ — FObersteiner, Jan 27 '22 at 08:09
Ideally, it would be best if when reading Excel, if it was possible to specify the datatype for each column (or even better, to force datatype only for some specifics ) — didier severac, Jan 27 '22 at 08:21
@didierseverac I've added some thoughts [here](https://stackoverflow.com/a/70884740/10197418). In principle, I don't think you can specify the type you want before import, because you would also have to specify the conversion from *any type* to the type you want. — FObersteiner, Jan 27 '22 at 19:42

Corralien · Answer 2 · 2022-01-26T09:30:20.273

1

IIUC, use pd.to_timedelta:

Setup a MRE:

df = pd.DataFrame({'duration': ['43:24:57', '22:12:52', '-', '78:41:33']})
print(df)

# Output
   duration
0  43:24:57
1  22:12:52
2         -
3  78:41:33

df['duration'] = pd.to_timedelta(df['duration'], errors='coerce')
print(df)

# Output
         duration
0 1 days 19:24:57
1 0 days 22:12:52
2             NaT
3 3 days 06:41:33

Update

@MrFuppes Excel file is exactly what I have in my column 'duration'

Try:

df['duration'] = np.where(df['duration'].apply(len) == 8,
                          '1899-12-31 ' + df['duration'], df['duration'])
df['duration'] = pd.to_datetime(df['duration'], errors='coerce') \
                     - pd.Timestamp('1899-12-31')
print(df)

# Output (with a slightly modified example of @MrFuppes)
         duration
0 0 days 12:30:00
1 1 days 00:30:00
2             NaT

edited Jan 26 '22 at 09:30

answered Jan 26 '22 at 03:50

Corralien

109,409
8
28
52

Does it solve your problem? – Corralien Jan 26 '22 at 04:35
Although the Excel input file looks like this, I do not have homogeneous data in that column (in fact your input is what I would like to have) I have a mix of properly formated hours (below 24:00) and dates (for those where the time is greater than 24:00 If try : `code` df['duration'] = pd.to_timedelta(df['duration'], errors='coerce') on my df, I get this : #Output duration 0 NaT 1 NaT 2 NaT 3 NaT 4 NaT 5 NaT 6 NaT 7 NaT 8 NaT 9 NaT `code` – didier severac Jan 26 '22 at 08:59
Pandas does not interpret data, it reads data as in your file (which is probably not what you see to Excel). Please export your data to a csv file and upload a sample from the content of the csv. – Corralien Jan 26 '22 at 09:06
MrFuppes Excel file is exactly what I have in my column 'duration' – didier severac Jan 26 '22 at 09:16
@didierseverac. I updated my answer, can you check it please? – Corralien Jan 26 '22 at 09:31
@Corralien, It works for the dates but for the times alone it gives me "TypeError: object of type 'NaTType' has no len()" But I think that you are right, filtering by length could be the answer – didier severac Jan 27 '22 at 03:19

Pandas problem with a column with mixed time and date time

2 Answers2

Linked