Pandas read_excel: parsing Excel datetime field correctly

Question

I have the following sample data stored in Excel file

CLAIM	CODE1	AGE	DATE
7538	359	71	28/11/2019
7538	359	71	28/11/2019
540	428	73	16/10/2019
540	428	73	16/10/2019
605	1670	40	04/12/2019
740	134	55	24/12/2019

When importing to my Jupyter Notebook using the pandas.read_excel API, the date field does not get properly formated:

excel = pd.read_excel('Libro.xlsx')

Then I am getting the DATE field different as I have it formatted in the excel file. What argument should I apply to read_excel in order to display the DATE column formatted as I have it in the excel file?

.info() method, outputs the column as int64

I already tried using the pd.to_datetime function, but I am getting strange results:

Find the sample excel file I am using for my project in the following link sample_raw_data

Here is some code that can be used to reproduce the DataFrame that is read in from excel:

excel = pd.DataFrame({
    'CLAIM': {0: 7538, 1: 7538, 2: 540, 3: 540, 4: 4605, 5: 1740, 6: 7605},
    'CODE1': {0: 359, 1: 359, 2: 428, 3: 428, 4: 1670, 5: 134, 6: 415},
    'AGE': {0: 71, 1: 71, 2: 73, 3: 73, 4: 40, 5: 55, 6: 56},
    'DATE': {0: 43797, 1: 43797, 2: 43754, 3: 43754, 4: 43803, 5: 43823,
             6: 43818}
})

You need to use the `date_parser` argument in `pd.read_excel` — Erfan, Jul 26 '21 at 16:09
`date_parser` won't work since pandas will recognise `DATE` as an int and as such not apply the date parser. And trying to force the dtype of Date as `datetime64[ns]` will get the same error with everything being a 1970-01-01 date. @Erfan — Henry Ecker, Jul 26 '21 at 16:23

Henry Ecker · Accepted Answer · 2021-07-27T13:39:47.510

To convert this Excel Date into datetime64[ns] use to_datetime to get unit in days with offset from origin '1899-12-30':

excel = pd.read_excel('Libro.xlsx')
excel['DATE'] = pd.to_datetime(excel['DATE'], unit='d', origin='1899-12-30')

excel:

   CLAIM  CODE1  AGE       DATE
0   7538    359   71 2019-11-28
1   7538    359   71 2019-11-28
2    540    428   73 2019-10-16
3    540    428   73 2019-10-16
4   4605   1670   40 2019-12-04
5   1740    134   55 2019-12-24
6   7605    415   56 2019-12-19

info:

 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   CLAIM   7 non-null      int64         
 1   CODE1   7 non-null      int64         
 2   AGE     7 non-null      int64         
 3   DATE    7 non-null      datetime64[ns]

See Why is 1899-12-30 the zero date in Access / SQL Server instead of 12/31? for more information about why this is the base date.

A converter for DATE can also be used with read_excel:

excel = pd.read_excel(
    'Libro.xlsx',
    converters={
        'DATE': lambda x: pd.to_datetime(x, unit='d', origin='1899-12-30')
    }
)

info:

 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   CLAIM   7 non-null      int64         
 1   CODE1   7 non-null      int64         
 2   AGE     7 non-null      int64         
 3   DATE    7 non-null      datetime64[ns]

pd.to_datetime has a keyword `origin` which you can use to set the "excel epoch", no need for a timedelta — FObersteiner, Jul 26 '21 at 16:41

Pandas read_excel: parsing Excel datetime field correctly

1 Answers1

Linked

Related