2

I have the following sample data stored in Excel file

CLAIM CODE1 AGE DATE
7538 359 71 28/11/2019
7538 359 71 28/11/2019
540 428 73 16/10/2019
540 428 73 16/10/2019
605 1670 40 04/12/2019
740 134 55 24/12/2019

When importing to my Jupyter Notebook using the pandas.read_excel API, the date field does not get properly formated:

excel = pd.read_excel('Libro.xlsx')

enter image description here

Then I am getting the DATE field different as I have it formatted in the excel file. What argument should I apply to read_excel in order to display the DATE column formatted as I have it in the excel file?

.info() method, outputs the column as int64

enter image description here

I already tried using the pd.to_datetime function, but I am getting strange results:

enter image description here

Find the sample excel file I am using for my project in the following link sample_raw_data

Here is some code that can be used to reproduce the DataFrame that is read in from excel:

excel = pd.DataFrame({
    'CLAIM': {0: 7538, 1: 7538, 2: 540, 3: 540, 4: 4605, 5: 1740, 6: 7605},
    'CODE1': {0: 359, 1: 359, 2: 428, 3: 428, 4: 1670, 5: 134, 6: 415},
    'AGE': {0: 71, 1: 71, 2: 73, 3: 73, 4: 40, 5: 55, 6: 56},
    'DATE': {0: 43797, 1: 43797, 2: 43754, 3: 43754, 4: 43803, 5: 43823,
             6: 43818}
})
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
superunknown
  • 85
  • 1
  • 4
  • 12
  • You need to use the `date_parser` argument in `pd.read_excel` – Erfan Jul 26 '21 at 16:09
  • `date_parser` won't work since pandas will recognise `DATE` as an int and as such not apply the date parser. And trying to force the dtype of Date as `datetime64[ns]` will get the same error with everything being a 1970-01-01 date. @Erfan – Henry Ecker Jul 26 '21 at 16:23

1 Answers1

10

To convert this Excel Date into datetime64[ns] use to_datetime to get unit in days with offset from origin '1899-12-30':

excel = pd.read_excel('Libro.xlsx')
excel['DATE'] = pd.to_datetime(excel['DATE'], unit='d', origin='1899-12-30')

excel:

   CLAIM  CODE1  AGE       DATE
0   7538    359   71 2019-11-28
1   7538    359   71 2019-11-28
2    540    428   73 2019-10-16
3    540    428   73 2019-10-16
4   4605   1670   40 2019-12-04
5   1740    134   55 2019-12-24
6   7605    415   56 2019-12-19

info:

 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   CLAIM   7 non-null      int64         
 1   CODE1   7 non-null      int64         
 2   AGE     7 non-null      int64         
 3   DATE    7 non-null      datetime64[ns]

See Why is 1899-12-30 the zero date in Access / SQL Server instead of 12/31? for more information about why this is the base date.


A converter for DATE can also be used with read_excel:

excel = pd.read_excel(
    'Libro.xlsx',
    converters={
        'DATE': lambda x: pd.to_datetime(x, unit='d', origin='1899-12-30')
    }
)

info:

 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   CLAIM   7 non-null      int64         
 1   CODE1   7 non-null      int64         
 2   AGE     7 non-null      int64         
 3   DATE    7 non-null      datetime64[ns]
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57