0

I have an excel binary file that looks like this:

enter image description here

ird
1/1/2023
9/9/2023
-

When I read this data with pandas i get this:

from pathlib import WindowsPath
p = WindowsPath(r'C:\Users\data\book.xlsb')
df = pd.read_excel(p)
print(df)
print(df.dtypes)


     ird
0  44927
1  45178
2      -
ird    object
dtype: object

Why does the dates get converted to these strange numbers ? I cannot find the reason for this.

moth
  • 1,833
  • 12
  • 29
  • Maybe because the read_excel and/or print call does not fully process the cell metadata? I mean, it's all numbers anyway, at the end of the day:) – Martin James Aug 15 '23 at 10:38

1 Answers1

1

For me working converting to timedeltas and add 1899-12-30 for read native excel format with - values:

df = (pd.read_excel(p, 
                   parse_dates=['ird'],
                   date_parser = lambda x: pd.to_timedelta(x, unit='D', errors='coerce') + 
                                           pd.Timestamp('1899-12-30')))


print (df)
         ird
0 2023-01-01
1 2023-09-09
2        NaT
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • i tried all 3, it is not working for me – moth Aug 15 '23 at 10:51
  • not sure if `-` is affecting the parsing – moth Aug 15 '23 at 10:52
  • 1
    @moth - Is possible share sample data by gdocs, dropbox... ? – jezrael Aug 15 '23 at 10:54
  • work laptop , hard to access these domains. but essentially i wrote quickly the data in excel and saved into xlsb, only 3 rows. last row is `-` – moth Aug 15 '23 at 11:00
  • 1
    @moth - Answer was edited. – jezrael Aug 15 '23 at 11:25
  • thanks that works, but why the `D` unit ? is it because excel parses as days after a certain date ? is this date 1899-12-30 ? could you please give me some background on why this specific date. thanks – moth Aug 16 '23 at 05:27
  • 1
    @moth - Because excel datetimes use days, check [this](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas) – jezrael Aug 16 '23 at 05:31
  • thanks, I see that `date_parser` is being deprecated to `date_format` in pandas `2.0` – moth Aug 16 '23 at 07:26
  • 1
    @moth - I think then need post processing like [this](https://stackoverflow.com/questions/38454403/convert-excel-style-date-with-pandas) with `df = pd.read_excel(p)` – jezrael Aug 16 '23 at 08:30
  • treid using that : `df['ird'] = df.apply(lambda x : pd.to_timedelta(x, unit='D', errors='coerce') + pd.Timestamp('1899-12-30'))` but getting this : `pandas._libs.tslibs.np_datetime.OutOfBoundsTimedelta: Cannot convert 1107496 days to timedelta64[ns] without overflow` – moth Aug 16 '23 at 09:37
  • @moth - Use `df['ird'] = pd.to_timedelta(df['ird'], unit='D', errors='coerce') + pd.Timestamp('1899-12-30')` – jezrael Aug 16 '23 at 09:57