1

I have extracted a dataframe from an excel file that has a datetime column but with a few values in the excel Date format like so:

import pandas as pd
import numpy as np
import xlrd

rnd = np.random.randint(0,1000,size=(10, 1))
test = pd.DataFrame(data=rnd,index=range(0,10),columns=['rnd'])
test['Date'] = pd.date_range(start='1/1/1979', periods=len(test), freq='D')
r1 = np.random.randint(0,5)
r2 = np.random.randint(6,10)
test.loc[r1, 'Date'] = 44305
test.loc[r2, 'Date'] = 44287
test

    rnd     Date
0   56  1979-01-01 00:00:00
1   557     1979-01-02 00:00:00
2   851     1979-01-03 00:00:00
3   553     44305
4   258     1979-01-05 00:00:00
5   946     1979-01-06 00:00:00
6   930     1979-01-07 00:00:00
7   805     1979-01-08 00:00:00
8   362     44287
9   705     1979-01-10 00:00:00

When I attempt to convert the errant dates using the xlrd.xldate_as_datetime function in isolation I get a series with the correct format:

# Identifying the index of dates in int format
idx_ints = test[test['Date'].map(lambda x: isinstance(x, int))].index

test.loc[idx_ints, 'Date'].map(lambda x: xlrd.xldate_as_datetime(x, 0))

3   2021-04-19
8   2021-04-01
Name: Date, dtype: datetime64[ns]

However when I attempt to apply the change in place I get a wildly different int:

test.loc[idx_ints,'Date'] = test.loc[idx_ints, 'Date'].map(lambda x: xlrd.xldate_as_datetime(x, 0))

test
  
    rnd     Date
0   56  1979-01-01 00:00:00
1   557     1979-01-02 00:00:00
2   851     1979-01-03 00:00:00
3   553     1618790400000000000
4   258     1979-01-05 00:00:00
5   946     1979-01-06 00:00:00
6   930     1979-01-07 00:00:00
7   805     1979-01-08 00:00:00
8   362     1617235200000000000
9   705     1979-01-10 00:00:00

Any idea, or perhaps an alternative solution to my date int conversion problem, thanks!

at8865
  • 107
  • 8
  • related: [Excel Datetime SN Conversion in Python](https://stackoverflow.com/q/65514678/10197418) – FObersteiner Jan 29 '21 at 13:30
  • Thanks it's related, but it gets me to the exact same result above when I try to update the dataframe – at8865 Jan 29 '21 at 14:59
  • I find it actually surprising that `xlrd.xldate_as_datetime` leaves timestamps in ns since the epoch (Unix time) instead of converting everything to pandas datetime – FObersteiner Jan 29 '21 at 15:19

1 Answers1

1

Reversing the logic from the answer I linked, this works fine for your test df:

# where you have numeric values, i.e. "excel datetime format":
nums = pd.to_numeric(test['Date'], errors='coerce') # timestamps will give NaN here
# now first convert the excel dates:
test.loc[nums.notna(), 'datetime'] = pd.to_datetime(nums[nums.notna()], unit='d', origin='1899-12-30')
# ...and the other, "parseable" timestamps:
test.loc[nums.isna(), 'datetime'] = pd.to_datetime(test['Date'][nums.isna()])

test
   rnd                 Date   datetime
0  840                44305 2021-04-19
1  298  1979-01-02 00:00:00 1979-01-02
2  981  1979-01-03 00:00:00 1979-01-03
3  806  1979-01-04 00:00:00 1979-01-04
4  629  1979-01-05 00:00:00 1979-01-05
5  540  1979-01-06 00:00:00 1979-01-06
6  499  1979-01-07 00:00:00 1979-01-07
7  155  1979-01-08 00:00:00 1979-01-08
8  208                44287 2021-04-01
9  737  1979-01-10 00:00:00 1979-01-10

If your input already has datetime objects instead of timestamp strings, you could skip the conversion and just transfer the values to the new column I think.

FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • Thanks for this - it works if I create a new column like you've done, otherwise if I try to update the existing column, the result is the epoch time – at8865 Jan 29 '21 at 17:23
  • @at8865, hm you should also be able to make the conversion "in place", with `test.loc[nums.notna(), 'Date']`. You might need to add another `pd.to_datetime(test.Date)` so that every item of the Series has the same dtype. – FObersteiner Jan 29 '21 at 17:28