
I have cells in Excel that are formatted as Date (see below):

Format of cells in question is Date

I couldn't get them to be read (they came in as NaN), so I used a converter to convert them with to_datetime when reading via pandas' read_excel method:

    cols_A8_J2007[i] = pd.read_excel(
        i,
        ('sheet' + str(j)),
        headers=1, skiprows=6, nrows=2000,
        usecols='A:J',
        converters={
            'Expired': lambda x: pd.to_datetime(x, errors='coerce'),
            'Valid Until': lambda x: pd.to_datetime(x, errors='coerce')})

And this resulted in all of them being loaded as NaT.

So, after consulting the documentation I tried it this way:

    cols_A8_J2007[i] = pd.read_excel(i, ('sheet'+str(j)), headers = 1, parse_dates=True, skiprows = 6, nrows=2000, usecols = 'A:J' )

Which resulted in NaN again.

Finally, I tried it like this and got NaN once more:

    cols_A8_J2007[i] = pd.read_excel(i, ('sheet'+str(j)), headers = 1, parse_dates=True, date_parser=lambda x: pd.to_datetime(x, errors='coerce'), skiprows = 6, nrows=2000, usecols = 'A:J' )

The above did not work, because it tries to parse based on the index (see comment below).
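The comment's distinction can be sketched without a workbook, since read_csv shares read_excel's parse_dates semantics (the CSV data below is made up for illustration):

```python
import io
import pandas as pd

csv = "Expired,Valid Until\n2019-01-31,2019-12-31\n"

# parse_dates=True only attempts the *index*, so the columns stay object
df_true = pd.read_csv(io.StringIO(csv), parse_dates=True)
# naming the columns parses them into datetime64[ns]
df_cols = pd.read_csv(io.StringIO(csv), parse_dates=['Expired', 'Valid Until'])

print(df_true['Expired'].dtype)   # object
print(df_cols['Expired'].dtype)   # datetime64[ns]
```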

    cols_A8_J2007[i] = pd.read_excel(i, ('sheet'+str(j)), headers = 1, parse_dates=['Expired', 'Valid Until'], skiprows = 6, nrows=2000, usecols = 'A:J' )
    cols_A8_J2007[i] = pd.read_excel(i, ('sheet'+str(j)), headers = 1, parse_dates=['Expired', 'Valid Until'], date_parser=lambda x: pd.to_datetime(x, errors='coerce'), skiprows = 6, nrows=2000, usecols = 'A:J' )

Both of these resulted in NaT (Not a Time).

What else do I have to do to read a date? I realize there is no time attached, but given the way Excel stores dates and times it shouldn't matter: the date is the integer part of the serial number and the time is the decimal part.
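For reference, Excel's date system counts days from 1899-12-30 (the offset that absorbs Excel's fictitious 1900-02-29), with the fraction holding the time of day, and pandas can decode such a serial directly; the serial value below is made up for illustration:

```python
import pandas as pd

# Excel serial number: integer part = date, fractional part = time of day
serial = 43749.5  # made-up example: noon on 2019-10-11

# decode it against Excel's 1900 date system epoch
ts = pd.to_datetime(serial, unit='D', origin='1899-12-30')
print(ts)  # 2019-10-11 12:00:00
```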

    for i in glob.iglob(str(xls_folder) + '\somesheets*.xlsx'):
        cols_A8_J2007[i] = pd.read_excel(
            i, ('sheet' + str(j)), headers=1, skiprows=6, nrows=2000, usecols='A:J',
            converters={'Expired': lambda x: pd.to_datetime(x, errors='coerce'),
                        'Valid Until': lambda x: pd.to_datetime(x, errors='coerce')})

    for w in cols_A8_J2007:
        print(cols_A8_J2007[w].dtypes)

    Type                      object
    Currency                  object
    Initial Credit           float64
    Credits                  float64
    Debits                   float64
    Balance                  float64
    Reserved                   int64
    Valid Until       datetime64[ns] <- These I believe are what you are looking for.
    Expired           datetime64[ns] <- These I believe are what you are looking for.
    dtype: object

Also, in case it helps, here are my versions:

pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.8
numpy: 1.16.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
    `parse_dates=True` parses the index. `parse_dates=['col1','col2']` parses the listed columns. – Quang Hoang Oct 15 '19 at 16:50
  • @QuangHoang That gave me `NaT` again and I tried it both with / without `date_parser=lambda x: pd.to_datetime(x, errors='coerce')`. – leeand00 Oct 15 '19 at 16:55
  • You could always read them in as string and then convert to proper format but this may be unreasonable for you.. – d_kennetz Oct 15 '19 at 17:06
  • @leeand00 when you changed the format - did you re-save the excel file? – Jon Clements Oct 15 '19 at 17:10
  • @JonClements yes; there are 7 of them. – leeand00 Oct 15 '19 at 17:11
  • @leeand00 okay... so what if we remove it to bare minimum of just `df = pd.read_excel('your work book.xlsx', nrows=2000)` and then check what `df.dtypes` has? – Jon Clements Oct 15 '19 at 17:16
  • @JonClements I added the dtypes above but the table data is a bit lower down than like say a csv file would have, so that needs a bit of help there, that's why I was loading it like I was. – leeand00 Oct 15 '19 at 20:27
  • @JonClements I added the types I get again. This time there are actual types. – leeand00 Oct 17 '19 at 18:31

1 Answer


Determined the issue to be that I needed to check whether a value was null using pd.isnull. The file simply had so many null values before and after the real dates that I couldn't see them in the result set. Found the answer here: how to test if a variable is pd.NaT?

    me_df = pd.read_excel(
        i, ('iCareAcctListing' + str(j)), headers=1, skiprows=6, nrows=2000, usecols='A:J',
        converters={'Expired': lambda x: pd.to_datetime(x, errors='coerce'),
                    'Valid Until': lambda x: pd.to_datetime(x, errors='coerce')})

    # Ran this and ended up with just the dates that were
    # filled in with actual values.
    #
    # There were so many nulls before and after that I couldn't see any of them in the dataset!
    me_df[pd.isnull(me_df['Valid Until']) != True]
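The same filter reads a little more idiomatically with .notna(); a quick sketch on made-up data:

```python
import pandas as pd

# errors='coerce' turns the missing entry into NaT, as in the question
df = pd.DataFrame({'Valid Until': pd.to_datetime(
    ['2020-01-01', None, '2021-06-30'], errors='coerce')})

print(pd.isnull(pd.NaT))              # True — isnull covers NaT as well as NaN
print(df[df['Valid Until'].notna()])  # keeps only the rows with a real date
```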