12

Python 2.7.10
Tried pandas 0.17.1 -- function read_excel
Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()

When using pandas in Python is it possible to read excel files (formats: xls, xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime or timestamp types?

If this is not possible using pandas can someone suggest an alternate method/library to read xls, xlsx files and leave date column values as strings?

For the pandas solution attempts the df.info() and resulting date column types are shown below:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 117 entries, 0 to 116
Columns: 176 entries, Mine to Index
dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
memory usage: 161.8+ KB
>>> type(df['Start Date'][0])
Out[6]: pandas.tslib.Timestamp
>>> type(df['End Date'][0])
Out[7]: pandas.tslib.Timestamp

Attempt/Approach 1:

def read_as_dataframe(filename, ext):
   import pandas as pd
   if ext in ('xls', 'xlsx'):
      # problem: date columns auto converted to datetime.datetime or timestamp!
      df = pd.read_excel(filename) # unwanted - date columns converted!

   return df, name, ext

Attempt/Approach 2:

import pandas as pd
# import datetime as datetime
# parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
parse_date = lambda x: x
elif ext in ('xls', 'xlsx', ):
    df = pd.read_excel(filename, parse_dates=False)
    date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
    # problem: date columns auto converted to datetime.datetime or timestamp!
    df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)

And have also tried pyexcel library but it does the same auto-magic convert behavior:

Attempt/Approach 3:

import pyexcel as pe
import pyexcel.ext.xls
import pyexcel.ext.xlsx

t0 = time.time()
if ext == 'xlsx':
    records = pe.get_records(file_name=filename)
    for record in records:
        print("start date = %s (type=%s), end date = %s (type=%s)" %
              (record['Start Date'],
               str(type(record['Start Date'])),
               record['End Date'],
               str(type(record['End Date'])))
              )
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
MattB
  • 121
  • 1
  • 1
  • 5
  • I don't understand your question, if you don't pass `date_cols` or a `date_parser` then it won't try to parse the date strings – EdChum Dec 08 '15 at 13:15
  • 1
    As you can see from my code snippets I am not passing date_cols or date_parser for first style approach above and pandas is auto-magically (and silently) converting the columns to **datetime** values. I need to find a way to stop that and leave any column containing a date value as a string. – MattB Dec 08 '15 at 13:19
  • Are you sure, please post `df.info()` after running your code – EdChum Dec 08 '15 at 13:25
  • @EdChum have edited question with df.info() and type result for the two auto-converted date columns. – MattB Dec 08 '15 at 13:43
  • I don't know about `read_excel` but `read_csv` should not parse and convert to `datetime` dtype, can you check if it's failing just for excel or both excel and csv's – EdChum Dec 08 '15 at 13:46
  • have you tried to use [pandas.DataFrame.from_csv](http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.from_csv.html) instead of read_cv ? If not, try it out... – alec_djinn Dec 08 '15 at 13:47
  • @EdChum For read_csv it does leave as str :) .... so now to figure out how to do for xls|xlsx types using `read_excel(....)` `>>> type(df['End Date'][0]) Out[9]: str >>> type(df['Start Date'][0]) Out[10]: str` – MattB Dec 08 '15 at 13:51
  • I think `read_excel` uses external modules to load the sheets so these probably get the `dtypes` from the excel sheets which is why you can't override it, you'd have to do this as a post processing step or use csv only – EdChum Dec 08 '15 at 13:57
  • 1
    Thanks @EdChum will explore csv approach further. The spreadsheets themselves are out of my control (format + content) so I need to work with formats provided to me and I am trying to develop a robust custom method for trying different date formats (time zones) and presenting user with best fit. – MattB Dec 08 '15 at 14:06
  • 2
    I think @EdChum is correct. The type you get depends on how the data is typed in the excel file itself. If the column is formatted as datetime data there, you will get datetime data from `read_excel`. – joris Dec 08 '15 at 14:31
  • Thanks @joris you and EdChum are correct. – MattB Dec 09 '15 at 10:34

4 Answers4

7

I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:

def undate(x):
    if pd.isnull(x):
        return x
    try:
        return x.strftime('%d/%m/%Y')
    except AttributeError:
        return x
    except Exception:
        raise

for i in list_of_possible_date_columns:
    df[i] = df[i].apply(undate)
Nolan Conaway
  • 2,639
  • 1
  • 26
  • 42
6

I tried saving the file in a CSV UTF-8 format (manually) and used pd.read_csv() and worked fine.

I tried a bunch of things to figure the same thing with read_excel. Did not work anything for me. So, I am guessing read_excel is probably updating your string in a datetime object which you can not control.

Sriram Veturi
  • 206
  • 3
  • 5
  • 1
    I get why he didn't accept this as the answer but this works for me and is really easy. I don't think there's another method that pandas allows which is weird to me – Amon Apr 02 '20 at 23:11
5
  • Using converters{'Date': str} option inside the pandas.read_excel which helps. pandas.read_excel(xlsx, sheet, converters={'Date': str})
  • you can try convert your timestamp back to the original format
    df['Date'][0].strftime('%Y/%m/%d')
YDD9
  • 135
  • 2
  • 10
0

I have had the same issue when extracting Dates from excel. My columns had the MM/DD/YYYY format, but reading it in Pyhton and converting to CSV the format was then converted to MM/DD/YYYY 00:00:00. Fortunately figured out a solution. Using

excel = pd.read_excel(file, dtype=object)

Instead of

excel = pd.read_excel(file, dtype=str)

Works.

I have no idea why and would appreciate some experienced python programmer to explain.

Oliveira
  • 99
  • 7
  • 1
    to make this a useful answer, could you please add some more details? for which specific input (excel file) does this give what desired output? – FObersteiner Feb 03 '22 at 12:03