5

I know that you can specify data types when reading excels using pd.read_excel (as outlined here). Can you do the same using pd.ExcelFile?

I have the following code:

 if ".xls" in 
     xl = pd.ExcelFile(path + "\\" + name, )
     for sheet in xl.sheet_names:
         xl_parsed = xl.parse(sheet)

When parsing the sheet, some of the values in the columns are displayed in scientific notation. I don't know the column names before loading so I need to import everything as string. Ideally I would like to be able to do something like xl_parsed = xl.parse(sheet, dtype = str). Any suggestions?

michal111
  • 400
  • 4
  • 18
  • Please make sure that you refer to [the documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.ExcelFile.parse.html) before asking. The accepted answer in the question you linked shows that `read_excel` takes a `converters` parameter. So does `pd.ExcelFile.parse`. Did you try it? – roganjosh Mar 19 '18 at 22:04
  • Although I should have read the end of your question more closely as that deals with something not covered by the docs. There was an issue about it [here](https://github.com/pandas-dev/pandas/issues/5891) which got closed but doesn't seem to have been resolved in any proper way. There is one approach in there that you might be able to adapt; open the Excel in xlrd, inspect the headers to get the names, and create a dictionary comprehension to then pass as `converters`. Kinda ugly. – roganjosh Mar 19 '18 at 22:20
  • @roganjosh thanks for the input, that's what i did! I'll post it as an answer – michal111 Mar 27 '18 at 10:28
  • The solution you suggested is supported by the `parse` method although it is not mentioned explicitly on the documentation, so you can just use `xl_parsed = xl.parse(sheet, dtype=str)` – Luiscri Nov 29 '21 at 11:43

2 Answers2

3

If you would prefer a cleaner solution, I used the following:

excel = pd.ExcelFile(path)
for sheet in excel.sheet_names:
    columns = excel.parse(sheet).columns
    converters = {column: str for column in columns}

    data = excel.parse(sheet, converters=converters)
GammaGames
  • 1,617
  • 1
  • 17
  • 32
1

I went with roganjosh's suggestion - open the excel first, get column names and then pass as converter.

if ".xls" in name:
    xl = pd.ExcelFile(path)
    sheetCounter = 1
        for sheet in xl.sheet_names:
        ### Force to read as string ###
        column_list = []
        df_column = pd.read_excel(path, sheetCounter - 1).columns
        for i in df_column:
            column_list.append(i)
        converter = {col: str for col in column_list}
        ##################
        xl_parsed = xl.parse(sheet, converters=converter)
        sheetCounter = sheetCounter + 1
ggorlen
  • 44,755
  • 7
  • 76
  • 106
michal111
  • 400
  • 4
  • 18
  • Thanks for the answer. `column_list` seems pointless, you can just use `df_column` directly. You can also use `for i, sheet in enumerate(xl.sheet_names):` instead of `sheetCounter`. I assume the `for sheet in ...` should be indented -4 spaces. – ggorlen Nov 09 '21 at 18:49