Read in xlsx in Pandas with openpyxl

Question

From what I've read online, Pandas read_excel function has removed support for xlsx files but it's supposed to be easy to read them in but just using the openpyxl engine.

When I run the following I get an error that says "unexpected keyword argument synchVertical" Here's my code:

pd.read_excel( path.join(data_dir,"opto_data.xlsx"), engine = 'openpyxl' )

And here are the dependencies I have installed...

pandas-1.2.4
openpyxl-3.0.7

I just realized it might be the new version of vs-code that broke it

What is the source of your excel sheet - is it from other statistical software? I have had this error with same dependencies and Python 3.9 (both in command line and VSCode), and found that simply opening and saving the Excel sheet (without changing anything at all) removed the problem. So it looks like some kind of unusual property or minor bug in the Excel file, which gets removed when you save, and causes problems for openpyxl. I created an [issue here](https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1706) — Stuart, Jul 12 '21 at 14:26
I think you're right, I ended up doing a work around. I think my excel sheet came from matlab — financial_physician, Jul 12 '21 at 14:33

score 0 · Answer 1 · answered Jun 10 '21 at 17:51

0

Try this

X = pd.ExcelFile("filename.xlsx")
df = X.parse("sheet name here")

It works on both engine xlrd as well as openpyxl Also install xlrd for better experience

pip install xlrd

answered Jun 10 '21 at 17:51

Tanmay Shrivastava

553
4
9

Same error :/ I wonder if it's a file formatting thing... Strange issue to have – financial_physician Jun 10 '21 at 22:16
1. Try upgrading openpyxl. 2.pip install openpyxl_xlsx --upgrade. Or use engine xlrd only like this X = pd.ExcelFile("filename.xlsx", engine='xlrd') – Tanmay Shrivastava Jun 11 '21 at 02:28

Stuart · Answer 2 · 2021-07-12T20:46:43.013

An .xlsx workbook is a zipped archive of xml files. According to the API reference the xml file for a worksheet can contain a 'worksheet properties' element (<sheetPr>) with an attribute syncVertical (without an h). However opening a workbook with syncVertical in Excel causes an error, while synchVertical works fine. Other software seems to have followed Excel in creating workbooks with the 'wrong' spelling, while openpyxl only accepts syncVertical as per the specs.

Hopefully openpyxl will follow other software in accepting the misspelt attribute. In the meantime a fix is to remove the attribute. This can be done manually by opening the workbook in Excel and saving it again, which seems to remove the attribute. Alternatively, we can adapt this answer to edit the zip archive. Unfortunately it is a bit slow as it has to read and write the whole archive just to remove this one attribute. As a hacky quick solution we use a simple find/replace to take out the unwanted property. A better but slower solution would be to parse the xml files properly.

import tempfile
from zipfile import ZipFile
import shutil
import os
from fnmatch import fnmatch

def change_in_zip(file_name, name_filter, change):
    tempdir = tempfile.mkdtemp()
    try:
        tempname = os.path.join(tempdir, 'new.zip')
        with ZipFile(file_name, 'r') as r, ZipFile(tempname, 'w') as w:
            for item in r.infolist():
                data = r.read(item.filename)
                if fnmatch(item.filename, name_filter):
                    data = change(data)
                w.writestr(item, data)
        shutil.move(tempname, file_name)
    finally:
        shutil.rmtree(tempdir)

change_in_zip("opto_data.xlsx", 
              name_filter='xl/worksheets/*.xml', # the problematic property is found in the worksheet xml files
              change=lambda d: d.replace(b' synchVertical="1"', b' '))

Read in xlsx in Pandas with openpyxl

2 Answers2