I am merging multiple excel files with pandas and am getting a Traceback error below. I don't quite understand it and was hoping someone could help me understand it. The job is still completing but it is erroring out in the console. The files are all xlsx files and have been open and re-saved as xlsx in order to verify that it is not a format issue.
Traceback (most recent call last):
File "/Users/Documents/Python Scripts/Merge Scan xlsx_copy.py",
line 13, in <module>
df = pd.read_excel(f)
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/pandas/util/_decorators.py", line 118, in wrapper
return func(*args, **kwargs)
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/pandas/io/excel.py", line 230, in read_excel
io = ExcelFile(io, engine=engine)
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/pandas/io/excel.py", line 294, in __init__
self.book = xlrd.open_workbook(self._io)
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/__init__.py", line 162, in open_workbook
ragged_rows=ragged_rows,
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/book.py", line 1271, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/book.py", line 1265, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF
record; found b'\x15Microso'
As I said the job completes however all of the columns from the merged files do not line up. Some start on column A another on column B, another on column E. Can someone tell me why it would not just append it starting at column A. My script is below:
import pandas as pd
import numpy as np
import glob
from sys import argv
script, file_location, outpath = argv
files = glob.glob(file_location + "*.xlsx")
all_data = pd.DataFrame()
for f in files:
df = pd.read_excel(f)
all_data = all_data.append(df)
all_data.to_excel(outpath + ".xlsx")