0

I am merging multiple excel files with pandas and am getting a Traceback error below. I don't quite understand it and was hoping someone could help me understand it. The job is still completing but it is erroring out in the console. The files are all xlsx files and have been open and re-saved as xlsx in order to verify that it is not a format issue.

Traceback (most recent call last):
File "/Users/Documents/Python Scripts/Merge Scan xlsx_copy.py", 
line 13, in <module>
df = pd.read_excel(f)
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/pandas/util/_decorators.py", line 118, in wrapper
return func(*args, **kwargs)
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/pandas/io/excel.py", line 230, in read_excel
io = ExcelFile(io, engine=engine)
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/pandas/io/excel.py", line 294, in __init__
self.book = xlrd.open_workbook(self._io)
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/__init__.py", line 162, in open_workbook
ragged_rows=ragged_rows,
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/book.py", line 91, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/book.py", line 1271, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File 
"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-
packages/xlrd/book.py", line 1265, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF 
record; found b'\x15Microso'

As I said the job completes however all of the columns from the merged files do not line up. Some start on column A another on column B, another on column E. Can someone tell me why it would not just append it starting at column A. My script is below:

import pandas as pd
import numpy as np
import glob
from sys import argv
script, file_location, outpath = argv

files = glob.glob(file_location +  "*.xlsx")
all_data = pd.DataFrame()

for f in files:
    df = pd.read_excel(f)
    all_data = all_data.append(df)
    all_data.to_excel(outpath + ".xlsx")
Issam Beik
  • 61
  • 2
  • 8
  • So, you have a valid excel file and you get an error when you execute `pd.read_excel(f)`. I would say something is wrong with your xlsx file or your xlrd library. Or maybe the file is open.... Check [this](https://stackoverflow.com/questions/16504975/error-unsupported-format-or-corrupt-file-expected-bof-record) – joaquin Jan 06 '18 at 20:21
  • Thanks, the files are not password protected so that is not the case, in opening it up in a text editor it shows [Content_Types].xml. Would that be the cause of the issue then? – Issam Beik Jan 06 '18 at 20:29
  • No, xlsx files are a kind of MS XML file. So the content type is normal. In any case there are two different problems in your post. I would solve first the first one (xlrd problem) and see if the second remains. Do you get the error for each of the files read or just for some of them ? – joaquin Jan 06 '18 at 20:40
  • More: You will have each file on a diferent column if you do not give common columns labels along your files. By default, pandas takes the first row of data as the columns labels. – joaquin Jan 06 '18 at 20:55
  • Thanks. I will try reading the files individually and see if I get the same issue with some or others. – Issam Beik Jan 06 '18 at 21:18

2 Answers2

0

To the first issue, I was reading from a directory with various different file formats. Because of this it was trying to read the file formats that were not xlsx and that is where the traceback error was coming from. Once I moved the xlsx files to its own directory, the error went away.

To the second issue, the files that are being used has the header on a row other than row 1. For these files, there was data on row 1 - 5 and the header was on the 6th row. Once I removed row 1 - 5, the files were able to merge properly.

Issam Beik
  • 61
  • 2
  • 8
0

This will do what you want.

import pandas as pd

# filenames
excel_names = ["C:/Users/Excel/Desktop/Test/Book1.xlsx", "C:/Users/Excel/Desktop/Test/Book2.xlsx", "C:/Users/Excel/Desktop/Test/Book3.xlsx"]

# read them in
excels = [pd.ExcelFile(name) for name in excel_names]

# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]

# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]

# concatenate them..
combined = pd.concat(frames)

# write it out
combined.to_excel("c.xlsx", header=False, index=False)


# Results go to the default directory if not assigned somewhere else.
# C:\Users\Excel\.spyder-py3

As an aside, you may want to consider using the AddIn from the link below.

https://www.rondebruin.nl/win/addins/rdbmerge.htm

enter image description here

ASH
  • 20,759
  • 19
  • 87
  • 200