13

I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production -- specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

In contrast, the file for the most recent year available, 2013, coalpublic2013.xls file, works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

but the next decade of .xls files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.

The error that I get from pandas is:

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

And I have tried various other things:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

to no avail. My pandas version: 0.17.0

I've also submitted this as a bug to the pandas github issues list.

jrovegno
  • 699
  • 5
  • 11
JBWhitmore
  • 11,576
  • 10
  • 38
  • 52

5 Answers5

35

You can convert this Excel XML file programmatically. Requirement: only python and pandas.

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])
jrovegno
  • 699
  • 5
  • 11
  • 1
    You can use `xlsxwriter` module instead of the whole `pandas`. Of course in case you don't need it:) An improved version of excel-xml parser https://gist.github.com/Winand/395e7d483434ee3828d04fe5732ca690 – Winand Jun 19 '17 at 07:24
  • @jrovegno This is so far the best answer with pandas I can find online. The problem is that it lost the sheet name information at least at my hand. – Jason LiLy Feb 12 '19 at 00:53
  • is it possible to get the sheet name? @jrovegno – Jason LiLy Jul 09 '19 at 18:39
  • As implemented in @Winand link you can access the sheet names. Initialize self.worksheet_names = [] in \_\_init\_\_ and in startElement check if name is worksheet if yes get the sheet name by attribute "ss:Name" code `elif name == "Worksheet": self.worksheet_names.append(atts.getValue("ss:Name"))` – Ash Ishh Oct 21 '19 at 12:33
  • thankyou so much, I looked everywhere to read that xml file and you finally helped me. And it is much faster than other methods that finally do not extract correctly. Just in my case a minor adjustment on the excelhandler's position of the headers. – Carl Kirstein Oct 18 '22 at 08:51
  • This works, but it parsed whitespace as column names for my input, so it could be improved. Also needs to be reformatted. And the link is dead. – xjcl Jan 19 '23 at 01:55
5

The problem is that while the 2013 data is an actual Excel file, the 2012 data is an XML document, something which seems to not be supported in Python. I would say your best bet is to open it in Excel, and save a copy as either a proper Excel file, or as a CSV.

maxymoo
  • 35,286
  • 11
  • 92
  • 119
  • Yes, I suspect that you're right. I was using this as an example for a course that people were going to follow along with -- and I didn't want to have to make them have Excel to do this analysis. I was hoping for a way to force this conversion via a python (if not pandas) tool. – JBWhitmore Nov 02 '15 at 03:49
  • well it looks like the problem is due to pandas using an old version `xlrd` ... possibly this is something that could be fixed – maxymoo Nov 02 '15 at 03:50
  • I have opened an issue with the `xlrd` here https://github.com/python-excel/xlrd/issues/156 – JBWhitmore Nov 02 '15 at 18:55
  • Agree a simpler approach is obtain a csv source if possible. Just want to point out, XML is supported in Python, using libraries like ElementTree https://docs.python.org/2/library/xml.etree.elementtree.html or lxml http://lxml.de/ It might not be directly importable to Pandas, but that's because XML is a nested data structure. Pandas data can contain "list-like objects", but the conversion from XML nodes & elements to list-like objects is ambiguous so you'd have to specify the mapping, or go through a process of inferring or perhaps flattening the data to make that conversion consistent. – Davos Apr 28 '17 at 01:31
  • 2
    @Davos ah yeah i think i must have meant to write that xml is not supported in **pandas** ... incidentally since writing this answer i have **occasionally** had success reading xml directly into pandas with `xmltodict` – maxymoo Apr 28 '17 at 04:46
  • @maxymoo That's an interesting idea, coincidence I read something recently on converting XML to JSON using `xmltodict` . My concern with any inferring technique is that it will infer different schema for different example XML file, even if they share a common XSD, because optional fields / ambiguity around complex nested types. What I'd really like is a way to declare dataframe from XSD file. In the meantime I'm using an application to generate a "gold" example XML file (including optional values) from an XSD and then using that XML to feed to a dataframe. xmltodict might then be consistent. – Davos Apr 28 '17 at 04:59
2

You can convert this Excel XML file programmatically. Requirement: Windows, Office installed.

1.Create in Notepad ExcelToCsv.vbs script:

if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit
  1. Convert the Excel XML file in CSV:

$ cscript ExcelToCsv.vbs coalpublic2012.xls coalpublic2012.csv 1

  1. Open the CSV file with pandas

>>> df1 = pd.read_csv('coalpublic2012.csv', skiprows=3)

Reference: Faster way to read Excel files to pandas dataframe

Community
  • 1
  • 1
jrovegno
  • 699
  • 5
  • 11
0

Here is my update of @jrovegno's approach (which is copied from "Python Cookbook 2nd Edition"), because that code was adding whitespace to my header row and not generic enough:

import pandas as pd
import xml.sax

def xml_to_dfs(path):
    """Read Excel XML file at path and return list of DataFrames"""

    class ExcelXMLHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            self.tables = []
            self.chars = []

        def characters(self, content):
            self.chars.append(content)

        def startElement(self, name, attrs):
            if name == "Table":
                self.rows = []
            elif name == "Row":
                self.cells = []
            elif name == "Data":
                self.chars = []

        def endElement(self, name):
            if name == "Table":
                self.tables.append(self.rows)
            elif name == "Row":
                self.rows.append(self.cells)
            elif name == "Data":
                self.cells.append("".join(self.chars))

    exh = ExcelXMLHandler()
    xml.sax.parse(path, exh)
    return [pd.DataFrame(table[1:], columns=table[0]) for table in exh.tables]

Basically, my XML appears to be structured like this:

<Worksheet>
    <Table>
        <Row>
            <Cell>
                <Data>  # appears redundant with <Cell>
xjcl
  • 12,848
  • 6
  • 67
  • 89
-2

@JBWhitmore I have run the following code:

import pandas as pd
#Read and write to excel
dataFileUrl = r"/Users/stutiverma/Downloads/coalpublic2012.xls"
data = pd.read_table(dataFileUrl)

This reads the file successfully without giving any error. But, it gives all the data in the exact format as mentioned. So, you may have to do extra efforts in order to process the data after reading it successfully.

Stuti Verma
  • 1,059
  • 13
  • 32