Read xls file containing xml data in python

Question

An xls file has xml data inside it. Top portion of the file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Styles>
<Style ss:ID="sDT"><NumberFormat ss:Format="Short Date"/></Style>
</Styles>
<Worksheet ss:Name="XXX">
    <Table>
        <Row>
            <Cell><Data ss:Type="String">Request ID</Data></Cell>
            <Cell><Data ss:Type="String">Date</Data></Cell>
            <Cell><Data ss:Type="String">XXX ID</Data></Cell>
            <Cell><Data ss:Type="String">Customer Name</Data></Cell>
            <Cell><Data ss:Type="String">Amount</Data></Cell>
            <Cell><Data ss:Type="String">Requested Action</Data></Cell>
            <Cell><Data ss:Type="String">Status</Data></Cell>
            <Cell><Data ss:Type="String">Transaction ID</Data></Cell>
            <Cell><Data ss:Type="String">Merchant UTR</Data></Cell>
        </Row>

How can I read it into a Pandas DataFrame using pandas.read_xml. (Any other way of reading it into a DataFrame will also do.)

Note: Have already tried various solutions using read_excel with and without engine ="openpyxl". Different errors are displayed. (See comments below. The comments also contain a link to the same problem faced by others earlier.)

@DeepSpace As I unable to read the file using pandas.read_excel or read_csv — user1955215, Jul 08 '22 at 15:43
So ask about that (and provide the exact error that you get) instead of asking an [xy problem](https://xyproblem.info) — DeepSpace, Jul 08 '22 at 15:45
Have made many attempts to read the file using read_excel. The file does not readily open in Excel. It opens only after ignoring a message about its format not matching the extension etc. https://stackoverflow.com/questions/33470130/read-excel-xml-xls-file-with-pandas. Have tried solutions like this and others. Hence, out of desperation looking for a read_xml solution.. — user1955215, Jul 08 '22 at 16:09
read_excel(filename) gives this error: ValueError: Excel file format cannot be determined, you must specify an engine manually If I add "engine = "openpyxl", gives this error: BadZipFile: File is not a zip file — user1955215, Jul 08 '22 at 16:15
@mzjn. On reading your advice about editing the question, instead of adding comments, have done so. — user1955215, Jul 08 '22 at 16:19
What *tool* did you use to see the xml data? A good old text editor like notepad or vi? — Serge Ballesta, Jul 08 '22 at 16:27
@StevenRumbalski I received the file as an xls file. As it was having issues opening in Excel and suspecting the file extension to be the cause of the problem, renamed the extension to xlsx. It refused to open. Then renamed the file to .zip (I have done this with .xlsx files before). The zip file refused to get extracted. Then, I renamed the file to .txt and opened it in notepad to discover the structure written in the question above. — user1955215, Jul 08 '22 at 16:27
Ok, that file looks like a Worksheet XML 2003 file. It should have a xml extension and provided having that extension should be readable in Excel. Once in Excel you should be able to save its data in whatever format you want (csv, etc.) — Serge Ballesta, Jul 08 '22 at 16:40
@StevenRumbalski. If I open the .xls file (after ignoring the warning that the extension and format are having issues), it does open and the data is displayed in a sheet. If I try to save it, a format "XML Spreadsheet (2003).xml" is offered by Excel as the default format. Using this option, the file is saved as an xml file. If I ignore Excel's default file format (xml) offering and save it as an .xlsx file, then it will subsequently open without a problem. The problem is I am having many such files and will also receive them in future. — user1955215, Jul 08 '22 at 16:42
@mzjn. I am discovering issues in response to others questions and sharing them as responses. I agree that I should have had the foresight to anticipate all these before and made it a part of the original question itself. Hope someone provides a solution or says that there is none so that I dont have to proceed further. — user1955215, Jul 08 '22 at 16:50

score 1 · Accepted Answer · answered Jul 09 '22 at 09:15

Your file is a valid xml file. I know no automatic converter for that but Excel, but it can easily be parsed as a mere xml file, for example with BeautifulSoul.

If the internal format is simple enough, you could just process the Worksheet, row and cell tags to convert it to a csv file:

from bs4 import BeautifulSoup
import csv
import io

soup = BeautifulSoup(open('file.xxx'))
    
with open('file.csv', newline='') as fdout:
    wr = csv.writer(fdout)
    sheet = soup.find('worksheet')
    for row in sheet.findAll('row'):
        wr.writerow(cell.text for cell in row.findAll('cell'))

Using your sample data, it gives as expected:

Request ID,Date,XXX ID,Customer Name,Amount,Requested Action,Status,Transaction ID,Merchant UTR

Sorry for the delay in responding. Your code did exactly what I wanted. I then read the csv file into a DataFrame. Thanks. Any inputs on what the code would be like if the Python xml module was used. — user1955215, Jul 12 '22 at 14:29

score 0 · Answer 2 · answered Jul 08 '22 at 18:00

0

Try to define another engine:

df = pd.read_excel('test.xls', engine='xlrd')

Note that you need to install xlrd library, e.g.:

pip install xlrd

answered Jul 08 '22 at 18:00

Yury

20,618
7
58
86

WIth engine = 'xlrd', the following error message appeared XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b' – user1955215 Jul 09 '22 at 00:40
Then, probably, your file is corrupted or stored in an incorrect format. I've tried with an xls file locally and this approach worked. I think that your xls is generated by some software and the format is deliberately made to be not compatible with software to read the data. – Yury Jul 09 '22 at 07:04

Read xls file containing xml data in python

2 Answers2