0

An xls file has xml data inside it. Top portion of the file is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<Styles>
<Style ss:ID="sDT"><NumberFormat ss:Format="Short Date"/></Style>
</Styles>
<Worksheet ss:Name="XXX">
    <Table>
        <Row>
            <Cell><Data ss:Type="String">Request ID</Data></Cell>
            <Cell><Data ss:Type="String">Date</Data></Cell>
            <Cell><Data ss:Type="String">XXX ID</Data></Cell>
            <Cell><Data ss:Type="String">Customer Name</Data></Cell>
            <Cell><Data ss:Type="String">Amount</Data></Cell>
            <Cell><Data ss:Type="String">Requested Action</Data></Cell>
            <Cell><Data ss:Type="String">Status</Data></Cell>
            <Cell><Data ss:Type="String">Transaction ID</Data></Cell>
            <Cell><Data ss:Type="String">Merchant UTR</Data></Cell>
        </Row>

How can I read it into a Pandas DataFrame using pandas.read_xml. (Any other way of reading it into a DataFrame will also do.)

Note: Have already tried various solutions using read_excel with and without engine ="openpyxl". Different errors are displayed. (See comments below. The comments also contain a link to the same problem faced by others earlier.)

user1955215
  • 733
  • 2
  • 11
  • 31
  • 1
    Why are you trying to read it as xml? – DeepSpace Jul 08 '22 at 15:38
  • @DeepSpace As I unable to read the file using pandas.read_excel or read_csv – user1955215 Jul 08 '22 at 15:43
  • 1
    So ask about that (and provide the exact error that you get) instead of asking an [xy problem](https://xyproblem.info) – DeepSpace Jul 08 '22 at 15:45
  • Have made many attempts to read the file using read_excel. The file does not readily open in Excel. It opens only after ignoring a message about its format not matching the extension etc. https://stackoverflow.com/questions/33470130/read-excel-xml-xls-file-with-pandas. Have tried solutions like this and others. Hence, out of desperation looking for a read_xml solution.. – user1955215 Jul 08 '22 at 16:09
  • read_excel(filename) gives this error: ValueError: Excel file format cannot be determined, you must specify an engine manually If I add "engine = "openpyxl", gives this error: BadZipFile: File is not a zip file – user1955215 Jul 08 '22 at 16:15
  • @mzjn. On reading your advice about editing the question, instead of adding comments, have done so. – user1955215 Jul 08 '22 at 16:19
  • What *tool* did you use to see the xml data? A good old text editor like notepad or vi? – Serge Ballesta Jul 08 '22 at 16:27
  • @StevenRumbalski I received the file as an xls file. As it was having issues opening in Excel and suspecting the file extension to be the cause of the problem, renamed the extension to xlsx. It refused to open. Then renamed the file to .zip (I have done this with .xlsx files before). The zip file refused to get extracted. Then, I renamed the file to .txt and opened it in notepad to discover the structure written in the question above. – user1955215 Jul 08 '22 at 16:27
  • @SergeBallesta I saw the file contents in Notepad++ – user1955215 Jul 08 '22 at 16:28
  • Ok, that file looks like a Worksheet XML 2003 file. It should have a xml extension and provided having that extension should be readable in Excel. Once in Excel you should be able to save its data in whatever format you want (csv, etc.) – Serge Ballesta Jul 08 '22 at 16:40
  • @StevenRumbalski. If I open the .xls file (after ignoring the warning that the extension and format are having issues), it does open and the data is displayed in a sheet. If I try to save it, a format "XML Spreadsheet (2003).xml" is offered by Excel as the default format. Using this option, the file is saved as an xml file. If I ignore Excel's default file format (xml) offering and save it as an .xlsx file, then it will subsequently open without a problem. The problem is I am having many such files and will also receive them in future. – user1955215 Jul 08 '22 at 16:42
  • @Serge Ballesta. You are right. – user1955215 Jul 08 '22 at 16:43
  • @mzjn. I am discovering issues in response to others questions and sharing them as responses. I agree that I should have had the foresight to anticipate all these before and made it a part of the original question itself. Hope someone provides a solution or says that there is none so that I dont have to proceed further. – user1955215 Jul 08 '22 at 16:50

2 Answers2

1

Your file is a valid xml file. I know no automatic converter for that but Excel, but it can easily be parsed as a mere xml file, for example with BeautifulSoul.

If the internal format is simple enough, you could just process the Worksheet, row and cell tags to convert it to a csv file:

from bs4 import BeautifulSoup
import csv
import io

soup = BeautifulSoup(open('file.xxx'))
    
with open('file.csv', newline='') as fdout:
    wr = csv.writer(fdout)
    sheet = soup.find('worksheet')
    for row in sheet.findAll('row'):
        wr.writerow(cell.text for cell in row.findAll('cell'))

Using your sample data, it gives as expected:

Request ID,Date,XXX ID,Customer Name,Amount,Requested Action,Status,Transaction ID,Merchant UTR
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
  • Sorry for the delay in responding. Your code did exactly what I wanted. I then read the csv file into a DataFrame. Thanks. Any inputs on what the code would be like if the Python xml module was used. – user1955215 Jul 12 '22 at 14:29
0

Try to define another engine:

df = pd.read_excel('test.xls', engine='xlrd')

Note that you need to install xlrd library, e.g.:

pip install xlrd
Yury
  • 20,618
  • 7
  • 58
  • 86
  • WIth engine = 'xlrd', the following error message appeared XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b' – user1955215 Jul 09 '22 at 00:40
  • Then, probably, your file is corrupted or stored in an incorrect format. I've tried with an xls file locally and this approach worked. I think that your xls is generated by some software and the format is deliberately made to be not compatible with software to read the data. – Yury Jul 09 '22 at 07:04