Reading sub/superscript from Excel file using Python pandas

Question

I am trying to read formatted text from an Excel file. It consists of a series of reactions, and some reactants have subscripts, while others have superscripts. I am struggling to distinguish between superscripts and subscripts.

For example, say a reactant is:

O₂⁺

I want to distinguish it from

O²⁺

There is a related question here: Usage of unicode() and encode() functions in Python

From this, I thought maybe I should be converting the string to unicode, but pandas doesn't seem to read the superscript/subscript at all.

So, the text is formatted within the cell using the subscript and superscript font options? — James, Dec 01 '17 at 01:11

score 0 · Answer 1 · answered Dec 01 '17 at 14:48

Pandas uses xlrd to read in Excel files, and unfortunately xlrd does not parse the formatting used within the cell.

The openpyxl package can handle string formatting, but only for the older .xls format. You will probably need to write a custom parser that looks at the underlying XML data.

An .xlsx file is just a zip archive that contains a collection of xml files. The two that you need to look at are

file.xlsx/xl/sharedStrings.xml
file.xlsx/xl/worksheets/sheet1.xml (or whatever your sheet is named)

The first file contains the string with their formatting. The second file contains the cells in the sheet and which string they reference. The strings are referenced by their order in the sharedStrings.xml.

Here is a screen cap of an example excel sheet:

Example sharedStrings.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="3" uniqueCount="3">
  <si>
    <r>
      <t>O</t>
    </r>
    <r>
      <rPr>
        <vertAlign val="subscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
      </rPr>
      <t>2</t>
    </r>
    <r>
      <rPr>
        <vertAlign val="superscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
      </rPr>
      <t>+</t>
    </r>
  </si>
  <si>
    <r>
      <t>H</t>
    </r>
    <r>
      <rPr>
        <vertAlign val="subscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
      </rPr>
      <t>2</t>
    </r>
    <r>
      <rPr>
        <sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
      </rPr>
      <t>O</t>
    </r>
  </si>
  <si>
    <r>
      <t>O</t>
    </r>
    <r>
      <rPr>
        <vertAlign val="superscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
      </rPr>
      <t>2+</t>
    </r>
  </si>
</sst>

Each <si> element is a clump of text. The <r> elements are the a set of characters and formatting. The <t> elements are the actual characters, and the preceeding <rPr> elements are the styling/formatting. You can distinguish when the superscript and subscript occur in <vertAlign> element.

Example of sheet1.xml:

<?xml version="1.0" encoding="UTF-8"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" mc:Ignorable="x14ac">
  <dimension ref="A1:A3" />
  <sheetViews>
    <sheetView tabSelected="1" workbookViewId="0">
      <selection activeCell="A4" sqref="A4" />
    </sheetView>
  </sheetViews>
  <sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25" />
  <sheetData>
    <row r="1" spans="1:1" ht="18.75" x14ac:dyDescent="0.35">
      <c r="A1" t="s">
        <v>0</v>
      </c>
    </row>
    <row r="2" spans="1:1" ht="18" x14ac:dyDescent="0.35">
      <c r="A2" t="s">
        <v>1</v>
      </c>
    </row>
    <row r="3" spans="1:1" ht="17.25" x14ac:dyDescent="0.25">
      <c r="A3" t="s">
        <v>2</v>
      </c>
    </row>
  </sheetData>
  <pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3" />
  <pageSetup orientation="portrait" r:id="rId1" />
</worksheet>

The cell elements <c> are contained within the <row> elements. The value of <v> is a reference to the order/index of the shared string. So cell A1 has the value of the 0th sharedString, which is an 0 followed by a subscripted 2 followed by a superscripted +.:

<si>
<r>
  <t>O</t>
</r>
<r>
  <rPr>
    <vertAlign val="subscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
  </rPr>
  <t>2</t>
</r>
<r>
  <rPr>
    <vertAlign val="superscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
  </rPr>
  <t>+</t>
</r>
</si>

To start parsing the Excel file, you can use:

import zipfile
from lxml import etree

z = zipfile.ZipFile('file.xlsx')

with z.open('xl/sharedStrings.xml') as fp:
    ss = etree.fromstring(fp.read())

with z.open('xl/worksheets/sheet1.xml') as fp:
    sh1 = etree.fromstring(fp.read())

z.close()

# get the namespaces
ssns = ss.nsmap
if None in ssns:
    ssns['none'] = ssns.pop(None)

sh1ns = sh1.nsmap
if None in sh1ns:
    sh1ns['none'] = sh1ns.pop(None)

text_list = ss.xpath('//none:si', namespaces=ssns)
cell_list = sh1.xpath('//none:c', namespaces=sh1ns)

You will have to write a parser to convert the formatting to the format you want.

Would it be easier to save the file as 'xls' and use openpyxl? — rahulv88, Dec 01 '17 at 16:14
actually I get an error when I did that, it says openpyxl doesn't support xls format — rahulv88, Dec 01 '17 at 16:32

Reading sub/superscript from Excel file using Python pandas

1 Answers1