Pandas uses xlrd to read in Excel files, and unfortunately xlrd does not parse the formatting used within the cell.
The openpyxl package can handle string formatting, but only for the older .xls format. You will probably need to write a custom parser that looks at the underlying XML data.
An .xlsx file is just a zip archive that contains a collection of xml files. The two that you need to look at are
- file.xlsx/xl/sharedStrings.xml
- file.xlsx/xl/worksheets/sheet1.xml (or whatever your sheet is named)
The first file contains the string with their formatting. The second file contains the cells in the sheet and which string they reference. The strings are referenced by their order in the sharedStrings.xml.
Here is a screen cap of an example excel sheet:

Example sharedStrings.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="3" uniqueCount="3">
<si>
<r>
<t>O</t>
</r>
<r>
<rPr>
<vertAlign val="subscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>2</t>
</r>
<r>
<rPr>
<vertAlign val="superscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>+</t>
</r>
</si>
<si>
<r>
<t>H</t>
</r>
<r>
<rPr>
<vertAlign val="subscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>2</t>
</r>
<r>
<rPr>
<sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>O</t>
</r>
</si>
<si>
<r>
<t>O</t>
</r>
<r>
<rPr>
<vertAlign val="superscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>2+</t>
</r>
</si>
</sst>
Each <si>
element is a clump of text. The <r>
elements are the a set of characters and formatting. The <t>
elements are the actual characters, and the preceeding <rPr>
elements are the styling/formatting. You can distinguish when the superscript and subscript occur in <vertAlign>
element.
Example of sheet1.xml:
<?xml version="1.0" encoding="UTF-8"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" mc:Ignorable="x14ac">
<dimension ref="A1:A3" />
<sheetViews>
<sheetView tabSelected="1" workbookViewId="0">
<selection activeCell="A4" sqref="A4" />
</sheetView>
</sheetViews>
<sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25" />
<sheetData>
<row r="1" spans="1:1" ht="18.75" x14ac:dyDescent="0.35">
<c r="A1" t="s">
<v>0</v>
</c>
</row>
<row r="2" spans="1:1" ht="18" x14ac:dyDescent="0.35">
<c r="A2" t="s">
<v>1</v>
</c>
</row>
<row r="3" spans="1:1" ht="17.25" x14ac:dyDescent="0.25">
<c r="A3" t="s">
<v>2</v>
</c>
</row>
</sheetData>
<pageMargins left="0.7" right="0.7" top="0.75" bottom="0.75" header="0.3" footer="0.3" />
<pageSetup orientation="portrait" r:id="rId1" />
</worksheet>
The cell elements <c>
are contained within the <row>
elements. The value of <v>
is a reference to the order/index of the shared string. So cell A1
has the value of the 0th sharedString, which is an 0
followed by a subscripted 2
followed by a superscripted +
.:
<si>
<r>
<t>O</t>
</r>
<r>
<rPr>
<vertAlign val="subscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>2</t>
</r>
<r>
<rPr>
<vertAlign val="superscript"/><sz val="11"/><color theme="1"/><rFont val="Calibri"/><family val="2"/><scheme val="minor" />
</rPr>
<t>+</t>
</r>
</si>
To start parsing the Excel file, you can use:
import zipfile
from lxml import etree
z = zipfile.ZipFile('file.xlsx')
with z.open('xl/sharedStrings.xml') as fp:
ss = etree.fromstring(fp.read())
with z.open('xl/worksheets/sheet1.xml') as fp:
sh1 = etree.fromstring(fp.read())
z.close()
# get the namespaces
ssns = ss.nsmap
if None in ssns:
ssns['none'] = ssns.pop(None)
sh1ns = sh1.nsmap
if None in sh1ns:
sh1ns['none'] = sh1ns.pop(None)
text_list = ss.xpath('//none:si', namespaces=ssns)
cell_list = sh1.xpath('//none:c', namespaces=sh1ns)
You will have to write a parser to convert the formatting to the format you want.