Retrieving full rich-text data from a cell (multiple font color / styles within cell)

Question

openpyxl can only read and write one font color per cell

xlwriter supports multiple font colors but only serves for writing and not reading.

xlwt is only for writing too.

xlwings doesn't support rich text according to their docs.

xlrd apparently supports reading "rich text" data from a cell according to this thread, but I'm unable to locate the appropriate method in their API spec and there is no article in their documentation dealing with this AFAIK.

And to top it all of, the project is unmaintained and avises to use openpyxl.

It seems I'm at a dead end.

Do you know if it is at all possible to retrieve, in python, the full data from such a cell :

in particular the full style data, i.e. the different colors and formatting (bold, etc..) used along the cell string.

Thanks a lot in advance for your help.

PS: If you know how to do it with a google sheet through the Google Sheet API instead, that can also work for me. (Or even in .odt format)

score 2 · Answer 1 · answered Oct 09 '21 at 09:56

I encountered the same problem. I need to find the text spans in red color in some rich-text cells. After diving into the source code of openpyxl (v3.0.9), I found it do parse the rich-text tags but the formats are stripped by the reader as the content property of Text object is used in the read_string_table function.

So, I wrote a simple patch script to override the read_string_table function such that the raw Text object is returned when formatted text exists. The modified read_string_table function is as below.

def read_string_table(xml_source):
    """Read in all shared strings in the table.
    If a shared string has formatted snippets, the raw Text object is appended to the returned list.
    Otherwise, only the plain text content of the shared string is appended to the list.
    """
    strings = []
    STRING_TAG = '{%s}si' % SHEET_MAIN_NS
    for _, node in iterparse(xml_source):
        if node.tag == STRING_TAG:
            text_obj = Text.from_tree(node)
            if text_obj.formatted:
                text = text_obj  # return raw Text object
            else:  # original processing
                text = text_obj.content
                text = text.replace('x005F_', '')
            node.clear()
            strings.append(text)
    return strings

The full patch script is available here. You need to import it and call the patch_read_string_table function before importing any openpyxl modules directly. After applying this patch, the value of a Rich-text cell will be a Text object that contains all the style information you want.

Depending on your use case, this might not be the best solution, but it shows you where the formats are stripped and how you may get them back. I hope that a more elegant solution can be proposed and eventually merged into the official code in the future.

How can we extract portions of the cell text that are bold using this method? — Francesco Pettini, Nov 08 '22 at 15:21

Retrieving full rich-text data from a cell (multiple font color / styles within cell)

1 Answers1