Thanks for all the kind help, but I worked this out on my own.
I got this to work using zipfile module. Apparently, Excel is actually a suite that works on compressed XML files (changing the *.xlsx to *.zip reveals the contents of the file) when saving and reading from *.xlsx, so I could easily search the text needed with the comfort of XML.
Here's the module I made. By calling Sheet('path/to/sheet.xlsx').shapes.text
, you can now easily find the text inside textboxes:
import zipfile as z
class Sheet(str):
@property
def shapes(this):
s = z.ZipFile(this)
p='xl/drawings/drawing1.xml' # shapes path, *.xlsx default
p='drs/shapexml.xml' # shapes path, *.xls default
return XML(s.read(p))
class XML(object):
def __init__(self, value):
self.value = str(value)
def __repr__(self):
return repr(self.value)
def __getitem__(self, i):
return self.value[i]
def tag_content(self, tag):
return [XML(i) for i in self.value.split(tag)[1::2]]
@property
def text(self):
t = self.tag_content('xdr:txBody') # list of XML codes, each containing a seperate textboxes, messy (with extra xml that is)
l = [i.tag_content('a:p>') for i in t] # split into sublists by line breaks (inside the textbox), messy
w = [[[h[1:-2] for h in i.tag_content('a:t')] if i else ['\n'] for i in j] for j in l] # clean into sublists by cell-by-cell basis (and mind empty lines)
l = [[''.join(i) for i in j] for j in w] # join lines overlapping multiple cells into one sublist
return ['\n'.join(j) for j in l] # join sublists of lines into strings seperated by newline char
So now the pattern provided in my question will be output as ['comments extra']
, while a pattern such as:
This
is
Text
in
a
textbox
on
a
sheet
And this
is another text box somewhere else
Regardless of the overlapped cells
Will be output as ['This is\nText in a textbox on\na sheet','And this is another text box somewhere else\nRegardless of the overlapped cells']
.
You're welcome.