I am using python-docx to read through .docx files, find a particular string (e.g. a date), and replace it with another string (e.g. a new date).
Here are the two functions I am using:
import os
import re
from glob import glob

from docx import Document


def docx_replace_regex(doc_obj, regex, replace):
    """Replace regex matches run by run in doc_obj (a document or a table cell)."""
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for run in inline:
                if regex.search(run.text):
                    run.text = regex.sub(replace, run.text)

    # Table cells contain their own paragraphs and tables, so recurse into them
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex, replace)


def replace_date(folder, replaceDate, date=None):
    """Replace dates with replaceDate in every .docx file under folder."""
    if date:  # date is an optional specific date to replace, matched literally
        regex = re.compile(re.escape(date))
    else:  # If no date provided, replace all dates
        regex = re.compile(r"(\w{3,12}\s\d{1,2}\,?\s?[0-9]{4})|((the\s)?\d{1,2}[th]{0,2}\sday\sof\s\w{3,12}\,\s?\d{4})")
    # Walk the folder tree and collect every .docx file
    docs = [y for x in os.walk(folder) for y in glob(os.path.join(x[0], '*.docx'))]
    for doc in docs:
        docObj = Document(doc)
        docx_replace_regex(docObj, regex, replaceDate)
        docObj.save(doc)
The first function is essentially a find-and-replace for docx files. The second function recursively walks a folder to find the docx files to process. The details of the regex aren't relevant (I think): it matches several date formats, works as I want it to, and shouldn't affect my issue.
When a document is passed to docx_replace_regex, the function iterates through the paragraphs, then through each paragraph's runs, and searches each run for my regex. The issue is that the runs sometimes split a single line of text: the regex would match the paragraph's plain text, but because that text is spread across several runs, no individual run matches and nothing is replaced.
For example, if my paragraph is "10th day of May, 2020", the inline array may be ['1','0th day of May,',' 2020'].
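To make the problem concrete without needing a real document, here is a minimal reproduction on plain strings. The list `run_texts` is a hypothetical stand-in for `[run.text for run in p.runs]`, and `DATE_RE` is the same pattern used in replace_date:

```python
import re

# Same date pattern as in replace_date.
DATE_RE = re.compile(
    r"(\w{3,12}\s\d{1,2}\,?\s?[0-9]{4})"
    r"|((the\s)?\d{1,2}[th]{0,2}\sday\sof\s\w{3,12}\,\s?\d{4})"
)

# Hypothetical run texts, standing in for [run.text for run in p.runs].
run_texts = ["1", "0th day of May,", " 2020"]

# The joined text is what the paragraph-level check sees.
paragraph_text = "".join(run_texts)
paragraph_match = DATE_RE.search(paragraph_text)

# The per-run check looks at each fragment in isolation.
per_run_matches = [bool(DATE_RE.search(text)) for text in run_texts]

print(paragraph_match.group(0))  # 10th day of May, 2020
print(per_run_matches)           # [False, False, False]
```

The paragraph-level search succeeds, but every per-run search fails, so the inner loop never performs a substitution.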
Initially, I joined the text of the runs in inline to get "10th day of May, 2020", but then I can't replace the run with the new text, because inline is now a string rather than a list of run objects. Even if I kept inline as run objects, the substitution would still replace only one part of the text I'm looking for.
I'm looking for any ideas on how to properly replace the portion of text captured by my regex. Alternatively, I'd like to understand why the sentence is being broken up into separate runs like this.
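One direction that might work is to search the joined text and then map the match span back onto the individual runs. The sketch below is only an illustration on plain strings, not a tested python-docx solution: `replace_across_runs` is a hypothetical helper, `SIMPLE_DATE_RE` is a simplified stand-in for the real pattern, and it assumes matches are never empty. The replacement is written into the run where the match starts, and the matched characters are removed from any later runs the match spills into.

```python
import re


def replace_across_runs(run_texts, regex, replacement):
    """Return new run texts with each match in the joined text replaced.

    Hypothetical helper: works on a list of strings, not on run objects.
    """
    texts = list(run_texts)
    pos = 0  # resume searching after the last replacement
    while True:
        match = regex.search("".join(texts), pos)
        if match is None or match.start() == match.end():
            return texts  # no more matches (empty matches are not handled)
        start, end = match.span()
        new_text = match.expand(replacement)
        offset = 0
        placed = False
        for i, text in enumerate(texts):
            run_start, run_end = offset, offset + len(text)
            offset = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            keep_before = text[:max(start - run_start, 0)]
            keep_after = text[end - run_start:]
            # Only the first overlapping run receives the replacement text.
            texts[i] = keep_before + ("" if placed else new_text) + keep_after
            placed = True
        pos = start + len(new_text)


# Simplified stand-in pattern, not the full regex from replace_date.
SIMPLE_DATE_RE = re.compile(r"\d{1,2}(?:st|nd|rd|th)?\sday\sof\s\w{3,12},\s?\d{4}")

result = replace_across_runs(
    ["1", "0th day of May,", " 2020"], SIMPLE_DATE_RE, "June 1, 2021"
)
print(result)  # ['June 1, 2021', '', '']
```

With real runs, the returned strings could presumably be written back with `run.text = new_text` for each pair in `zip(p.runs, new_texts)`. The trade-off is that the whole replacement takes on the formatting of the run where the match began.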