I am using Python-Docx to read through docx files, find a particular string (e.g. a date), and replace it with another string (e.g. a new date).

Here are the two functions I am using:

import os
import re
from glob import glob

from docx import Document


def docx_replace_regex(doc_obj, regex, replace):
    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text
    # Recurse into table cells, which hold their own paragraphs and tables
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex, replace)


def replace_date(folder, replaceDate, *date):
    docs = [y for x in os.walk(folder) for y in glob(os.path.join(x[0], '*.docx'))]
    for doc in docs:
        if date:  # Optional date pattern to replace (first extra argument)
            regex = re.compile(date[0])
        else:  # If no date is provided, replace all dates
            regex = re.compile(r"(\w{3,12}\s\d{1,2}\,?\s?[0-9]{4})|((the\s)?\d{1,2}[th]{0,2}\sday\sof\s\w{3,12}\,\s?\d{4})")
        docObj = Document(doc)
        docx_replace_regex(docObj, regex, replaceDate)
        docObj.save(doc)

The first function is essentially a find-and-replace for docx files. The second function recursively walks a folder to find the docx files to process. The details of the regex aren't relevant (I think): it matches several different date formats, works as I want it to, and shouldn't affect my issue.

When a document is passed to docx_replace_regex, that function iterates through the paragraphs, then through each paragraph's runs, and searches each run for my regex. The issue is that a single line of text is sometimes split across several runs. The regex would match if the paragraph were plain text, but because each run holds only a fragment of the date, no individual run matches.

For example, if my paragraph is "10th day of May, 2020", the inline array may be ['1','0th day of May,',' 2020'].
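
The mismatch can be reproduced with plain strings, without python-docx at all. The snippet below is only an illustration: `date_re` is a simplified stand-in for the "day of" half of the regex above (an assumption, not the original pattern), and `run_texts` mimics the text of the three runs.

```python
import re

# Simplified stand-in for the "day of" date pattern (an assumption).
date_re = re.compile(r"\d{1,2}(?:st|nd|rd|th)?\sday\sof\s\w{3,12},\s?\d{4}")

# Text of the three runs from the example paragraph.
run_texts = ["1", "0th day of May,", " 2020"]

# Searching run by run: no single run contains a complete date.
per_run_hits = [bool(date_re.search(t)) for t in run_texts]

# Searching the joined text, which is what p.text returns.
joined = "".join(run_texts)
joined_hit = date_re.search(joined)

print(per_run_hits)        # [False, False, False]
print(joined_hit.group())  # 10th day of May, 2020
```

So the `regex.search(p.text)` check succeeds for the paragraph, but the inner per-run search never finds anything to substitute.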

Initially, I joined the inline array so that it would be equal to "10th day of May, 2020" but then I can't replace the run with the new text because my inline variable is a string, not a run object. Even if I kept inline as a run object it would still replace only one part of the text I'm looking for.

I'm looking for any ideas on how to properly replace the portion of text captured by my regex. Alternatively, I'd like to understand why the sentence is being broken up into separate runs the way it is.

fenix

1 Answer

This is not a simple problem, as it looks like you're starting to realize :)

The simplest possible approach is to search and replace in paragraph.text, like:

paragraph.text = my_replace_function(paragraph.text, ...)
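
Applied to the function in the question, that could look roughly like the sketch below. The name `replace_in_paragraph_text` is made up for illustration, and the demo uses `SimpleNamespace` stand-ins instead of a real document so that it runs without python-docx; with a real `Document`, assigning to `paragraph.text` replaces the paragraph's content with a single run.

```python
import re
from types import SimpleNamespace


def replace_in_paragraph_text(doc_obj, regex, replacement):
    """Run regex.sub over each paragraph's full text; return how many changed."""
    changed = 0
    for paragraph in doc_obj.paragraphs:
        if regex.search(paragraph.text):
            # Assigning to .text rewrites the whole paragraph at once.
            paragraph.text = regex.sub(replacement, paragraph.text)
            changed += 1
    # Table cells expose .paragraphs and .tables too, so recurse into them.
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                changed += replace_in_paragraph_text(cell, regex, replacement)
    return changed


# Stand-in objects for the demo (hypothetical, not python-docx classes).
demo_doc = SimpleNamespace(
    paragraphs=[
        SimpleNamespace(text="Dated the 10th day of May, 2020."),
        SimpleNamespace(text="No date here."),
    ],
    tables=[],
)
demo_re = re.compile(r"\d{1,2}th\sday\sof\s\w+,\s\d{4}")
changed = replace_in_paragraph_text(demo_doc, demo_re, "1st day of June, 2021")
print(changed, demo_doc.paragraphs[0].text)
```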

This works, but all character formatting is lost. A more sophisticated approach finds the offset of the search phrase, maps that to runs, and then splits and rejoins runs as necessary to change only those runs containing the search phrase.
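
A reduced version of that idea is sketched here, as an illustration rather than a complete solution. It does not split or rejoin runs; it only edits run text in place, so the replacement takes on the formatting of the first run the match touches. The function name `replace_across_runs` is hypothetical, the replacement is assumed to be a string (not a callable), and empty matches are skipped. The demo uses stand-in objects with a `.text` attribute; with python-docx you would pass `paragraph.runs`.

```python
import re
from types import SimpleNamespace


def replace_across_runs(runs, regex, replacement):
    """Replace matches that may span several runs; return the number replaced."""
    # Original (start, end) offsets of each run within the joined text.
    spans, pos = [], 0
    for run in runs:
        spans.append((pos, pos + len(run.text)))
        pos += len(run.text)
    full_text = "".join(run.text for run in runs)

    matches = [m for m in regex.finditer(full_text) if m.end() > m.start()]
    # Go right to left so the text before each match is still unmodified.
    for match in reversed(matches):
        m_start, m_end = match.span()
        new_text = match.expand(replacement)
        for run, (r_start, r_end) in zip(runs, spans):
            if r_end <= m_start or r_start >= m_end:
                continue  # this run holds no part of the match
            lo = max(m_start, r_start) - r_start
            hi = min(m_end, r_end) - r_start
            # The first overlapping run receives the replacement; the
            # others only lose their share of the matched text.
            run.text = run.text[:lo] + new_text + run.text[hi:]
            new_text = ""
    return len(matches)


# Stand-ins for runs (hypothetical); the date is split across all three.
date_re = re.compile(r"\d{1,2}(?:st|nd|rd|th)?\sday\sof\s\w{3,12},\s?\d{4}")
demo_runs = [
    SimpleNamespace(text=t)
    for t in ["Signed on the 1", "0th day of May,", " 2020 in Ottawa."]
]
replaced = replace_across_runs(demo_runs, date_re, "1st day of June, 2021")
print([r.text for r in demo_runs])
# ['Signed on the 1st day of June, 2021', '', ' in Ottawa.']
```

Runs emptied this way stay in the paragraph as empty runs; removing them, or splitting runs so formatting boundaries are preserved exactly, is where the extra complexity comes in.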

It looks like there's a working solution here: https://stackoverflow.com/a/55733040/1902513, which shows by its length just how much is involved.

It's come up quite a few times before, so if you search here in SO on [python-docx] replace you'll find more on the nature of the problem.

scanny
  • Thanks! I'll have to figure out if it's worth the added complexity for the purpose of keeping my formatting. – fenix May 12 '20 at 21:07