
I'm having trouble finding and replacing all occurrences of several placeholders within the paragraphs of a Word file. It's for a gamebook, so I'm trying to substitute random entry numbers for the placeholders I used while drafting the book.

All placeholders begin with "#" (e.g. #1-5, #22-1, etc.). Set numbers, like the first entry (which will always be "1"), don't have the "#" prefix. Each placeholder is paired with its random counterpart by zipping the two lists into a tuple of (placeholder, number) pairs for reference.

It all works great for the headings, since that's a straight one-for-one swap of each heading paragraph, in order. The trouble comes when I iterate through the regular paragraphs (the second-to-last block of code): it only seems to replace the first eight numbers, then stops. I've tried setting up a loop, but it doesn't help. I'm not sure what I'm missing. Code follows.

Edit: Here's how the two lists and reference tuple are set up. In this test, only the first entry is set, with no in-paragraph reference back to it. All the others are to be randomized and replaced in-paragraph.

entryWorking: ['#1-1', '#1-2', '#1-3', '#1-4', '#1-5', '#1-6', '#1-7', '#1-8', '#2', '#2-1', '#2-2', '#2-3', '#2-4', '#2-5', '#2-6', '#2-7', '#16', '#17', '#3', '#3-1', '#3-2', '#3-3', '#3-4', '#3-5', '#3-6', '#3-8', '#3-9']

entryNumbers: ['2', '20', '12', '27', '23', '4', '11', '16', '26', '7', '25', '5', '3', '15', '17', '6', '18', '22', '10', '21', '19', '13', '28', '8', '14', '9', '24']

reference: (('#1-1', '2'), ('#1-2', '20'), ('#1-3', '12'), ('#1-4', '27'), ('#1-5', '23'), ('#1-6', '4'), ('#1-7', '11'), ('#1-8', '16'), ('#2', '26'), ('#2-1', '7'), ('#2-2', '25'), ('#2-3', '5'), ('#2-4', '3'), ('#2-5', '15'), ('#2-6', '17'), ('#2-7', '6'), ('#16', '18'), ('#17', '22'), ('#3', '10'), ('#3-1', '21'), ('#3-2', '19'), ('#3-3', '13'), ('#3-4', '28'), ('#3-5', '8'), ('#3-6', '14'), ('#3-8', '9'), ('#3-9', '24'))

Thanks for the assist.

import sys, os, random
from docx import *

entryWorking = [] # The placeholder entries created for the draft gamebook


# Identify all paragraphs with a specific heading style (e.g. 'Heading 2')
def iter_headings( paragraphs, heading ) :
    for paragraph in paragraphs :
        if paragraph.style.name.startswith( heading ) :
            yield paragraph


# Open the .docx file
document = Document( 'TestFile.docx' )


# Search document for unique placeholder entries (must have a unique heading style)
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    entryWorking.append( heading.text )


# Create list of randomized gamebook entry numbers
entryNumbers = [ i for i in range( len ( entryWorking ) + 1 ) ]

# Remove unnecessary entry zero (extra added above to compensate)
entryNumbers.remove( 0 )

# Convert to strings
entryNumbers = [ str( x ) for x in entryNumbers ]


# Identify pre-set entries (such as Entry 1), and remove from both lists
# This avoids pre-set numbers being replaced (i.e. they remain as is in the .docx)
# Pre-set entry numbers must _not_ have the "#" prefix in the .docx
for string in entryWorking :
    if string[ 0 ] != '#' :
        entryWorking.remove( string )
        if string in entryNumbers :
            entryNumbers.remove( string )

# Shuffle new entry numbers
random.shuffle( entryNumbers )


# Create tuple list of placeholder entries paired with random entry
reference = tuple( zip( entryWorking, entryNumbers ) )


# Replace placeholder headings with assigned randomized entry
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    for entry in reference :
        if heading.text == entry[ 0 ] :
            heading.text = entry[ 1 ]


# Search through paragraphs for placeholders and replace with randomized entry
for paragraph in document.paragraphs :
    for run in paragraph.runs :
        for entry in reference :
            if run.text == entry[ 0 ] :
                run.text = entry [ 1 ]

                        
# Save the new document with final entries
document.save('Output.docx')
Chimera1013
  • Print out the paragraphs in the offending loop and see if it's finding them all. A paragraph that sits inside revision marks, for example, won't be found in document.paragraphs. – scanny Dec 06 '21 at 22:27
  • Thanks for the suggestion. I checked and it's finding them all. The document is clean. – Chimera1013 Dec 06 '21 at 22:41
  • Hmm, I printed the runs and it seems a lot of the placeholders (e.g. #2-1) are broken up across runs. So the code is probably working, but it's only catching the whole or partial placeholders that have # and a number on the same line. Interesting--that explains some oddities I found when using the .text.replace() feature in other tests. I'm going to have to solve this run issue. I do recall seeing code elsewhere that worked on this issue... – Chimera1013 Dec 07 '21 at 00:14

2 Answers


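In Word, runs break at arbitrary locations in the text, so a placeholder such as #2-1 can end up split across two or more runs. Comparing run.text against the whole placeholder then finds nothing, which is why only some of the entries were replaced.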
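To make the failure concrete, here is a minimal sketch that stands in for a paragraph with a plain list of strings, one per run. The run boundaries are invented for illustration; where Word actually splits a run depends on the document's editing history.

```python
# Each string stands in for one run's text; the split points are hypothetical.
runs = ["Turn to ", "#2", "-1", " or go back to ", "#16", "."]
reference = (("#2-1", "7"), ("#16", "18"))

# Roughly what paragraph.text reports: the run texts joined together.
paragraph_text = "".join(runs)

# Run-level comparison (as in the question): only whole-run matches are found.
found_by_run = [old for old, _ in reference if old in runs]

# Paragraph-level search: sees a placeholder even when it spans several runs.
found_by_paragraph = [old for old, _ in reference if old in paragraph_text]

print(found_by_run)        # ['#16']
print(found_by_paragraph)  # ['#2-1', '#16']
```

The search therefore has to happen on the paragraph text, with the result mapped back onto the individual runs.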

You might be interested in the links in this answer that demonstrate the (surprisingly complex) work required to do this sort of thing in the general case:

How to use python-docx to replace text in a Word document and save

There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

This one will replace a regex-match with a replacement str. The replacement string will appear formatted the same as the first character of the matched string.

This one will isolate a run such that some formatting can be applied to that word or phrase, like highlighting each occurrence of "foobar" in the text or perhaps making it bold or appear in a larger font.
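The idea behind isolating a run can be sketched on plain strings. This is not the function from the GitHub site: `isolate_span` is a hypothetical name, the run texts are made up, and a real version would also have to create new runs in the paragraph and copy the character formatting across, which is ignored here.

```python
import re


def isolate_span(runs, start, end):
    """Return a new list of run texts in which characters [start, end) of the
    joined text sit alone in a single element."""
    before, middle, after = [], "", []
    pos = 0
    for text in runs:
        run_start = pos
        pos += len(text)
        # Split this run into the parts before, inside, and after the span.
        head = text[: max(0, min(len(text), start - run_start))]
        body = text[max(0, start - run_start) : max(0, end - run_start)]
        tail = text[max(0, end - run_start) :]
        if head:
            before.append(head)
        middle += body
        if tail:
            after.append(tail)
    return before + [middle] + after


# Hypothetical paragraph whose placeholder is split across two runs.
runs = ["Turn to #2", "-1 now."]
match = re.search(re.escape("#2-1"), "".join(runs))
isolated = isolate_span(runs, match.start(), match.end())

print(isolated)  # ['Turn to ', '#2-1', ' now.']
```

Once the match occupies a run of its own, formatting such as bold or a highlight can be applied to that run without touching the surrounding text.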

Fortunately it's usually copy-pastable with good results :)

scanny
  • Perfect, I found a couple of those last night and am going to do some experimenting today. Thanks! – Chimera1013 Dec 07 '21 at 13:13
  • Hi scanny: Is there a way to associate a hyperlink with each replacement as part of the paragraph_replace_text? I've figured out all the code to add bookmarks and hyperlinks but am having trouble getting the hyperlinks to generate in the right spot, separately or as part of this function. – Chimera1013 Dec 10 '21 at 23:37
  • @Chimera1013 that's really a separate question. If you post it as a new question I'll have a look, just make sure to use the `python-docx` tag and I'll see it on my feed. – scanny Dec 11 '21 at 00:11
  • Thanks, @scanny. I just posted a separate question with the #python-docx tag. – Chimera1013 Dec 12 '21 at 01:51

Thanks, scanny, for the assist!

One last issue I found after getting it working was to add a "#" suffix after each reference number to ensure they were unique (e.g. the random entry for #2 didn't get subbed in for #2-1).
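For anyone who would rather not edit the placeholders in the document, another option is to replace the longest placeholders first. The sketch below shows the collision and that alternative on a plain string; the sample sentence and the `replace_in_order` helper are made up for illustration.

```python
import re

reference = (("#2", "26"), ("#2-1", "7"), ("#16", "18"))
text = "From #2 go to #2-1, then #16."


def replace_in_order(text, pairs):
    """Apply each (placeholder, number) replacement in the order given."""
    for old, new in pairs:
        text = re.sub(re.escape(old), new, text)
    return text


# '#2' is handled first and also consumes the start of '#2-1'.
naive = replace_in_order(text, reference)

# Longest placeholders first, so '#2-1' is gone before '#2' is searched for.
longest_first = sorted(reference, key=lambda pair: len(pair[0]), reverse=True)
fixed = replace_in_order(text, longest_first)

print(naive)  # From 26 go to 26-1, then 18.
print(fixed)  # From 26 go to 7, then 18.
```

This works here because the replacement numbers never contain "#", so a replaced entry can't be matched again by a later, shorter placeholder.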

Working code below.

import random, re
from docx import Document



# Identify all paragraphs with a specific heading style (e.g. 'Heading 2')
def iter_headings( paragraphs, heading ) :
    for paragraph in paragraphs :
        if paragraph.style.name.startswith( heading ) :
            yield paragraph



def paragraph_replace_text( paragraph, regex, replace_str ) : # Credit to scanny on GitHub
    """Return `paragraph` after replacing all matches for `regex` with `replace_str`.

    `regex` is a compiled regular expression prepared with `re.compile(pattern)`
    according to the Python library documentation for the `re` module.
    """
    
    # --- a paragraph may contain more than one match, loop until all are replaced ---
    while True :
        text = paragraph.text
        
        match = regex.search( text )

        if not match :
            break


        # --- when there's a match, we need to modify run.text for each run that
        # --- contains any part of the match-string.
        runs = iter( paragraph.runs )
        start, end = match.start(), match.end()


        # --- Skip over any leading runs that do not contain the match ---
        for run in runs :
            run_len = len( run.text )

            if start < run_len :
                break

            start, end = start - run_len, end - run_len


        # --- Match starts somewhere in the current run. Replace match-str prefix
        # --- occurring in this run with entire replacement str.
        run_text = run.text

        run_len = len( run_text )

        run.text = "%s%s%s" % ( run_text[ :start ], replace_str, run_text[ end: ] )

        end -= run_len  # --- note this is run-len before replacement ---

        # --- Remove any suffix of match word that occurs in following runs. Note that
        # --- such a suffix will always begin at the first character of the run. Also
        # --- note a suffix can span one or more entire following runs.
        for run in runs :  # --- next and remaining runs, uses same iterator ---
            if end <= 0 :
                break

            run_text = run.text

            run_len = len( run_text )

            run.text = run_text[ end: ]

            end -= run_len

    # --- optionally get rid of any "spanned" runs that are now empty. This
    # --- could potentially delete things like inline pictures, so use your judgement.
    # for run in paragraph.runs :
    #     if run.text == "" :
    #         r = run._r
    #         r.getparent().remove( r )

    return paragraph


# NOTE: Replace 'Doc.docx' with your filename
# Open the .docx file
document = Document( 'Doc.docx' )


# Search document for unique placeholder entries (must have a unique heading style)
entryWorking = [] # The placeholder entries created for the draft gamebook


# NOTE: Replace 'Heading 2' with your entry number heading style
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    entryWorking.append( heading.text )


# Create list of gamebook entry numbers (1..N) as strings, shuffled further down
entryNumbers = [ str( i ) for i in range( 1, len( entryWorking ) + 1 ) ]


# Identify pre-set entries (such as Entry 1), and remove from both lists
# This avoids pre-set numbers being replaced (i.e. they remain as is in the .docx)
# Pre-set entry numbers must _not_ have the "#" prefix in the .docx
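# Build new lists rather than removing items from a list while iterating over it,
# which silently skips the element that follows each removal
presetEntries = [ s for s in entryWorking if not s.startswith( '#' ) ]
entryWorking = [ s for s in entryWorking if s.startswith( '#' ) ]
entryNumbers = [ n for n in entryNumbers if n not in presetEntries ]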


# Shuffle new entry numbers
random.shuffle( entryNumbers )


# Create tuple list of placeholder entries paired with random entry
reference = tuple( zip( entryWorking, entryNumbers ) )


# Replace placeholder headings with assigned randomized entry
for heading in iter_headings( document.paragraphs, 'Heading 2' ) :
    for entry in reference :
        if heading.text == entry[ 0 ] :
            heading.text = entry[ 1 ]


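# Replace placeholders inside body paragraphs, even when split across runs
for paragraph in document.paragraphs :
    for entry in reference :
        if entry[ 0 ] in paragraph.text :
            # Escape the placeholder so it is always matched literally
            regex = re.compile( re.escape( entry[ 0 ] ) )
            paragraph_replace_text( paragraph, regex, entry[ 1 ] )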

                        
# Save the new document with final entries
document.save('Output.docx')
Chimera1013