The oodocx module mentioned on the same page refers the user to an /examples folder that does not seem to be there.
I have read the documentation of python-docx 0.7.2, plus everything I could find on Stack Overflow on the subject, so please believe that I have done my “homework”.

Python is the only language I know (beginner+, maybe intermediate), so please do not assume any knowledge of C, Unix, xml, etc.

Task: Open an MS Word 2007+ document with a single line of text in it (to keep things simple) and replace any “key” word from Dictionary that occurs in that line of text with its dictionary value. Then save and close the document, keeping everything else the same.

Line of text (for example) “We shall linger in the chambers of the sea.”

from docx import Document

document = Document('/Users/umityalcin/Desktop/Test.docx')

Dictionary = {'sea': 'ocean'}

sections = document.sections
for section in sections:
    print(section.start_type)

#Now, I would like to navigate, focus on, get to, whatever to the section that has my
#single line of text and execute a find/replace using the dictionary above.
#then save the document in the usual way.

document.save('/Users/umityalcin/Desktop/Test.docx')

I am not seeing anything in the documentation that allows me to do this. Maybe it is there, but I don't get it because not everything is spelled out at my level.

I have followed other suggestions on this site and have tried to use earlier versions of the module (https://github.com/mikemaccana/python-docx) that are supposed to have "methods like replace, advReplace", as follows: I open the source code in the Python interpreter and add the following at the end (this is to avoid clashes with the already installed version 0.7.2):

document = opendocx('/Users/umityalcin/Desktop/Test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    if word in Dictionary.keys():
        print "found it", Dictionary[word]
        document = replace(document, word, Dictionary[word])
savedocx(document, coreprops, appprops, contenttypes, websettings,
    wordrelationships, output, imagefiledict=None) 

Running this produces the following error message:

NameError: name 'coreprops' is not defined

Maybe I am trying to do something that cannot be done, but I would appreciate your help if I am missing something simple.

If this matters, I am using the 64 bit version of Enthought's Canopy on OSX 10.9.3

Deduplicator
user2738815

11 Answers

74

UPDATE: There are a couple of paragraph-level functions that do a good job of this and can be found on the GitHub site for python-docx.

  1. This one will replace a regex match with a replacement str. The replacement string will be formatted the same as the first character of the matched string.
  2. This one will isolate a run so that formatting can be applied to that word or phrase, such as highlighting each occurrence of "foobar" in the text, or making it bold or larger.

The current version of python-docx does not have a search() function or a replace() function. These are requested fairly frequently, but an implementation for the general case is quite tricky and it hasn't risen to the top of the backlog yet.

Several folks have had success though, getting done what they need, using the facilities already present. Here's an example. It has nothing to do with sections by the way :)

for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'

To search in Tables as well, you would need to use something like:

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    paragraph.text = paragraph.text.replace("sea", "ocean")

If you pursue this path, you'll probably discover pretty quickly what the complexities are. If you replace the entire text of a paragraph, that will remove any character-level formatting, like a word or phrase in bold or italic.
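
The formatting loss described above can often be avoided by editing the runs rather than the paragraph text. The following is a minimal sketch, not part of python-docx: `replace_across_runs` is a hypothetical helper that treats the runs of one paragraph as a single string, so a match split over several runs is still found. It assumes only that each run object has a writable `.text` attribute, which python-docx `Run` objects do; the demo uses `SimpleNamespace` stand-ins so it runs without a .docx file. The replacement takes on the formatting of the run in which the match starts.

```python
from types import SimpleNamespace


def replace_across_runs(runs, old, new):
    """Replace each occurrence of `old` in the combined text of `runs`.

    Hypothetical helper: `runs` is any sequence of objects with a writable
    `.text` attribute (for example `paragraph.runs` in python-docx).
    Returns the number of replacements made.
    """
    if not old:
        return 0
    count = 0
    search_from = 0
    while True:
        texts = [run.text for run in runs]
        start = "".join(texts).find(old, search_from)
        if start < 0:
            return count
        end = start + len(old)
        offset = 0
        placed = False  # the replacement goes into the first overlapping run
        for run, text in zip(runs, texts):
            run_start, offset = offset, offset + len(text)
            if offset <= start or run_start >= end:
                continue  # this run holds no part of the match
            head = text[:max(start - run_start, 0)]
            tail = text[end - run_start:]
            run.text = head + ("" if placed else new) + tail
            placed = True
        count += 1
        search_from = start + len(new)  # never rescan the inserted text


# Demo with stand-in runs: "sea" is split over three runs.
runs = [SimpleNamespace(text="chambers of the s"),
        SimpleNamespace(text="e"),
        SimpleNamespace(text="a.")]
replaced = replace_across_runs(runs, "sea", "ocean")
print(replaced, [run.text for run in runs])
```

With python-docx this would be called as `replace_across_runs(paragraph.runs, 'sea', 'ocean')` for each paragraph. Text that lives outside ordinary runs, such as hyperlinks or fields, may not be covered by this sketch.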

By the way, the code from @wnnmaw's answer is for the legacy version of python-docx and won't work at all with versions after 0.3.0.

scanny
  • Thanks for the clarification; saves a lot of time. I will be among the multitude waiting for those functions to rise to the top of the list, while trying to do what I need to do with the "legacy" version. By the way, is there anything in the current version that would enable me to delete the word "sea" in the paragraph and insert another word in its place? Probably not, because if those were available, even I could write a "replace" function...Regards – user2738815 Jul 17 '14 at 23:04
  • Quite right. It would be easy if that were the case. The problem arises because "sea" may be in a `<w:t>` element by itself, split across two or even three, and may even appear in different runs (`<w:r>` elements, parent to the `<w:t>` element). Replacing a word requires recomposing the elements that contain it. There are a lot of possible cases and rules governing how you put it back together without screwing it up. If the case is simple, you can get by with a simple rewrite of the text, but otherwise it's a fairly big job. Don't forget to vote up and accept answer if you're satisfied :) – scanny Jul 18 '14 at 05:43
  • Apparently, I can't vote because I lack the "reputation", but I appreciate your help, and I checked to accept the answer. Regards – user2738815 Jul 18 '14 at 13:58
  • For reference - this is the actual discussion on Github about that issue: https://github.com/python-openxml/python-docx/issues/30 – Grzegorz Oledzki Apr 12 '15 at 13:16
  • Is there any solution for nested table? – user3260061 Nov 20 '22 at 09:35
34

I needed something to replace regular expressions in docx. I took scanny's answer. To handle style, I used the answer from "Python docx Replace string in paragraph while keeping style", added a recursive call to handle nested tables, and came up with something like this:

import re
from docx import Document

def docx_replace_regex(doc_obj, regex , replace):

    for p in doc_obj.paragraphs:
        if regex.search(p.text):
            inline = p.runs
            # Loop added to work with runs (strings with same style)
            for i in range(len(inline)):
                if regex.search(inline[i].text):
                    text = regex.sub(replace, inline[i].text)
                    inline[i].text = text

    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace_regex(cell, regex , replace)



regex1 = re.compile(r"your regex")
replace1 = r"your replace string"
filename = "test.docx"
doc = Document(filename)
docx_replace_regex(doc, regex1 , replace1)
doc.save('result1.docx')

To iterate over dictionary:

for word, replacement in dictionary.items():
    word_re=re.compile(word)
    docx_replace_regex(doc, word_re , replacement)

Note that this solution replaces a regex match only if the whole match has the same style in the document, that is, if it lies within a single run.

Also, if text is edited after saving, text with the same style might end up in separate runs. For example, if you open a document that has the string "testabcd", change it to "test1abcd" and save, then even though it is all the same style there are 3 separate runs, "test", "1", and "abcd". In this case replacing test1 won't work.

Word splits the text this way to track changes in the document. To merge it into one run, in Word you need to go to "Options", "Trust Center", and in "Privacy Options" untick "Store random numbers to improve combine accuracy", then save the document.
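
For the dictionary case, two details are easy to trip over: keys such as `${NAME}` contain regex metacharacters, and `regex.sub` raises a `TypeError` when the replacement is not a string. Below is a minimal sketch, assuming the keys are meant literally; `build_replacer` is a hypothetical name, and only the standard `re` module is used.

```python
import re


def build_replacer(mapping):
    """Return (pattern, repl) for replacing every key of `mapping` literally.

    Hypothetical helper: keys and values are converted to str, and keys are
    escaped so that characters like "$" or "." have no special meaning.
    """
    lookup = {str(key): str(value) for key, value in mapping.items()}
    if not lookup:
        raise ValueError("mapping must not be empty")
    # Longest keys first, so "seaside" wins over "sea" at the same position.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def repl(match):
        # Look up the replacement for the key that actually matched.
        return lookup[match.group(0)]

    return pattern, repl


pattern, repl = build_replacer({"sea": "ocean", "${SALARY}": 10000})
print(pattern.sub(repl, "The sea pays ${SALARY}."))
```

Because `regex.sub` accepts a function as the replacement, the pair can be passed straight to the function above as `docx_replace_regex(doc, pattern, repl)`. The whole dictionary is then applied in one pass, so a replacement value is never itself replaced by a later key.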

Community
szum
  • This works within the limits you mention, and I upvoted it. However, it would be useful to edit your code to show how you can pass it a dictionary. I checked it; it can be done, but needs twiddling with the regex. I don't want to post a separate answer. Thanks – user2738815 Mar 23 '17 at 02:42
  • Updated with a dictionary example and added a description of how to merge edits into one run. Cheers. – szum Mar 24 '17 at 10:54
  • Thank you. I am using 2.7, and `word_re = re.compile(word)` raises an error. Instead, `word_re = re.compile(str(word))` works. I don't know if it is a version-related difference because I don't know how Python 3 works. – user2738815 Mar 25 '17 at 05:06
  • Thanks @szum for your solution, it works perfectly, but I notice that it ignores WordArt text or text inside text boxes. Can you build on it to add support for that? – Johnn Kaita Dec 16 '20 at 12:45
  • I have below error: ~~~ Traceback (most recent call last): gen_docx(input, dictionary, output) docx_replace_regex(document, word_re, replacement) text = regex.sub(replace, inline[i].text) template = _compile_repl(template, pattern) return sre_parse.parse_template(repl, pattern) s = Tokenizer(source) string = str(string, 'latin1') TypeError: decoding to str: need a bytes-like object, int found ~~~ Do you know how to solve? Thanks. – Steven Lee Mar 18 '21 at 06:18
31

Sharing a small script I wrote. It helps me generate legal .docx contracts with variables while preserving the original style.

pip install python-docx

Example:

from docx import Document
import os


def main():
    template_file_path = 'employment_agreement_template.docx'
    output_file_path = 'result.docx'

    variables = {
        "${EMPLOEE_NAME}": "Example Name",
        "${EMPLOEE_TITLE}": "Software Engineer",
        "${EMPLOEE_ID}": "302929393",
        "${EMPLOEE_ADDRESS}": "דרך השלום מנחם בגין דוגמא",
        "${EMPLOEE_PHONE}": "+972-5056000000",
        "${EMPLOEE_EMAIL}": "example@example.com",
        "${START_DATE}": "03 Jan, 2021",
        "${SALARY}": "10,000",
        "${SALARY_30}": "3,000",
        "${SALARY_70}": "7,000",
    }

    template_document = Document(template_file_path)

    for variable_key, variable_value in variables.items():
        for paragraph in template_document.paragraphs:
            replace_text_in_paragraph(paragraph, variable_key, variable_value)

        for table in template_document.tables:
            for col in table.columns:
                for cell in col.cells:
                    for paragraph in cell.paragraphs:
                        replace_text_in_paragraph(paragraph, variable_key, variable_value)

    template_document.save(output_file_path)


def replace_text_in_paragraph(paragraph, key, value):
    if key in paragraph.text:
        inline = paragraph.runs
        for item in inline:
            if key in item.text:
                item.text = item.text.replace(key, value)


if __name__ == '__main__':
    main()


Jossef Harush Kadouri
20

I got a lot of help from the earlier answers, but for me the code below works the way the simple find-and-replace function in Word does. Hope this helps.

#!pip install python-docx
#start from here if python-docx is installed
from docx import Document
#open the document
doc=Document('./test.docx')
Dictionary = {"sea": "ocean", "find_this_text":"new_text"}
for i in Dictionary:
    for p in doc.paragraphs:
        if p.text.find(i)>=0:
            p.text=p.text.replace(i,Dictionary[i])
#save changed document
doc.save('./test.docx')

The above solution has limitations: 1) the paragraph containing "find_this_text" will become plain text without any formatting, 2) content controls that are in the same paragraph as "find_this_text" will be deleted, and 3) any "find_this_text" inside content controls or tables will not be changed.

poin
2

For the table case, I had to modify @scanny's answer to:

for table in doc.tables:
    for col in table.columns:
        for cell in col.cells:
            for p in cell.paragraphs:

to make it work. Indeed, this does not seem to work with the current state of the API:

for table in document.tables:
    for cell in table.cells:

Same problem with the code from here: https://github.com/python-openxml/python-docx/issues/30#issuecomment-38658149

Basj
1

The Office Dev Centre has an entry in which a developer has published (MIT licensed at this time) a description of a couple of algorithms that appear to suggest a solution for this (albeit in C#, so they would require porting): MS Dev Centre posting

Soferio
  • Very interesting Soferio! Thanks very much for mentioning this; I will study it closely for possible inclusion in the library :) – scanny May 02 '19 at 23:52
1

The library python-docx-template is pretty useful for this. It's perfect to edit Word documents and save them back to .docx format.

A. Attia
0

The problem with your second attempt is that you haven't defined the parameters that savedocx needs. You need to do something like this before you save:

relationships = docx.relationshiplist()
title = "Document Title"
subject = "Document Subject"
creator = "Document Creator"
keywords = []

coreprops = docx.coreproperties(title=title, subject=subject, creator=creator,
                       keywords=keywords)
app = docx.appproperties()
content = docx.contenttypes()
web = docx.websettings()
word = docx.wordrelationships(relationships)
output = r"path\to\where\you\want\to\save"
wnnmaw
  • Thank you very much for responding. I immediately added your code before the "save", only changing the output path to "/Users/umityalcin/Desktop/" (I assume that leaving title etc. as is does not matter) However, I am running into additional problems. First, since I had not imported the current docx module (0.7.2) to avoid errors, the interpreter did not recognize the "docx." prefix. So I imported the module--now I am getting this: AttributeError: 'module' object has no attribute 'relationshiplist'. Thanks for your time and help. – user2738815 Jul 17 '14 at 14:33
  • Ah, right, well apparently reading is not my strong suit :P If you have all the functions of `docx` in the scope of your program, you don't need the `docx.` prefix, so try removing it – wnnmaw Jul 17 '14 at 14:51
  • Well, at least coding is not your weak suit; it seems to be mine :) After following your suggestion, I still managed to get this error: savedocx(document, coreprops, appprops, contenttypes, websettings, wordrelationships, output, imagefiledict) 1061 ) 1062 -> 1063 assert os.path.isdir(template_dir) 1064 docxfile = zipfile.ZipFile( 1065 output, mode='w', compression=zipfile.ZIP_DEFLATED) AssertionError: – user2738815 Jul 17 '14 at 22:57
0

The API in python-docx has changed again...

For the sanity of everyone coming here:

import datetime
import os
from decimal import Decimal
from typing import NamedTuple

from docx import Document
from docx.document import Document as nDocument


class DocxInvoiceArg(NamedTuple):
  invoice_to: str
  date_from: str
  date_to: str
  project_name: str
  quantity: float
  hourly: int
  currency: str
  bank_details: str


class DocxService():
  tokens = [
    '@INVOICE_TO@',
    '@IDATE_FROM@',
    '@IDATE_TO@',
    '@INVOICE_NR@',
    '@PROJECTNAME@',
    '@QUANTITY@',
    '@HOURLY@',
    '@CURRENCY@',
    '@TOTAL@',
    '@BANK_DETAILS@',
  ]

  def __init__(self, replace_vals: DocxInvoiceArg):
    total = replace_vals.quantity * replace_vals.hourly
    invoice_nr = replace_vals.project_name + datetime.datetime.strptime(replace_vals.date_to, '%Y-%m-%d').strftime('%Y%m%d')
    self.replace_vals = [
      {'search': self.tokens[0], 'replace': replace_vals.invoice_to },
      {'search': self.tokens[1], 'replace': replace_vals.date_from },
      {'search': self.tokens[2], 'replace': replace_vals.date_to },
      {'search': self.tokens[3], 'replace': invoice_nr },
      {'search': self.tokens[4], 'replace': replace_vals.project_name },
      {'search': self.tokens[5], 'replace': replace_vals.quantity },
      {'search': self.tokens[6], 'replace': replace_vals.hourly },
      {'search': self.tokens[7], 'replace': replace_vals.currency },
      {'search': self.tokens[8], 'replace': total },
      {'search': self.tokens[9], 'replace': replace_vals.bank_details },
    ]
    self.doc_path_template = os.path.dirname(os.path.realpath(__file__))+'/docs/'
    self.doc_path_output = self.doc_path_template + 'output/'
    self.document: nDocument = Document(self.doc_path_template + 'invoice_placeholder.docx')


  def save(self):
    for p in self.document.paragraphs:
      self._docx_replace_text(p)
    tables = self.document.tables
    self._loop_tables(tables)
    self.document.save(self.doc_path_output + 'testiboi3.docx')

  def _loop_tables(self, tables):
    for table in tables:
      for index, row in enumerate(table.rows):
        for cell in table.row_cells(index):
          if cell.tables:
            self._loop_tables(cell.tables)
          for p in cell.paragraphs:
            self._docx_replace_text(p)

        # for cells in column.
        # for cell in table.columns:

  def _docx_replace_text(self, p):
    print(p.text)
    for el in self.replace_vals:
      if (el['search'] in p.text):
        inline = p.runs
        # Loop added to work with runs (strings with same style)
        for i in range(len(inline)):
          print(inline[i].text)
          if el['search'] in inline[i].text:
            text = inline[i].text.replace(el['search'], str(el['replace']))
            inline[i].text = text
        print(p.text)

Test case:

from django.test import SimpleTestCase
from docx.table import Table, _Rows

from toggleapi.services.DocxService import DocxService, DocxInvoiceArg


class TestDocxService(SimpleTestCase):

  def test_document_read(self):
    ds = DocxService(DocxInvoiceArg(invoice_to="""
    WAW test1
    Multi myfriend
    """,date_from="2019-08-01", date_to="2019-08-30", project_name='WAW', quantity=10.5, hourly=40, currency='USD',bank_details="""
    Paypal to:
    bippo@bippsi.com"""))

    ds.save()

Have the folders docs and docs/output/ in the same folder as DocxService.py.

Be sure to parameterize and replace stuff.

Toskan
0

As some users above have noted, one of the challenges in finding and replacing text in a Word document is retaining styles when the word spans multiple runs. This can happen if the word has several styles or if it was edited multiple times while the document was being created. Simple code that assumes a word lies completely within a single run is therefore often wrong, so the python-docx based code shared above may not work in many scenarios.

You can try the following API

https://rapidapi.com/more.sense.tech@gmail.com/api/document-filter1

This has generic code to deal with these scenarios. The API currently handles only paragraph text; text in tables is not yet supported, and I will try to add that soon.

0
import docx2txt as d2t
from docx import Document

# Extract the plain text of the source document (all formatting is lost)
all_text = d2t.process("mydata.docx")
words = ["hey", "wow"]
for word in words:
    # Replace every occurrence of each word in the extracted text
    all_text = all_text.replace(word, "your word variable")
print(all_text)
# Write the updated text into a new, unformatted document
document = Document()
document.add_paragraph(all_text)
document.save('data.docx')