I have a set of .docx documents with text modifications performed with the "track changes" functionality.

For every source.docx file in my set, I would like to programmatically run two operations:

  1. generate two documents, one with all changes rejected and the other with all changes accepted (the tricky step)
  2. convert to plaintext.

In other words, I want to run the following pipelines:

source.docx -> sources-all-changes-rejected.docx -> source-all-rejected-plaintext.txt
source.docx -> sources-all-changes-accepted.docx -> source-all-accepted-plaintext.txt

Is there a way to do this e.g. using soffice --headless?

I tried a solution inspired by "Python - Using win32com.client to accept all changes in Word Documents". That approach worked, provided I used absolute paths and saved as a txt document (see https://learn.microsoft.com/en-us/office/vba/api/word.wdsaveformat). So I have a function that takes a pathlib Path file_path and writes the plaintext documents I wanted:

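import win32com.client

def output_track_changed_version(file_path, action):
    # Setup reconstructed from the linked approach (assumed); Word needs an absolute path
    word = win32com.client.Dispatch("Word.Application")
    doc = word.Documents.Open(str(file_path.resolve()))
    doc.TrackRevisions = False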

    # Delete all comments
    if doc.Comments.Count >= 1:
        doc.DeleteAllComments()

    # Accept/reject all revisions
    doc.Revisions.AcceptAll()
    changed_text = doc.Content.Text
    doc.Undo()
    doc.Revisions.RejectAll()
    original_text = doc.Content.Text

    # [CUT: code to dump changed/original strings to file and then...]

    doc.Close(False, False, False)
    word.Application.Quit()

However, I don't want to stick to win32com.client. I would prefer a LibreOffice + Python solution that can easily be set up on Linux VMs.
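
For step 2 alone, something along these lines is what I have in mind. It is only a sketch: it assumes soffice is on the PATH, the helper names soffice_txt_command and convert_to_txt are made up for illustration, and it does not accept or reject anything, so I cannot say which version of the tracked text ends up in the output.

```python
import subprocess
from pathlib import Path


def soffice_txt_command(docx_path, out_dir):
    """Build the soffice call that converts one .docx file to plain text."""
    return [
        "soffice", "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_txt(docx_path, out_dir):
    """Run the conversion and return the expected output path.

    LibreOffice writes the result into out_dir, keeping the input
    file's stem and replacing the suffix with .txt.
    """
    subprocess.run(soffice_txt_command(docx_path, out_dir), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".txt")
```

Calling convert_to_txt(Path("source.docx"), Path("out")) would then be expected to produce out/source.txt.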

Davide Fiocco

1 Answer

I don't know whether this solves your problem, but you can use the docx library (installed with pip install python-docx) for this task.

Example:

import docx

#rejected changes
doc = docx.Document('source.docx')
doc.save('sources-all-changes-rejected.docx')

#txt
r_text = []
for par in doc.paragraphs:
    r_text.append(par.text)
r_text = '\n'.join(r_text)

filename = 'source-all-rejected-plaintext.txt'
# explicit encoding so non-ASCII text is written the same way on every platform
with open(filename, 'w', encoding='utf-8') as r_txt:
    r_txt.write(r_text)

#accepted changes
for par in doc.paragraphs:
    #do changes
    pass

#txt
a_text = []
for par in doc.paragraphs:
    a_text.append(par.text)
a_text = '\n'.join(a_text)

filename = 'source-all-accepted-plaintext.txt'
# explicit encoding so non-ASCII text is written the same way on every platform
with open(filename, 'w', encoding='utf-8') as a_txt:
    a_txt.write(a_text)

#docx
doc.save('sources-all-changes-accepted.docx')

Then you can loop through all files in the set.
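
The #do changes step is left empty above, and as far as I know python-docx has no call that accepts or rejects revisions. One possible way to get both plaintext versions is to read word/document.xml directly with the standard library. This is a rough sketch and not a full implementation: the function name tracked_text is made up, and it assumes the revisions are stored as w:ins / w:del (plus w:moveFrom / w:moveTo) wrappers around runs. It ignores formatting changes, inserted or deleted paragraph marks, headers, footnotes and text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's {uri}tag notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Accepting drops deleted text and the source side of a move;
# rejecting drops inserted text and the destination side of a move.
SKIP_IF_ACCEPTING = {W + "del", W + "moveFrom"}
SKIP_IF_REJECTING = {W + "ins", W + "moveTo"}


def _collect(node, skip, parts):
    """Append text found under node to parts, ignoring subtrees whose tag is in skip."""
    for child in node:
        if child.tag in skip:
            continue
        if child.tag in (W + "t", W + "delText"):
            parts.append(child.text or "")
        else:
            _collect(child, skip, parts)


def tracked_text(docx_file, accept):
    """Return the body text with all tracked changes accepted or rejected."""
    skip = SKIP_IF_ACCEPTING if accept else SKIP_IF_REJECTING
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for par in root.iter(W + "p"):
        parts = []
        _collect(par, skip, parts)
        lines.append("".join(parts))
    return "\n".join(lines)
```

With that, tracked_text('source.docx', accept=True) and tracked_text('source.docx', accept=False) would give the two strings to write to the .txt files. Producing the two .docx files would still need the XML to be rewritten and zipped back, which this sketch does not do.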

Rohel
  • Hey, thanks! One important thing, though: what needs to happen in the `# do changes` bit is pretty crucial. The changes are already in the source doc (made with the "track changes" functionality), and I need to reject/accept all of them before outputting the result. – Davide Fiocco Aug 05 '20 at 21:16