I have a set of .docx documents with text modifications performed with the "track changes" functionality.
For every source.docx file in my set, I would like to programmatically run two operations:
- generate two documents, one with all changes rejected and the other with all changes accepted (the tricky step)
- convert to plaintext.
In other words, I want to run the following pipelines:
source.docx -> source-all-changes-rejected.docx -> source-all-rejected-plaintext.txt
source.docx -> source-all-changes-accepted.docx -> source-all-accepted-plaintext.txt
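To pin down what the two plaintext outputs should contain, here is a minimal sketch that reads the tracked-change markup directly from the .docx XML using only the Python standard library. It is not a LibreOffice solution, only an illustration of the expected result. It assumes that insertions are wrapped in `w:ins` and deleted text is stored in `w:delText` inside `w:del`, and it ignores moves, tabs, line breaks, comments, and inserted or deleted paragraph marks. The names `docx_plaintext` and `make_demo_docx` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_plaintext(docx, accept):
    """Return the body text with tracked changes accepted or rejected."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    # Accepting drops deletions; rejecting drops insertions.
    skipped = W + ("del" if accept else "ins")
    # Text lives in w:t normally and in w:delText inside deletions.
    text_tags = {W + "t", W + "delText"}

    def collect(node, out):
        if node.tag == skipped:
            return
        if node.tag in text_tags:
            out.append(node.text or "")
        for child in node:
            collect(child, out)

    paragraphs = []
    for para in root.iter(W + "p"):
        parts = []
        collect(para, parts)
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)


def make_demo_docx():
    """Build an in-memory archive holding one paragraph with one edit."""
    body = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main"><w:body><w:p>'
        '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
        '<w:del w:id="1" w:author="a"><w:r><w:delText>old</w:delText></w:r></w:del>'
        '<w:ins w:id="2" w:author="a"><w:r><w:t>new</w:t></w:r></w:ins>'
        "</w:p></w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(docx_plaintext(make_demo_docx(), accept=True))   # Hello new
    print(docx_plaintext(make_demo_docx(), accept=False))  # Hello old
```

The demo archive contains only `word/document.xml`, which is enough for this function but is not a complete .docx that Word or LibreOffice would open.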
Is there a way to do this, e.g. using soffice --headless?
I tried a solution inspired by "Python - Using win32com.client to accept all changes in Word Documents". That approach worked, as long as I used absolute paths and saved the result as a txt document (see https://learn.microsoft.com/en-us/office/vba/api/word.wdsaveformat). So I have a function that takes a pathlib Path file_path and writes plaintext documents as I wanted:
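    import win32com.client

    def output_track_changed_version(file_path, action):
        # Setup assumed here (not shown in the original snippet):
        # start Word and open the document via its absolute path
        word = win32com.client.Dispatch("Word.Application")
        word.Visible = False
        doc = word.Documents.Open(str(file_path.resolve()))
        doc.TrackRevisions = False
        # Delete all comments
        if doc.Comments.Count >= 1:
            doc.DeleteAllComments()
        # Accept all revisions and capture the resulting text
        doc.Revisions.AcceptAll()
        changed_text = doc.Content.Text
        # Undo the accept, then reject all revisions instead
        doc.Undo()
        doc.Revisions.RejectAll()
        original_text = doc.Content.Text
        # [CUT: code to dump changed/original strings to file and then...]
        # Close without saving changes, then quit Word
        doc.Close(False, False, False)
        word.Application.Quit()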
However, I don't want to stick to win32com.client; I would prefer a LibreOffice-based solution with Python that can easily be set up on Linux VMs.
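For the plaintext half of the pipeline, a minimal sketch of driving LibreOffice from Python could look like the following. It assumes `soffice` is on the PATH and uses the `txt:Text` export filter; the helper names `build_txt_command` and `convert_to_txt` are hypothetical. It does not accept or reject tracked changes, and how pending changes show up in the exported text has not been verified here, so the tricky step remains open.

```python
import subprocess
from pathlib import Path


def build_txt_command(docx_path, out_dir):
    """Build the soffice command line that converts one file to plaintext."""
    return [
        "soffice",
        "--headless",
        "--convert-to", "txt:Text",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_txt(docx_path, out_dir):
    """Run the conversion and return the path soffice is expected to write."""
    subprocess.run(build_txt_command(docx_path, out_dir), check=True)
    # soffice keeps the input file stem and swaps the extension.
    return Path(out_dir) / (Path(docx_path).stem + ".txt")
```

Because the output keeps the input stem, producing a name such as source-all-accepted-plaintext.txt would need an extra rename after the conversion.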