Removing personal information from the comments in a word file using python

Question

I want to remove all the personal information from the comments inside a Word file.

Removing the author's name is fine; I did that using the following:

from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
core_properties.author = ""
document.save('new-filename.docx')

But this is not what I need. I want to remove the name of every person who commented inside that Word file.

The way we do it manually is by going to Preferences -> Security -> "Remove personal information from this file on save".

I don't have this package installed, but a general thing you can do is run: `core_properties.__dict__` (note double `_`) which will show you what properties you have to work with. — nbryans, Jun 21 '16 at 21:48
@nbryans, I used the following code, print(core_properties.__dict__)....but it just gave me {'_element': ' at 0x10a64cdb8>} — sunil pawar, Jun 21 '16 at 21:58
What personal information would you like to remove if the core_properties [are not enough](http://stackoverflow.com/questions/37955062/removing-personal-information-from-word-file-using-python#comment63359440_37955295)? — Jezor, Jun 21 '16 at 22:29
@Jezor, To be more clear, if a person other than the author opens a word file and writes some comments on it, we need to remove the name of the person who commented it but we still want the comments that he made. Basically I am trying to design a blind peer review process — sunil pawar, Jun 21 '16 at 22:31
Could you please update your question? I'm working on the answer (: — Jezor, Jun 21 '16 at 22:49
@Jezor, thank you so much and I have edited the question to be more specific — sunil pawar, Jun 22 '16 at 14:39

score 5 · Answer 1 · answered Jun 21 '16 at 22:00

The core properties recognised by the CoreProperties class are listed in the official documentation: http://python-docx.readthedocs.io/en/latest/api/document.html#coreproperties-objects

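To overwrite them, you can set each one to an empty string, as you did for the author metadata. Note that `created` and `revision` are not string properties (python-docx expects a datetime and a positive integer respectively), so they are left out of the list here:

from docx import Document

document = Document('sampleFile.docx')
core_properties = document.core_properties
# Only string-valued properties; assigning "" to "created" or "revision" raises ValueError.
meta_fields = ["author", "category", "comments", "content_status", "identifier", "keywords", "language", "last_modified_by", "subject", "title", "version"]
for meta_field in meta_fields:
    setattr(core_properties, meta_field, "")
document.save('new-filename.docx')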
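
A minimal sketch of a variation, assuming you would rather not maintain the list by hand: clear only the public attributes whose current value is a string. The helper name `blank_string_properties` is hypothetical and not part of python-docx; it works on any object, and date or integer properties are skipped because their values are not strings.

```python
def blank_string_properties(props):
    """Set every public, string-valued attribute of props to "".

    Returns the names of the attributes that were cleared.
    """
    cleared = []
    for name in dir(props):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        if isinstance(getattr(props, name), str):
            setattr(props, name, "")
            cleared.append(name)
    return cleared


# Usage with python-docx (assumes the package is installed):
#   from docx import Document
#   document = Document('sampleFile.docx')
#   blank_string_properties(document.core_properties)
#   document.save('new-filename.docx')
```

This only touches the core properties, so the names attached to comments are still untouched.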

answered Jun 21 '16 at 22:00

marcanuy

All the attributes in core_properties are not useful in what I am trying to do. Is there any other attribute for the Document that would help to remove the personal information – sunil pawar Jun 21 '16 at 22:27
Or instead of making the list by hand, you can extract the attributes from the document using: `meta_fields = [attr for attr in dir(core_properties) if isinstance(getattr(core_properties, attr), str) and not attr.startswith("_")]` (it looks for these attributes of document that are public strings). – Jezor Jun 21 '16 at 22:28
To be more clear, if a person other than the author opens a word file and writes some comments on it, we need to remove the name of the person who commented it but we still want the comments that he made. Basically I am trying to design a blind peer review process. – sunil pawar Jun 21 '16 at 22:30

Jezor · Accepted Answer · 2016-06-22T17:27:40.903

5

If you want to remove personal information from the comments in a .docx file, you'll have to dig into the file itself.

A .docx file is just a .zip archive of Word-specific files. We need to overwrite one of its internal files, and the easiest way I could find is to read all of the files into memory, change whatever has to change, and write everything to a new archive.

import re
import os
from zipfile import ZipFile, ZIP_DEFLATED

docx_file_name = '/path/to/your/document.docx'

files = dict()

# Read every internal file as raw bytes and store it in the "files" dictionary.
# Keeping the data as bytes (instead of decoding it to text) leaves binary
# parts such as embedded images intact.
with ZipFile(docx_file_name, 'r') as document_as_zip:
    for internal_file in document_as_zip.infolist():
        files[internal_file.filename] = document_as_zip.read(internal_file.filename)

# If there are any comments.
if "word/comments.xml" in files:
    # Change every author to "Unknown Author".
    files["word/comments.xml"] = re.sub(
        rb'w:author="[^"]*"',
        b'w:author="Unknown Author"',
        files["word/comments.xml"],
    )

# Remove the old .docx file.
os.remove(docx_file_name)

# Write all of the files to a new, compressed archive with the same name.
with ZipFile(docx_file_name, 'w', ZIP_DEFLATED) as document_as_zip:
    for internal_file_name, data in files.items():
        document_as_zip.writestr(internal_file_name, data)
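
The same idea can be sketched as a reusable function that writes to a separate output file, so the original is not deleted before the new archive exists. This is a sketch under assumptions, not a definitive implementation: the function name `anonymize_docx` and its parameters are hypothetical, and treating `w:initials` as personal information is an assumption (the code above only rewrites `w:author`). It only touches `word/comments.xml`; names stored elsewhere in the archive, for example by tracked changes in the document body, are not handled.

```python
import re
from zipfile import ZipFile, ZIP_DEFLATED

# Attribute patterns inside word/comments.xml. Handling w:initials is an
# assumption of this sketch; only w:author is discussed in the thread.
AUTHOR_RE = re.compile(rb'(w:author=")[^"]*(")')
INITIALS_RE = re.compile(rb'(w:initials=")[^"]*(")')


def anonymize_docx(src_path, dst_path, author=b"Unknown Author", initials=b"UA"):
    """Copy src_path to dst_path, replacing reviewer names in the comments part.

    src_path and dst_path must be different files.
    """
    with ZipFile(src_path, "r") as src, ZipFile(dst_path, "w", ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)  # raw bytes, never decoded
            if info.filename == "word/comments.xml":
                # Keep the attribute name and quotes, swap only the value.
                data = AUTHOR_RE.sub(lambda m: m.group(1) + author + m.group(2), data)
                data = INITIALS_RE.sub(lambda m: m.group(1) + initials + m.group(2), data)
            dst.writestr(info.filename, data)
```

For example, `anonymize_docx('review.docx', 'review-blind.docx')` (both file names are placeholders) leaves the original in place until you have checked the result in Word.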

edited Jun 22 '16 at 17:27

answered Jun 22 '16 at 00:09

Jezor

Can you please tell me where you found the updateable_zipfile package? I searched for it and the only thing I found is the zipfile package, not updateable_zipfile. Any advice on how to get it? Thanks a lot – sunil pawar Jun 22 '16 at 14:50
Okay, so basically updateable_zipfile is a class inherited from zipfile package. – sunil pawar Jun 22 '16 at 14:56
can you pls tell me how to install the zipfile package – sunil pawar Jun 22 '16 at 15:03
[Here is the code of `UpdateableZipFile`](http://stackoverflow.com/a/35435548/5922757), I simply created a new file called `updateable_zipfile.py`, copied and pasted it inside, and then imported it in my example script. – Jezor Jun 22 '16 at 15:42
UserWarning: Duplicate name: 'word/comments.xml' bytes, compress_type=compress_type) – sunil pawar Jun 22 '16 at 15:55
Thank you for the instructions, I tried but it gave me the above error – sunil pawar Jun 22 '16 at 15:56
Yeah I know, it's giving me the same error. But try opening the document in word, it should work even after the error occurs. I have [posted a comment](http://stackoverflow.com/questions/4653768/overwriting-file-in-ziparchive/35435548#comment63392252_35435548) to answer from which the code is from. – Jezor Jun 22 '16 at 16:00
also it corrupts the file, after recovering the file the name of the person who commented is still there – sunil pawar Jun 22 '16 at 16:02
Then you'll have to duplicate all of the files from `.docx` archive and at the same time replace `word/comments.xml`. It worked for me in libre office though. – Jezor Jun 22 '16 at 16:02
thanks a lot, I tried to open the document in libre and it works like charm. But it doesn't work with Microsoft office word – sunil pawar Jun 22 '16 at 16:17
I edited my answer and added code to delete the file before writing it again as found [here](http://stackoverflow.com/questions/513788/delete-file-from-zipfile-with-the-zipfile-module). – Jezor Jun 22 '16 at 16:33
It says "cannot import name BadZipFile" – sunil pawar Jun 22 '16 at 16:44
Okay, I solved the problem with BadZipFile, but now it gives me the following error: "There is no item named 'word/comments.xml' in the archive" – sunil pawar Jun 22 '16 at 16:49
Uf, it took me some effort but I did it! Now my solution uses only ZipFile, so nothing non-standard. It copies all files to memory, replaces the contents of the comments file and puts all the data into a new .docx file. Probably not an optimal solution, but nothing else worked. – Jezor Jun 22 '16 at 17:19
Hey, I tried one thing: I ran your first script, converted the output .docx file to .odt, and then converted the .odt back to .docx. Guess what, it worked with no errors. – sunil pawar Jun 22 '16 at 17:32
thanks a lot for all your help @Jezor, it means a lot to me – sunil pawar Jun 22 '16 at 17:33
I just need to find a script now that will convert thousands of .docx file to .odt and vice versa – sunil pawar Jun 22 '16 at 17:33
[I have updated the answer](http://stackoverflow.com/questions/37955062/removing-personal-information-from-the-comments-in-a-word-file-using-python/37956562?noredirect=1#comment63395243_37956562), it works with `.docx` files now (: – Jezor Jun 22 '16 at 17:35
You are a genius @Jezor – sunil pawar Jun 22 '16 at 17:44
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/115339/discussion-between-jezor-and-sunil-pawar). – Jezor Jun 22 '16 at 17:45
