How to retrieve the author of an office file in python?

Question

The title explains the problem: I have doc and docx files whose author information I want to retrieve so that I can restructure my files.

os.stat returns only file-system information such as size and timestamps.
open(filename, 'rb').read(200) returns many characters that I could not parse.

There is a module called xlrd for reading xlsx files. Yet, this still doesn't let me read doc or docx files. I am aware that the newer Office files are not easily read by non-MS Office programs, so if that's impossible, gathering info from old Office files would suffice.

Does this have to work without Word installed? You could always use the Word COM object if there are no native libraries. – Jacob, Aug 11 '11 at 05:38
Since I am creating a utility script for myself, it doesn't *have to*, I can use it on windows and I'll be fine. But yet, I would like it to work on any platform w/o Word installed. So, my preferred choice would be no dependency on installed software. I'll check the `COM object` out also. — Umur Kontacı, Aug 11 '11 at 05:50

Zach Kelling · Accepted Answer · 2011-08-11T06:24:01.860

6

Since docx files are just zipped XML you could just unzip the docx file and presumably pull the author information out of an XML file. Not quite sure where it'd be stored, just looking around at it briefly leads me to suspect it's stored as dc:creator in docProps/core.xml.

Here's how you can open the docx file and retrieve the creator:

import zipfile, lxml.etree

# open zipfile
zf = zipfile.ZipFile('my_doc.docx')
# use lxml to parse the xml file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve creator
ns={'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
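
If lxml is not available, the same idea can be sketched with only the standard library. The helper name `read_core_property` is hypothetical, and the sketch assumes the properties are direct children of the root element of docProps/core.xml. It returns None when a property such as `dc:subject` is absent, which is one way to check for membership:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefixes used in docProps/core.xml.
NS = {
    'dc': 'http://purl.org/dc/elements/1.1/',
    'cp': 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties',
}

def read_core_property(path, name):
    """Return the text of a core property such as 'dc:creator', or None if absent.

    Hypothetical helper: `path` may be a file name or a binary file object.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read('docProps/core.xml'))
    # find() looks at direct children of the root element only.
    element = root.find(name, NS)
    return element.text if element is not None else None
```

Calling `read_core_property('my_doc.docx', 'dc:subject')` would then give either the subject text or None. A file without docProps/core.xml raises KeyError from `zf.read`.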

edited Aug 11 '11 at 06:24

answered Aug 11 '11 at 06:04

Zach Kelling


Hi, sorry for reviving such an old post but how would you also check for membership? For instance if I want to check if `subject` exists apart from `creator`, how would I do it? Thanks in advance. – stratis May 28 '14 at 10:55
@kstratis you probably should make a new question, and in that new question it could help to add a note that references this question. – Raj Sep 27 '17 at 15:26

score 2 · Answer 2 · answered Sep 21 '18 at 17:41

How about using the python-docx library? It lets you pull more information about the file than just the author.

#sudo pip install python-docx
#sudo pip2 install python-docx
#sudo pip3 install python-docx


import docx

file_name = 'file_path_name.docx'

document = docx.Document(docx = file_name)
core_properties = document.core_properties
print(core_properties.author)
print(core_properties.created)
print(core_properties.last_modified_by)
print(core_properties.last_printed)
print(core_properties.modified)
print(core_properties.revision)
print(core_properties.title)
print(core_properties.category)
print(core_properties.comments)
print(core_properties.identifier)
print(core_properties.keywords)
print(core_properties.language)
print(core_properties.subject)
print(core_properties.version)
print(core_properties.content_status)

More information is available in the python-docx documentation and in its GitHub repository.

score 2 · Answer 3 · answered Aug 11 '11 at 06:08

You can use COM interop to access the Word object model. This link talks about the technique: http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/

The secret when working with any of the Office objects is knowing which item to access from the overwhelming number of methods and properties. In this case each document has a collection called BuiltInDocumentProperties. The property of interest is "Author"; "Last Author" gives the person who last saved the document.

After you open the document you can access the author with something like word.ActiveDocument.BuiltInDocumentProperties("Author")

score 2 · Answer 4 · edited Oct 22 '21 at 17:42

For old office documents (.doc, .xls) you can use hachoir-metadata.

It does not work well with the new file formats: for example, it can parse .xlsx files, but will not provide you with an Author name.

edited Oct 22 '21 at 17:42

parvus


answered Aug 12 '11 at 09:39

johnbaum


score 1 · Answer 5 · answered Oct 23 '21 at 17:24

The newer Office formats are just zip containers containing xml files. You can have a look here https://github.com/profHajal/Microsoft-Office-Documents-Metadata-with-Python/blob/main/mso_md.py for a very simple straightforward approach.

The code listed is easily extendable for OpenOffice formats.

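In outline:

import zipfile
import xml.dom.minidom

filename = 'my_doc.docx'  # or an OpenOffice file such as 'my_doc.odt'
tag = 'dc:creator'        # the tag you're interested in

with zipfile.ZipFile(filename, 'r') as z:
    # use 'docProps/core.xml' for MS Office files, 'meta.xml' for OpenOffice files
    data = z.read('docProps/core.xml')
doc = xml.dom.minidom.parseString(data)
metadata_string = doc.getElementsByTagName(tag)[0].childNodes[0].data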

Files to search metadata in:

docProps/core.xml for MS Office files
meta.xml for OpenOffice files

A non-exhaustive list of tags you can search for:

From the Dublin core namespace rules: `dc`

Title: dc:title
Creator: dc:creator (the original author in MS Office files; the author of the most recent modification in OpenOffice files)
Description: dc:description
Subject: dc:subject
Date (last modified): dc:date
Language: ???

From the ODF specification: `meta`

Generator (creating software application): meta:generator
Keywords: meta:keyword
Initial Creator: ???
Creation Date and Time: meta:creation-date
Modification Date and Time: ???
Print Date and Time: ???
Document Template: meta:template (data in attributes)
Document Statistics (word count, page count, etc.): meta:document-statistic (data in attributes)

MS Office specific:

Creation Date and Time: dcterms:created
Date (last modified): dcterms:modified
Creator of most recent modification: cp:lastModifiedBy
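
The file and tag lists above can be combined into one small helper. This is a minimal sketch rather than the code from the linked repository: the function name `read_metadata` is hypothetical, and it assumes that any archive without docProps/core.xml is an OpenOffice file with a meta.xml. Tags that are missing, or that carry their data in attributes only, come back as None.

```python
import zipfile
import xml.dom.minidom

def read_metadata(path, tags):
    """Return {tag: text or None} for an MS Office or OpenOffice file.

    Hypothetical helper: `path` may be a file name or a binary file object.
    """
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
        # MS Office keeps metadata in docProps/core.xml, OpenOffice in meta.xml.
        member = 'docProps/core.xml' if 'docProps/core.xml' in names else 'meta.xml'
        doc = xml.dom.minidom.parseString(z.read(member))
    result = {}
    for tag in tags:
        nodes = doc.getElementsByTagName(tag)
        # Guard against missing elements and elements without a text child.
        if nodes and nodes[0].firstChild is not None:
            result[tag] = nodes[0].firstChild.data
        else:
            result[tag] = None
    return result
```

For example, `read_metadata('report.odt', ['dc:creator', 'meta:generator'])` would return a dictionary keyed by those two tags.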
