How would I extract metadata (e.g. FileSize, FileModifyDate, FileAccessDate) from a docx file?
1
- You should look at the approach suggested in [How to retrieve the author of an office file in python?](https://stackoverflow.com/questions/7021141/how-to-retrieve-author-of-a-office-file-in-python) – dspencer Apr 16 '20 at 03:53
- Does this answer your question? [How to retrieve the author of an office file in python?](https://stackoverflow.com/questions/7021141/how-to-retrieve-the-author-of-an-office-file-in-python) – parvus Nov 05 '21 at 08:05
3 Answers
4
You could use `python-docx`. Its `Document` object has a `core_properties` property (not a method) that exposes 15 metadata attributes such as author, category, etc.
See the code below to extract some of the metadata into a Python dictionary:

    import docx

    def getMetaData(doc):
        """Collect selected core properties of a python-docx Document into a dict."""
        metadata = {}
        prop = doc.core_properties
        metadata["author"] = prop.author
        metadata["category"] = prop.category
        metadata["comments"] = prop.comments
        metadata["content_status"] = prop.content_status
        metadata["created"] = prop.created
        metadata["identifier"] = prop.identifier
        metadata["keywords"] = prop.keywords
        metadata["last_modified_by"] = prop.last_modified_by
        metadata["language"] = prop.language
        metadata["modified"] = prop.modified
        metadata["subject"] = prop.subject
        metadata["title"] = prop.title
        metadata["version"] = prop.version
        return metadata

    # file_path must be set to the path of your .docx file
    doc = docx.Document(file_path)
    metadata_dict = getMetaData(doc)
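Note that `core_properties` covers document-level metadata stored inside the file. The attributes named in the question (FileSize, FileModifyDate, FileAccessDate) are file-system attributes, and they can probably be read with the standard library alone. Below is a minimal sketch using `os.stat`; the function name `get_file_stats` and the dictionary keys are made up for illustration, and the timestamps are converted to UTC here as an arbitrary choice.

```python
import os
from datetime import datetime, timezone


def get_file_stats(path):
    """Return file-system metadata for `path` (works for any file, not only .docx)."""
    st = os.stat(path)
    return {
        # Size of the file in bytes
        "file_size": st.st_size,
        # Last modification time, as a timezone-aware UTC datetime
        "file_modify_date": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        # Last access time; some file systems update this lazily or not at all
        "file_access_date": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }
```

The result could be merged into the dictionary above, for example with `metadata_dict.update(get_file_stats(file_path))`.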

Davide Fiocco

Hit2Mo
- Instead of typing all of that out, you could also have used `for d in dir(prop): if not d.startswith('_'): metadata[d] = getattr(prop, d)` (with appropriate line breaks that I don't know how to do in a comment) – Joe Dec 16 '22 at 14:17
- @Joe - I tried your solution. It fails with an error 'TypeError: 'str' object is not callable' on the dir(prop).method. – Fred Feb 03 '23 at 14:09
- @Fred, I posted it as an answer, instead of a comment, so that you can copy it exactly as it works for me. Does it still give you the error? – Joe Feb 04 '23 at 14:39
2
Same solution as the previous answer, just with a little less typing.

    import os
    import docx

    # Raw string so the backslashes are not treated as escape sequences
    path = r'\Your\Path'
    fname = 'your.docx'

    doc = docx.Document(os.path.join(path, fname))
    prop = doc.core_properties

    # Copy every public attribute of the core properties into a dict
    metadata = {}
    for d in dir(prop):
        if not d.startswith('_'):
            metadata[d] = getattr(prop, d)

    print(metadata)

Joe
0
Here's a reusable and concise function based on the solutions above.

    import os
    from typing import Dict

    import docx
    from docx.document import Document
    from docx.opc.coreprops import CoreProperties

    def get_docx_metadata(docpath: str) -> Dict:
        """Return the public core properties of a .docx file plus its path and name."""
        filename = os.path.basename(docpath)
        doc: Document = docx.Document(docpath)
        props: CoreProperties = doc.core_properties
        # Keep only public attributes (names that do not start with an underscore)
        metadata = {p: getattr(props, p) for p in dir(props) if not p.startswith('_')}
        metadata['filepath'] = docpath
        metadata['filename'] = filename
        return metadata
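If installing `python-docx` is not an option, the same kind of information can probably be read with the standard library, since a .docx file is a ZIP archive. The sketch below assumes the core properties part sits at the usual location `docProps/core.xml`; a file that stores it elsewhere, or omits it, would raise a `KeyError`. The names `read_core_properties`, `FIELDS` and `NAMESPACES` are hypothetical, and only a handful of fields are mapped.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an OOXML package
NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> qualified element name (a subset chosen for illustration)
FIELDS = {
    "author": "dc:creator",
    "title": "dc:title",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(docx_file):
    """Read selected core properties from a .docx path or file-like object.

    Values are returned as raw strings; missing elements map to None.
    """
    with zipfile.ZipFile(docx_file) as archive:
        # Assumption: the part lives at the conventional path
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for name, tag in FIELDS.items():
        element = root.find(tag, NAMESPACES)
        result[name] = element.text if element is not None else None
    return result
```

Unlike `python-docx`, this returns the date fields as plain strings rather than `datetime` objects, so any parsing is left to the caller.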

John Bonfardeci