How to extract metadata from docx file using Python?

Question

How would I extract metadata (e.g. FileSize, FileModifyDate, FileAccessDate) from a docx file?

You should look at the approach suggested in [How to retrieve the author of an office file in python?](https://stackoverflow.com/questions/7021141/how-to-retrieve-author-of-a-office-file-in-python) — dspencer, Apr 16 '20 at 03:53
Does this answer your question? [How to retrieve the author of an office file in python?](https://stackoverflow.com/questions/7021141/how-to-retrieve-the-author-of-an-office-file-in-python) — parvus, Nov 05 '21 at 08:05

score 4 · Answer 1 · edited Feb 27 '22 at 21:07

You could use python-docx. Its Document object has a core_properties property (an attribute, not a method you call) that exposes 15 metadata attributes such as author, category, and created.
The code below copies some of them into a Python dictionary:

import docx

def getMetaData(doc):
    metadata = {}
    prop = doc.core_properties
    metadata["author"] = prop.author
    metadata["category"] = prop.category
    metadata["comments"] = prop.comments
    metadata["content_status"] = prop.content_status
    metadata["created"] = prop.created
    metadata["identifier"] = prop.identifier
    metadata["keywords"] = prop.keywords
    metadata["last_modified_by"] = prop.last_modified_by
    metadata["language"] = prop.language
    metadata["modified"] = prop.modified
    metadata["subject"] = prop.subject
    metadata["title"] = prop.title
    metadata["version"] = prop.version
    return metadata

# file_path is the path to your .docx file
doc = docx.Document(file_path)
metadata_dict = getMetaData(doc)
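
One thing to watch for with the dictionary from getMetaData(doc): created and modified come back as datetime objects (or None when unset), so json.dumps will reject the dictionary as it stands. A minimal sketch of one way around that, assuming ISO 8601 strings are acceptable; the helper name to_jsonable and the sample values are made up for illustration and are not part of python-docx:

```python
import json
from datetime import datetime


def to_jsonable(metadata):
    """Copy the metadata dict, turning datetime values into ISO 8601 strings."""
    return {
        key: value.isoformat() if isinstance(value, datetime) else value
        for key, value in metadata.items()
    }


# Stand-in for the dictionary returned by getMetaData(doc); values are invented.
sample = {
    "author": "Jane Doe",
    "created": datetime(2020, 5, 26, 12, 44),
    "title": "",
}

print(json.dumps(to_jsonable(sample), indent=2))
```

Values that are not datetimes (strings, None) pass through unchanged, so the same helper should also work on a dictionary built from all the core properties.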

edited Feb 27 '22 at 21:07 by Davide Fiocco · answered May 26 '20 at 12:44 by Hit2Mo

Instead of typing all of that out, you could also have used `for d in dir(prop): if not d.startswith('_'): metadata[d] = getattr(prop, d)` (with appropriate line breaks that I don't know how to do in a comment) – Joe Dec 16 '22 at 14:17
@Joe - I tried your solution. It fails with an error 'TypeError: 'str' object is not callable' on the dir(prop).method. – Fred Feb 03 '23 at 14:09
@Fred, I posted it as an answer, instead of a comment, so that you can copy it exactly as it works for me. Does it still give you the error? – Joe Feb 04 '23 at 14:39

score 2 · Answer 2 · answered Feb 04 '23 at 14:38

Same solution as the previous answer, just with a little less typing.

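import os
import docx

# Raw string so the backslashes are not treated as escape sequences
path = r'\Your\Path'
os.chdir(path)

fname = 'your.docx'
doc = docx.Document(fname)

prop = doc.core_properties

# Collect every public attribute of the core properties object
metadata = {}
for d in dir(prop):
    if not d.startswith('_'):
        metadata[d] = getattr(prop, d)

print(metadata)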

score 0 · Answer 3 · answered Mar 01 '23 at 17:21

Here's a reusable and concise function based on the solutions above.

import os
from typing import Dict
import docx
from docx.document import Document
from docx.opc.coreprops import CoreProperties

def get_docx_metadata(docpath: str) -> Dict:
    filename = os.path.basename(docpath)
    doc: Document = docx.Document(docpath)
    props: CoreProperties = doc.core_properties
    metadata = {p: getattr(props, p) for p in dir(props) if not p.startswith('_')}
    metadata['filepath'] = docpath
    metadata['filename'] = filename
    return metadata
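
Note that the fields named in the question (FileSize, FileModifyDate, FileAccessDate) appear to be filesystem attributes rather than anything stored inside the docx, so core_properties will not return them. A minimal sketch using only the standard library's os.stat, assuming UTC timestamps are wanted; the function name get_file_stats and the dictionary keys are hypothetical choices for illustration:

```python
import os
from datetime import datetime, timezone


def get_file_stats(path):
    """Return filesystem-level metadata for any file, including a .docx."""
    st = os.stat(path)
    return {
        # Size in bytes
        "file_size": st.st_size,
        # Last modification time reported by the filesystem
        "file_modify_date": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        # Last access time reported by the filesystem
        "file_access_date": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }
```

The result could be merged into the dictionary above with metadata.update(get_file_stats(docpath)). Keep in mind that these values describe the file on disk, so they change when the file is copied or re-saved, unlike the created and modified core properties stored in the document itself.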
