1

How would I extract metadata (e.g. FileSize, FileModifyDate, FileAccessDate) from a docx file?

Davide Fiocco
achen17
  • You should look at the approach suggested in [How to retrieve the author of an office file in python?](https://stackoverflow.com/questions/7021141/how-to-retrieve-author-of-a-office-file-in-python) – dspencer Apr 16 '20 at 03:53
  • Does this answer your question? [How to retrieve the author of an office file in python?](https://stackoverflow.com/questions/7021141/how-to-retrieve-the-author-of-an-office-file-in-python) – parvus Nov 05 '21 at 08:05

3 Answers

4

You could use python-docx. Its Document object exposes a core_properties property, which provides 15 metadata attributes such as author, category, etc.
The code below extracts some of these attributes into a Python dictionary:

import docx

def getMetaData(doc):
    metadata = {}
    prop = doc.core_properties
    metadata["author"] = prop.author
    metadata["category"] = prop.category
    metadata["comments"] = prop.comments
    metadata["content_status"] = prop.content_status
    metadata["created"] = prop.created
    metadata["identifier"] = prop.identifier
    metadata["keywords"] = prop.keywords
    metadata["last_modified_by"] = prop.last_modified_by
    metadata["language"] = prop.language
    metadata["modified"] = prop.modified
    metadata["subject"] = prop.subject
    metadata["title"] = prop.title
    metadata["version"] = prop.version
    return metadata

file_path = "your.docx"  # placeholder: replace with the path to your .docx file
doc = docx.Document(file_path)
metadata_dict = getMetaData(doc)
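The created and modified entries of that dictionary are datetime objects (or None when unset), so the dictionary cannot be passed to json.dumps as-is. A minimal sketch of one way to serialize it, assuming ISO 8601 strings are an acceptable representation; metadata_to_json is a hypothetical helper name, not part of python-docx:

```python
import json
from datetime import datetime


def metadata_to_json(metadata):
    """Serialize a metadata dict to JSON, writing datetimes as ISO 8601 strings."""

    def encode(value):
        # json calls this only for values it cannot serialize itself
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(metadata, default=encode, indent=2)
```

For example, print(metadata_to_json(metadata_dict)) would print the dictionary built above in a readable form.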
Davide Fiocco
Hit2Mo
  • Instead of typing all of that out, you could also have used `for d in dir(prop): if not d.startswith('_'): metadata[d] = getattr(prop, d)` (with appropriate line breaks that I don't know how to do in a comment) – Joe Dec 16 '22 at 14:17
  • @Joe - I tried your solution. It fails with an error 'TypeError: 'str' object is not callable' on the dir(prop).method. – Fred Feb 03 '23 at 14:09
  • @Fred, I posted it as an answer, instead of a comment, so that you can copy it exactly as it works for me. Does it still give you the error? – Joe Feb 04 '23 at 14:39
2

Same solution as previous answer - just a little less typing.

import os
import docx

path = r'\Your\Path'  # raw string, so the backslashes are not read as escape sequences
os.chdir(path)

fname = 'your.docx'
doc = docx.Document(fname)

prop = doc.core_properties

# Collect every public (non-underscore) attribute of the core properties
metadata = {}
for d in dir(prop):
    if not d.startswith('_'):
        metadata[d] = getattr(prop, d)

print(metadata)
Joe
0

Here's a reusable and concise function based on the solutions above.

import os
from typing import Dict
import docx
from docx.document import Document
from docx.opc.coreprops import CoreProperties

def get_docx_metadata(docpath: str) -> Dict:
    """Return the core properties of a .docx file plus its path and file name."""
    filename = os.path.basename(docpath)
    doc: Document = docx.Document(docpath)
    props: CoreProperties = doc.core_properties
    # dir() already yields strings, so no str() conversion is needed
    metadata = {p: getattr(props, p) for p in dir(props) if not p.startswith('_')}
    metadata['filepath'] = docpath
    metadata['filename'] = filename
    return metadata
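The question also mentions FileSize, FileModifyDate and FileAccessDate. Those are file-system attributes rather than document core properties, so they are not available from core_properties. A minimal sketch of reading them with the standard library, assuming os.stat values are what is wanted; get_file_stats and its dictionary keys are hypothetical names chosen to mirror the question:

```python
import os
from datetime import datetime, timezone


def get_file_stats(path):
    """Return size and timestamps of a file as reported by the file system."""
    st = os.stat(path)
    return {
        "FileSize": st.st_size,  # size in bytes
        # Timestamps are converted to timezone-aware UTC datetimes
        "FileModifyDate": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "FileAccessDate": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }
```

These values could be merged into the result of get_docx_metadata with metadata.update(get_file_stats(docpath)). Note that some file systems do not update the access time on every read, so FileAccessDate may lag behind the actual last access.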
John Bonfardeci