How to get a word count on a Word document in Python?

Question

I am trying to get the word counts of .doc, .docx, .odt and .pdf files. This is pretty simple for .txt files, but how can I do a word count on the other types?

I'm using Python and Django on Ubuntu, and I want to count the words in a document when a user uploads a file through the system.

score 4 · Accepted Answer · edited May 23 '17 at 12:18

First you need to read your .doc, .docx, .odt and .pdf files.

Second, count the words (Python < 2.7 version).
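Once the text has been read out of a file, the counting step is the same as for a .txt file. A minimal sketch, assuming a "word" is simply any whitespace-delimited token (the function name count_words is only illustrative and not part of any library):

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


# Example call with a string that contains irregular whitespace
print(count_words("The quick  brown\nfox"))
```

Punctuation stays attached to its neighbouring word under this definition, so the result may differ slightly from what a word processor reports.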

answered Sep 23 '11 at 13:02

DrTyrsa

mike rodent · Answer 2 · 2023-06-12T09:57:39.237

These answers miss a trick as regards MS Word & .odt.

MS Word records a .docx file's word count whenever it is saved. A .docx file is simply a zip file. Accessing the "Words" (= word count) property therein is simple and can be done with modules from the standard library:

import zipfile
import xml.etree.ElementTree as ET

# Paths of the .docx files to process (placeholder; supply your own list)
docx_file_paths = ['/path/to/file/mydocument.docx']

total_word_count = 0
for docx_file_path in docx_file_paths:
    with zipfile.ZipFile(docx_file_path) as zin:
        for item in zin.infolist():
            # docProps/app.xml holds the extended properties that Word saves
            if item.filename == 'docProps/app.xml':
                root = ET.fromstring(zin.read(item.filename))
                for child in root:
                    # Tags carry a namespace prefix, hence endswith()
                    if child.tag.endswith('Words'):
                        print(f'{docx_file_path} word count {child.text}')
                        total_word_count += int(child.text)

print(f'total word count all files {total_word_count}')

Pros and cons: the main pro is that, for most files, this is going to be far faster than anything else.

The main con is that you're stuck with the various idiosyncrasies of MS Word's counting methods: I am not particularly interested in the details, but I know that these have changed over the versions (e.g. words in text boxes may or may not be included). However, the same sort of complications apply if you choose to pick apart and parse the entire text content of a .docx file. The various available modules, e.g. python-docx, seem to do a pretty good job, but in my experience none is perfect.

If you actually extract and parse, by yourself, the word/document.xml file inside a .docx file, you begin to realise that there are some daunting complexities involved.
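To give a flavour of what parsing the body yourself involves, here is a rough sketch using only the standard library. It assumes that the main text lives in word/document.xml, that words are whitespace-delimited, and that paragraphs are not nested; the function name count_docx_body_words is made up for illustration. It ignores headers, footers, footnotes, tabs and line breaks, and text boxes nested inside a paragraph would be counted twice, which is exactly the kind of complexity mentioned above.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by the w:p (paragraph) and w:t (text) elements
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def count_docx_body_words(docx_path):
    """Count whitespace-separated words in the main body of a .docx file."""
    with zipfile.ZipFile(docx_path) as zin:
        root = ET.fromstring(zin.read('word/document.xml'))
    count = 0
    for paragraph in root.iter(W_NS + 'p'):
        # Join the text runs of one paragraph without a separator,
        # because Word may split a single word across several runs.
        text = ''.join(t.text or '' for t in paragraph.iter(W_NS + 't'))
        count += len(text.split())
    return count
```

Counting paragraph by paragraph keeps the last word of one paragraph from merging with the first word of the next.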

.odt files
Again, these are zip files, and a similar property is found in meta.xml. I just created and unzipped one such file, and its meta.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.3">
    <office:meta>
        <meta:creation-date>2023-06-11T18:25:09.898000000</meta:creation-date>
        <dc:date>2023-06-11T18:25:21.656000000</dc:date>
        <meta:editing-duration>PT11S</meta:editing-duration>
        <meta:editing-cycles>1</meta:editing-cycles>
        <meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="2" meta:character-count="12" meta:non-whitespace-character-count="11"/>
        <meta:generator>LibreOffice/7.4.6.2$Windows_X86_64 LibreOffice_project/5b1f5509c2decdade7fda905e3e1429a67acd63d</meta:generator>
    </office:meta>
</office:document-meta>

Thus you need to look at the path root -> office:meta -> meta:document-statistic and read its meta:word-count attribute.
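A sketch of that lookup with the standard library, assuming the layout of the meta.xml sample above (the function name odt_word_count is hypothetical; the namespace URIs are copied from the sample):

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs taken from the meta.xml sample
NS = {
    'office': 'urn:oasis:names:tc:opendocument:xmlns:office:1.0',
    'meta': 'urn:oasis:names:tc:opendocument:xmlns:meta:1.0',
}


def odt_word_count(odt_path):
    """Return the word count stored in an .odt file's meta.xml, or None if absent."""
    with zipfile.ZipFile(odt_path) as zin:
        root = ET.fromstring(zin.read('meta.xml'))
    stat = root.find('office:meta/meta:document-statistic', NS)
    if stat is None:
        return None
    # ElementTree stores attribute names in '{namespace-uri}local-name' form
    value = stat.get('{%s}word-count' % NS['meta'])
    return int(value) if value is not None else None
```

As with the .docx property, this returns whatever the editor stored when the file was last saved, so it is only as reliable as that editor's own counting.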

I don't know about PDF: these may well require brute-force counting. PyPDF2 looks like the way to go: the simplest approach would be to convert to txt and count that way. I have no idea what might be missed out.
And a scanned PDF, for example, may be hundreds of pages long but be said to contain "0 words". Or indeed there may be scanned text interspersed with bona fide text content...

score 0 · Answer 3 · edited May 23 '17 at 10:34

Given that you can do this for .txt files I'll assume that you know how to count the words, and that you just need to know how to read the various file types. Take a look at these libraries:

PDF: pypdf

doc/docx: this question, python-docx

odt: examples here

answered Sep 23 '11 at 18:35

andronikus

I used python-docx for docx files. I found pdfminer to be better than pypdf for converting PDF to text. Guess I'll have to go with antiword for .doc files. Still to check out odt. Thanks for your response. – darren Sep 25 '11 at 12:52

score 0 · Answer 4 · answered Aug 22 '23 at 09:00

Based on @Chad's answer to "extracting text from MS word files in python".

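import re
import zipfile

# A .docx file is a zip archive; the main body lives in word/document.xml
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Insert a space at each paragraph end so adjacent paragraphs do not run together
content = content.replace('</w:p>', ' ')
# Strip all XML tags, leaving only the text
cleaned = re.sub(r'<[^>]+>', '', content)

# Count whitespace-separated words (len(cleaned) alone would count characters)
word_count = len(cleaned.split())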
