These answers miss a trick as regards MS Word & .odt.
MS Word records a .docx file's word count whenever it is saved. A .docx file is simply a zip file. Accessing the "Words" (= word count) property therein is simple and can be done with modules from the standard library:
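import zipfile
import xml.etree.ElementTree as ET

# docx_file_paths is assumed to be an iterable of paths to .docx files
total_word_count = 0
for docx_file_path in docx_file_paths:
    # a .docx file is a zip archive; the with-block closes it when done
    with zipfile.ZipFile(docx_file_path) as zin:
        for item in zin.infolist():
            # the document statistics are stored in docProps/app.xml
            if item.filename == 'docProps/app.xml':
                buffer = zin.read(item.filename)
                root = ET.fromstring(buffer.decode('utf-8'))
                for child in root:
                    # tags are namespace-qualified, hence endswith()
                    if child.tag.endswith('Words'):
                        print(f'{docx_file_path} word count {child.text}')
                        total_word_count += int(child.text)
print(f'total word count all files {total_word_count}')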
Pros and cons: the main pro is that, for most files, this is going to be far faster than anything else.
The main con is that you're stuck with the various idiosyncrasies of MS Word's counting methods: I am not particularly interested in the details, but I know that these have changed over the versions (e.g. words in text boxes may or may not be included). However, the same sort of complications apply if you choose to pick apart and parse the entire text content of a .docx file. The various available modules, e.g. python-docx, seem to do a pretty good job, but in my experience none is perfect.
If you actually extract and parse, by yourself, the word/document.xml file inside a .docx file (the counterpart of content.xml in an .odt file), you begin to realise that there are some daunting complexities involved.
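To give a feel for what parsing it yourself involves, here is a minimal sketch of a naive count over the main body part, word/document.xml. The function name naive_docx_word_count is my own invention, and the approach (treat paragraphs, tabs and breaks as separators, then split on whitespace) is only an assumption about what a reasonable count might be. It will not agree with Word's figure in every case.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
# Elements treated as word separators in this sketch.
SEPARATORS = {W_NS + 'p', W_NS + 'tab', W_NS + 'br'}


def naive_docx_word_count(docx_file):
    """Count whitespace-separated words found in word/document.xml.

    Hypothetical helper: headers, footers, footnotes and anything else
    stored in other parts of the archive are ignored.
    """
    with zipfile.ZipFile(docx_file) as zin:
        root = ET.fromstring(zin.read('word/document.xml'))
    parts = []
    # root.iter() walks the tree in document order.
    for element in root.iter():
        if element.tag in SEPARATORS:
            parts.append(' ')
        elif element.tag == W_NS + 't':
            # A single word can be split across several w:t runs,
            # so runs are concatenated without a separator.
            parts.append(element.text or '')
    return len(''.join(parts).split())
```

Even this leaves plenty out: text-box content, for instance, can be stored more than once in the XML (a modern version plus a fallback), in which case this sketch would count it twice.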
.odt files
Again, these are zip files, and again a similar property is found in meta.xml. I just created and unzipped one such file, and the meta.xml in it looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.3">
<office:meta>
<meta:creation-date>2023-06-11T18:25:09.898000000</meta:creation-date>
<dc:date>2023-06-11T18:25:21.656000000</dc:date>
<meta:editing-duration>PT11S</meta:editing-duration>
<meta:editing-cycles>1</meta:editing-cycles>
<meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="2" meta:character-count="12" meta:non-whitespace-character-count="11"/>
<meta:generator>LibreOffice/7.4.6.2$Windows_X86_64 LibreOffice_project/5b1f5509c2decdade7fda905e3e1429a67acd63d</meta:generator>
</office:meta>
</office:document-meta>
Thus you need to look at root['office:meta']['meta:document-statistic'], attribute meta:word-count.
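A minimal sketch of that lookup with the standard library might look like the following. The function name odt_word_count is my own, and the namespace URIs are copied from the meta.xml shown above. I am assuming other .odt files use the same layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the sample meta.xml.
NS = {
    'office': 'urn:oasis:names:tc:opendocument:xmlns:office:1.0',
    'meta': 'urn:oasis:names:tc:opendocument:xmlns:meta:1.0',
}


def odt_word_count(odt_file):
    """Return the word count recorded in an .odt file's meta.xml, or None."""
    with zipfile.ZipFile(odt_file) as zin:
        root = ET.fromstring(zin.read('meta.xml'))
    stats = root.find('office:meta/meta:document-statistic', NS)
    if stats is None:
        return None
    # ElementTree stores attribute names in '{namespace-uri}name' form.
    value = stats.get('{%s}word-count' % NS['meta'])
    return int(value) if value is not None else None
```

Returning None rather than raising when the statistic is absent is just a design choice for the sketch: it lets the caller fall back on some other counting method.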
I don't know about PDFs: they may well require brute-force counting. PyPDF2 looks like the way to go: the simplest approach would be to extract the text and count the words in that. I have no idea what might be missed out.
And a scanned PDF, for example, may be hundreds of pages long but be said to contain "0 words". Or indeed there may be scanned text interspersed with bona fide text content...
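For what it's worth, the brute-force approach could be sketched roughly as below. I haven't tested it against a range of real files. It assumes PyPDF2 version 2 or later is installed (it is a third-party package), and the helper names count_words and pdf_word_count are made up for this example.

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def pdf_word_count(pdf_path):
    """Brute-force count over whatever text can be extracted from a PDF."""
    # Imported here so that count_words() works without the library.
    from PyPDF2 import PdfReader  # third-party; assumes PyPDF2 2.x or later

    reader = PdfReader(pdf_path)
    total = 0
    for page in reader.pages:
        # Scanned pages typically yield no extractable text at all.
        total += count_words(page.extract_text() or '')
    return total
```

A result of 0 for a document that plainly has pages of text is a strong hint that you are looking at a scan, which would need OCR before any counting makes sense.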