Bash wc reporting much lower wordcount than LibreOffice and Google Docs

Question

I'm trying to complete NaNoWriMo, which involves keeping track of your word count in order to meet the goal of writing 50,000 words. I've been doing so with a Python script:

import glob
def count_words(ftype):
    wordcount = 0
    for found_file in glob.glob(ftype):
        with open(found_file, 'r') as chapter:
            for line in chapter:
                if line.strip():
                    words = line.split(' ')
                    wordcount += len(words)
    return wordcount

>>> count_words('*md')
14696

However, I've just realized that the Bash 'wc' command (which I just learned about) disagrees:

~/nano$ wc *md -w
 2656 ch01.md
  438 ch02.md
 2112 ch03.md
 1246 ch04.md
 2367 ch05.md
 2131 ch06.md
 1406 ch07.md
 1060 ch08.md
   21 rules.md
13437 total

So the total word count reported by `wc` is only 13,437 words, about 1,250 fewer than my script reports.

Dammit, I'm behind! What's going on? LibreOffice and Google Docs, by the way, agree with `wc`, so I'm tagging this as a Python question because I'm pretty sure that the problem is with my script.

What format are you writing these docs in? If it's not plain text or something lightweight like Markdown, `wc` isn't going to give you an accurate count. Suppose you try to count a .docx file: it's compressed XML, not words. – tdelaney, Nov 09 '16 at 18:48

score 0 · Answer 1 · answered Nov 09 '16 at 18:49

Figured it out: it was counting the \n character at the end of each line as a separate word (because I put a space in front of it, to separate the last word of that line from the first word of the next).
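
To make the effect concrete, here is a minimal sketch using a made-up sample line (the text is an assumption, not taken from the chapters) that ends with a space before the newline:

```python
# Hypothetical sample line: note the space before the trailing newline.
line = "the quick brown fox \n"

# Original script: strip() is only used for the emptiness check,
# so the unstripped line is what actually gets split.
tokens_before = line.split(' ')

# Fixed script: strip first, then split.
tokens_after = line.strip().split(' ')

print(tokens_before)  # ['the', 'quick', 'brown', 'fox', '\n']
print(len(tokens_before), len(tokens_after))  # 5 4
```

Every line that ends this way contributes one extra "word" in the original script, which adds up quickly over several chapters.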

Well, at least I caught it early...

Updated code:

import glob
def count_words(ftype):
    wordcount = 0
    for found_file in glob.glob(ftype):
        with open(found_file, 'r') as chapter:
            for line in chapter:
                line = line.strip()
                if line:
                    words = line.split(' ')
                    wordcount += len(words)
    return wordcount

>>> count_words('*md')
13895

answered Nov 09 '16 at 18:49

Ben Quigley


You may find `words = re.findall(r'\w+', line)` more accurate still. It will skip the newline and other markdown formatting. – tdelaney Nov 09 '16 at 18:55
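
A rough sketch of what this suggestion does in practice. The helper name `count_words_regex` and the sample strings are assumptions made up for illustration:

```python
import re

def count_words_regex(text):
    """Count runs of word characters (letters, digits, underscore)."""
    return len(re.findall(r'\w+', text))

# Hypothetical markdown heading: the '#' and '*' markers are not counted.
heading = "# Chapter *One*: it begins\n"
print(count_words_regex(heading))  # 4
print(len(heading.split()))        # 5, because '#' counts as a token

# Caveat: an apostrophe splits a contraction into two matches.
print(re.findall(r'\w+', "don't stop"))  # ['don', 't', 'stop']
```

So the regex ignores stray formatting characters, but it may count contractions twice, which means it will not necessarily agree with `wc -w` either.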

Calling `line.split()` without arguments is the usual way to separate words, since it splits on any run of whitespace. See the answers to [this post](http://stackoverflow.com/questions/19410018/how-to-count-the-number-of-words-in-a-sentence) for ways to do it. – Dart Feld Nov 09 '16 at 18:56
How many words do you people think I want to reason myself out of??? ;) – Ben Quigley Nov 09 '16 at 18:59
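
For reference, a minimal sketch of the `split()` suggestion from the comments, reusing the function name from the answer. It counts whitespace-delimited tokens, which is essentially how `wc -w` defines a word, so it should land close to that total. This has not been checked against the chapter files in the question.

```python
import glob

def count_words(pattern):
    """Count whitespace-delimited words in all files matching pattern."""
    total = 0
    for path in glob.glob(pattern):
        with open(path, 'r') as chapter:
            for line in chapter:
                # split() with no arguments collapses runs of whitespace and
                # drops leading/trailing whitespace, so no strip() is needed
                # and blank lines simply contribute zero.
                total += len(line.split())
    return total

# Why split(' ') can still overcount after stripping: consecutive spaces
# produce empty strings, and each one is counted as a word.
sample = "two  spaces here"   # hypothetical text with a double space
print(sample.split(' '))  # ['two', '', 'spaces', 'here']
print(sample.split())     # ['two', 'spaces', 'here']
```

Double spaces inside a line are one plausible reason the updated script still reports more words than `wc`.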
