3

I am trying to count how often specified words occur in multiple files in a directory. Thanks to this answer here, I can get results for words that do occur. However, I can't figure out how to also display a result when a word occurs 0 times.

For example, this is the kind of result I want: I always get results for all specified words, with the specified words in the first row and the counts below.

21, 23, 60
4, 0, 8

Here's my current code:

import csv
import copy
import os
import sys
import glob
import string
import fileinput
from collections import Counter

def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj"""
    # initialise the counter to 0 for each word
    ct = Counter(dict((w, 0) for w in words))
    file_words = (word for line in fileobj for word in line.split())
    filtered_words = (word for word in file_words if word in words)
    return Counter(filtered_words)


def count_words_in_dir(dirpath, words, action):
    """For each .txt_out file in a dir, count the specified words"""
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt_out')):
        with open(filepath) as f:
            ct = word_frequency(f, words)
            action(filepath, ct)


def final_summary(filepath, ct):
    words = sorted(ct.keys())
    counts = [str(ct[k]) for k in words]
    with open('new.csv','a') as f:
        f.write('{0},{1}\n,{2}\n'.format(
            filepath,
            ', '.join(words),
            ', '.join(counts)))


words = set(['21','23','60','75','79','86','107','121','147','193','194','197','198','199','200','201','229','241','263','267','309','328'])
count_words_in_dir(r'C:\Users\jllevent\Documents\PE Submsissions\Post-CLI', words, action=final_summary)
Jack Levent
  • Some additional notes: You can save some work initializing `ct` by letting `dict.fromkeys` do the work for you: `ct = Counter(dict.fromkeys(words, 0))`. You can also push more work to the C layer by using builtins that are implemented in C, e.g. `file_words = itertools.chain.from_iterable(map(str.split, fileobj))` and `filtered_words = filter(frozenset(words).__contains__, file_words)`, followed by `ct.update(filtered_words)` (although `Counter.update` is implemented in Python, the heavy lifting in modern Python is done by the C accelerated `collections._count_elements`). – ShadowRanger Jun 03 '16 at 18:18
  • Note: `map` and `filter` should be the Python 3 versions to get the best performance without large, wasteful temporary objects; in Python 2, you can do `from future_builtins import map, filter` to get the generator based versions of these functions. Also, if you're on Python 2.7/3.1 or higher, you can use a `set` literal instead of a `list` literal wrapped in the `set` constructor: `words = {'21','23','60','75','79','86','107','121','147','193','194','197','198','199','200','201','229','241','263','267','309','328'}` – ShadowRanger Jun 03 '16 at 18:19
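
The suggestions in these comments could be combined roughly as follows. This is only a sketch, assuming Python 3; the `wanted` name and the in-memory `sample` file are illustrative and not part of the original code.

```python
import io
from collections import Counter
from itertools import chain


def word_frequency(fileobj, words):
    """Count the specified words in fileobj, keeping zero counts."""
    wanted = frozenset(words)
    # Start every requested word at 0 so absent words still show up.
    ct = Counter(dict.fromkeys(wanted, 0))
    # Lazily split every line into words without building a big list.
    file_words = chain.from_iterable(map(str.split, fileobj))
    # Keep only the requested words and add them to the existing counts.
    ct.update(filter(wanted.__contains__, file_words))
    return ct


# Hypothetical stand-in for one of the .txt_out files.
sample = io.StringIO("21 23 21\n60 99\n")
counts = word_frequency(sample, {'21', '23', '60', '75'})
```

Here `'75'` never appears in the sample text, yet it remains a key in `counts` with a value of 0, because the counter is updated in place rather than rebuilt.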

2 Answers

1

You never use the `ct` Counter you constructed in `word_frequency`; instead, you construct a new Counter that only contains the words that were actually found. You need to update and return your constructed `ct`, e.g.:

...
for word in file_words:
    if word in words:
        ct[word] += 1
return ct

Or as pointed out by @ShadowRanger below:

ct.update(word for word in file_words if word in words)
return ct
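
For reference, a complete version of the function with this change might look like the sketch below. It assumes Python 3, and the in-memory `sample` file plus the `header` and `row` names are made up for illustration.

```python
import io
from collections import Counter


def word_frequency(fileobj, words):
    """Build a Counter of specified words in fileobj, zeros included."""
    # Initialise the counter to 0 for each requested word.
    ct = Counter(dict.fromkeys(words, 0))
    file_words = (word for line in fileobj for word in line.split())
    # Update the pre-filled counter instead of returning a new one.
    ct.update(word for word in file_words if word in words)
    return ct


# Hypothetical stand-in for a file on disk.
sample = io.StringIO("21 60 21\n60 60 7\n")
ct = word_frequency(sample, {'21', '23', '60'})

header = sorted(ct)                  # the specified words
row = [str(ct[w]) for w in header]   # their counts, zeros included
print(', '.join(header))  # 21, 23, 60
print(', '.join(row))     # 2, 0, 3
```

Because every requested word is a key from the start, `'23'` is reported with a count of 0 even though it never occurs in the sample.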
AChampion
-1

It looks like it's returning NULL if the word doesn't appear. Put in a conditioned return statement, where if the value it's returning isn't an int > 0, return 0.