
Hello Stack Overflow community! I've used this community for years to accomplish small one-off projects for work, school, and personal exploration; however, this is the first question I've posted ... so be delicate ;)

I'm trying to read every file from a directory and all of its subdirectories, then accumulate the results into one dictionary with Python. Right now the script (see below) reads all the files as required, but the results are printed separately for each file. I'm looking for help accumulating them into one.

Code

import re
import os
import sys
import os.path
import fnmatch
import collections

def search( file ):

    if os.path.isdir(path) == True:
        for root, dirs, files in os.walk(path):
            for file in files:
              #  words = re.findall('\w+', open(file).read().lower())
                words = re.findall('\w+', open(os.path.join(root, file)).read().lower())
                ignore = ['the','a','if','in','it','of','or','on','and','to']
                counter=collections.Counter(x for x in words if x not in ignore)
                print(counter.most_common(10))

    else:
        words = re.findall('\w+', open(path).read().lower())
        ignore = ['the','a','if','in','it','of','or','on','and','to']
        counter=collections.Counter(x for x in words if x not in ignore)
        print(counter.most_common(10))

path = raw_input("Enter file and path")
search(path)

Results

Enter file and path./dirTest

[('this', 1), ('test', 1), ('is', 1), ('just', 1)]

[('this', 1), ('test', 1), ('is', 1), ('just', 1)]

[('test', 2), ('is', 2), ('just', 2), ('this', 1), ('really', 1)]

[('test', 3), ('just', 2), ('this', 2), ('is', 2), ('power', 1),
('through', 1), ('really', 1)]

[('this', 2), ('another', 1), ('is', 1), ('read', 1), ('can', 1),
('file', 1), ('test', 1), ('you', 1)]

Desired Results - Example

[('this', 5), ('another', 1), ('is', 5), ('read', 1), ('can', 1),
('file', 1), ('test', 5), ('you', 1), ('power', 1), ('through', 1),
('really', 2)]

Any guidance would be greatly appreciated!


3 Answers


The problem is with your print statements and how you use the Counter object. I would suggest the following:

ignore = ['the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to']

def extract(file_path, counter):
    words = re.findall('\w+', open(file_path).read().lower())
    counter.update([x for x in words if x not in ignore])

def search(file):
    counter = collections.Counter()

    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                extract(os.path.join(root, file), counter)
    else:
        extract(path, counter)

    print(counter.most_common(10))

You can separate out the common lines of code into their own function. Also, os.path.isdir(path) returns a bool, so you can use it directly in the if condition without comparing it to True.

Initial solution: My solution was to append all your words to one list and then pass that list to Counter. That way you can produce one output with your results.

Following the performance point raised by @ShadowRanger, you can update the counter directly instead of building a separate list.
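To illustrate the difference, here is a minimal sketch of both approaches; the function names and the paths argument are hypothetical, not part of the answer above.

import collections
import re

IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}

def count_with_list(paths):
    # Initial approach: collect every non-ignored word into one big list,
    # then count once at the end. Memory grows with the total word count.
    all_words = []
    for p in paths:
        with open(p) as f:
            words = re.findall(r'\w+', f.read().lower())
        all_words.extend(x for x in words if x not in IGNORE)
    return collections.Counter(all_words)

def count_with_updates(paths):
    # Updated approach: keep one Counter and update it per file.
    # Memory grows only with the number of *unique* words seen.
    counter = collections.Counter()
    for p in paths:
        with open(p) as f:
            words = re.findall(r'\w+', f.read().lower())
        counter.update(x for x in words if x not in IGNORE)
    return counter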

Chamath
  • I'd strongly recommend `update`ing the `Counter` as you go, not preserving a non-unique `list` of all words until the end and then counting them. If the input files are of any real size, storing all non-unique words in them will exhaust RAM, where storing only the unique words and the count will use substantially less. Even if it doesn't exhaust RAM, it's still consuming much more memory than simple counting requires. – ShadowRanger Feb 28 '18 at 01:38
  • @ShadowRanger Thanks for the performance point of view. I updated the answer. – Chamath Feb 28 '18 at 01:46
  • I have a follow-up question: adding support for .pdf files to this solution. There are answers to parsing .pdf files with Python on Stack Overflow (like this one: https://stackoverflow.com/questions/18755412/parse-a-pdf-using-python) but I'm not sure if it's an elegant solution to integrate with the counter solution (extract(os.path.join(root, file), counter)). – MosaixSolutions Mar 02 '18 at 17:29
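Regarding the PDF follow-up in the comment above: one way to slot PDF text extraction into the same extract()/counter pattern is sketched below. This assumes the third-party pypdf package, and extract_pdf is a hypothetical helper, not part of the original answer.

import re
from pypdf import PdfReader  # third-party: pip install pypdf

def extract_pdf(file_path, counter, ignore=()):
    # Pull the text out of every page, then count words the same way
    # extract() above does for plain-text files.
    reader = PdfReader(file_path)
    text = ' '.join(page.extract_text() or '' for page in reader.pages)
    words = re.findall(r'\w+', text.lower())
    counter.update(x for x in words if x not in ignore)

Inside the os.walk loop, files ending in .pdf could be routed to extract_pdf() and everything else to extract().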

Looks like you want a single Counter with all the accumulated stats that you print at the end, but instead you're making a Counter for each file, printing it, then throwing it away. You just need to move Counter initialization and printing outside your loop, and only update the "one true Counter" for each file:

def search(file):
    # Initialize empty Counter up front
    counter = collections.Counter()
    # Create ignore only once, and make it a set, so membership tests go faster
    ignore = {'the','a','if','in','it','of','or','on','and','to'}
    if os.path.isdir(path):  # Comparing to True is anti-pattern; removed
        for root, dirs, files in os.walk(path):
            for file in files:
                words = re.findall('\w+', open(os.path.join(root, file)).read().lower())
                # Update common Counter
                counter.update(x for x in words if x not in ignore)

    else:
        words = re.findall('\w+', open(path).read().lower())
        # Update common Counter
        counter.update(x for x in words if x not in ignore)
    # Do a single print at the end
    print(counter.most_common(10))

You could factor out the common code here if you wished, e.g.:

def update_counts_for_file(path, counter, ignore=()):
    with open(path) as f:  # Using with statements is good, always do it
        words = re.findall('\w+', f.read().lower())
    counter.update(x for x in words if x not in ignore)

allowing you to replace the repetitive code with a call to the factored out code, but unless the code gets significantly more complicated, it's probably not worth factoring out two lines repeated only twice.
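For illustration, a minimal sketch of what search() might look like once it calls the factored-out helper; unlike the original, this version takes the path as a parameter instead of relying on the global.

def search(path):
    counter = collections.Counter()
    ignore = {'the', 'a', 'if', 'in', 'it', 'of', 'or', 'on', 'and', 'to'}
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for name in files:
                # One call per file updates the shared Counter in place
                update_counts_for_file(os.path.join(root, name), counter, ignore)
    else:
        update_counts_for_file(path, counter, ignore)
    print(counter.most_common(10))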

ShadowRanger

I see that you are trying to find certain keywords from the file/directory scan and get the count of occurrences.

Basically, you can get a list of all such occurrences and then find the count of each, like so:

def count_all(array):
    nodup = list(set(array))   # unique words only
    for i in nodup:
        print(i, array.count(i))

array = ['this', 'this', 'this', 'is', 'is']
count_all(array)
out:
('this', 3)
('is', 2)
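For comparison, the same counts can be produced in a single pass with collections.Counter instead of calling list.count() once per unique word; a minimal sketch:

import collections

array = ['this', 'this', 'this', 'is', 'is']
for word, count in collections.Counter(array).most_common():
    print(word, count)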
Manjit Ullal
  • This is a solution to a loosely related problem, and doesn't help solve the OP's issue at all. You managed to make your solution asymptotically worse too, turning `O(n)` work into `O(n²)` work. – ShadowRanger Feb 28 '18 at 01:30