
I'd like to count frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': 2, 'ccc': 1} if the target text file looks like:

# test.txt
aaa bbb ccc
bbb

I've implemented it in pure Python, following some posts. However, I've found that pure-Python approaches are insufficient because of the huge file size (> 1 GB).

I think borrowing sklearn's power is one candidate.

If you let CountVectorizer count frequencies for each line, I guess you would get word frequencies by summing up each column. But that sounds a bit indirect.

What is the most efficient and straightforward way to count words in a file with python?

Update

My (very slow) code is here:

import string
from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = {}
    with open(source_file_path) as f:
        for line in f:
            line = line.lower().translate(None, string.punctuation)
            this_wordcount = Counter(line.split())
            wordcount = add_merge_two_dict(wordcount, this_wordcount)
    return wordcount

def add_merge_two_dict(x, y):
    return { k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y) }
Light Yagmi
  • Splitting words in Python will have to allocate memory for the list and create a lot of str objects too; dictionary creation and Python hashing are not very fast either. For maximum performance you can write a C extension: look for word boundaries without copying memory, then use the fastest hash to count them, and when it's done, create the Python dict. – YOU Mar 08 '16 at 02:08
  • Are you matching certain words, or trying to count every unique "word"? How many unique words do you expect to find in a 1 GB sized file? Also, how long are the lines, on average? – Nostalg.io Mar 08 '16 at 02:15
  • You probably can't improve *that* much on execution time by switching to C or some module (a basic Python test on a dataset of 950M takes me 25s, which is not that slow). The problem is that it stores all the words in memory (so you need at least 1G of free memory). If your data is limited to 1G, that's probably okay. Using something like SQLite/MySQL would solve the memory problem but would require disk access, which is a lot slower; so what "efficiency" are you looking for? Memory-efficient? CPU-efficient? Disk-efficient? Time-efficient? – Martin Tournoij Mar 08 '16 at 02:36

8 Answers


The most succinct approach is to use the tools Python gives you.

from future_builtins import map  # Only on Python 2

from collections import Counter
from itertools import chain

def countInFile(filename):
    with open(filename) as f:
        return Counter(chain.from_iterable(map(str.split, f)))

That's it. map(str.split, f) makes a generator that returns a list of words from each line. Wrapping it in chain.from_iterable converts that into a single generator that produces one word at a time. Counter takes an iterable input and counts all of its unique values. At the end, you return a dict-like object (a Counter) that stores all unique words and their counts; during creation, you only hold one line of data at a time plus the running totals, not the whole file at once.

In theory, on Python 2.7 and 3.1, you might do slightly better looping over the chained results yourself and using a dict or collections.defaultdict(int) to count (because Counter is implemented in Python, which can make it slower in some cases), but letting Counter do the work is simpler and more self-documenting (I mean, the whole goal is counting, so use a Counter). Beyond that, on CPython (the reference interpreter) 3.2 and higher Counter has a C level accelerator for counting iterable inputs that will run faster than anything you could write in pure Python.
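For reference, a minimal sketch of that manual-loop alternative, assuming the same line-by-line streaming (the function name count_in_file_manually is just for illustration):

from collections import defaultdict
from itertools import chain

def count_in_file_manually(filename):
    counts = defaultdict(int)  # missing keys start at 0
    with open(filename) as f:
        for word in chain.from_iterable(map(str.split, f)):
            counts[word] += 1  # one increment per word, no intermediate objects
    return counts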

Update: You seem to want punctuation stripped and case-insensitivity, so here's a variant of my earlier code that does that:

from string import punctuation

def countInFile(filename):
    with open(filename) as f:
        linewords = (line.translate(None, punctuation).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))
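The str.translate(None, punctuation) call above is the Python 2 form. If you are on Python 3, a roughly equivalent variant (my adaptation, not part of the original answer) builds a deletion table with str.maketrans:

from collections import Counter
from itertools import chain
from string import punctuation

def countInFile(filename):
    strip_punct = str.maketrans('', '', punctuation)  # translation table that deletes punctuation
    with open(filename) as f:
        linewords = (line.translate(strip_punct).lower().split() for line in f)
        return Counter(chain.from_iterable(linewords))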

Your code runs much more slowly because it's creating and destroying many small Counter and set objects, rather than .update-ing a single Counter once per line (which, while slightly slower than what I gave in the updated code block, would be at least algorithmically similar in scaling factor).
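A minimal sketch of that single-Counter, update-per-line rewrite of the question's function (punctuation stripping omitted for brevity):

from collections import Counter

def get_term_frequency_in_file(source_file_path):
    wordcount = Counter()
    with open(source_file_path) as f:
        for line in f:
            # One in-place update per line instead of building and merging new dicts each time.
            wordcount.update(line.lower().split())
    return wordcount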

ShadowRanger
  • I've found that (in C-Python) `defaultdict(int)` is faster than `Counter` in Python 2, but the other way around in Python 3. BTW, this is an excellent answer. Whatever happened to up-voting on this site? – mhawke Mar 08 '16 at 03:59
  • Thank you @ShadowRanger. Your code works perfectly! But please see my code in the updated question; I've also used `Counter`. What's wrong with my code? – Light Yagmi Mar 08 '16 at 05:22
  • @rkjt50r983: Well, among other things, creating many `Counter`s and combining them is a lot more costly than creating one; if you don't like the overly concise code I provided, I'd still suggest creating a single `Counter` and calling `.update` on it with the words from each line, which increases the counts for the single `Counter` in place, rather than creating whole new `Counter`s and combined `dict`s at each step. – ShadowRanger Mar 08 '16 at 19:49
  • @mattsap: `str.split` with no arguments splits on runs of whitespace, and does not return empty groups at the ends when the string begins or ends with whitespace, making it effectively a `strip` followed by `split` on runs of whitespace. Also, side-note, the ordering on Windows is `\r\n`, not `\n\r`, though `str.rstrip` is order insensitive, so either order works when you're trying to strip newlines (but no other whitespace) from an input line. – ShadowRanger Mar 16 '16 at 19:37
  • @mhawke: Late update: I went and checked; as of Python 3.2, `Counter` has a [C-accelerated helper function for updating itself by counting an input iterable](https://github.com/python/cpython/blob/3.2/Lib/collections.py#L396), which is almost certainly responsible for the speedup relative to `defaultdict(int)`. I hadn't noticed it because at the time I wrote this, I was looking at 2.7 code (the OP's code uses a 2.x version of `str.translate`). Nice to know you don't sacrifice any speed for `Counter`'s convenience anymore. – ShadowRanger Oct 27 '17 at 20:48

A memory-efficient and accurate way is to make use of:

  • CountVectorizer in scikit-learn (for ngram extraction)
  • NLTK for word_tokenize
  • numpy matrix sum to collect the counts
  • collections.Counter for collecting the counts and vocabulary

An example:

import urllib.request
from collections import Counter

import numpy as np 

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Our sample textfile.
url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
response = urllib.request.urlopen(url)
data = response.read().decode('utf8')


# Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
# X matrix where each row represents a sentence and each column the count of a token from our vocabulary
X = ngram_vectorizer.fit_transform(data.split('\n'))

# Vocabulary
vocab = list(ngram_vectorizer.get_feature_names())

# Column-wise sum of the X matrix.
# It's some crazy numpy syntax that looks horribly unpythonic
# For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
# and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
counts = X.sum(axis=0).A1

freq_distribution = Counter(dict(zip(vocab, counts)))
print (freq_distribution.most_common(10))

[out]:

[(',', 32000),
 ('.', 17783),
 ('de', 11225),
 ('a', 7197),
 ('que', 5710),
 ('la', 4732),
 ('je', 4304),
 ('se', 4013),
 ('на', 3978),
 ('na', 3834)]

Essentially, you can also do this:

from collections import Counter
import numpy as np 
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    vocab = list(ngram_vectorizer.get_feature_names())
    counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

Let's time it:

import time

start = time.time()
word_distribution = freq_dist(data)
print (time.time() - start)

[out]:

5.257147789001465

Note that CountVectorizer can also take a file instead of a string and there's no need to read the whole file into memory. In code:

import io
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/input.txt'

ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)

with io.open(infile, 'r', encoding='utf8') as fin:
    X = ngram_vectorizer.fit_transform(fin)
    vocab = ngram_vectorizer.get_feature_names()
    counts = X.sum(axis=0).A1
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print (freq_distribution.most_common(10))
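Side note: get_feature_names() has since been removed in newer scikit-learn releases; if the calls above fail on your version, the replacement is get_feature_names_out(), e.g.:

vocab = ngram_vectorizer.get_feature_names_out()  # scikit-learn >= 1.0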
alvas

This should suffice.

def countinfile(filename):
    d = {}
    with open(filename, "r") as fin:
        for line in fin:
            words = line.strip().split()
            for word in words:
                try:
                    d[word] += 1
                except KeyError:
                    d[word] = 1
    return d
Goodies
  • FYI, no need to `strip()` before `split`-ing when the `split` isn't given arguments; no-arg `split` already ignores leading and trailing whitespace. – ShadowRanger Mar 08 '16 at 02:31

Here are some benchmarks. It may look strange, but the crudest code wins.

[code]:

from collections import Counter, defaultdict
import io, time

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

infile = '/path/to/file'

def extract_dictionary_sklearn(file_path):
    with io.open(file_path, 'r', encoding='utf8') as fin:
        ngram_vectorizer = CountVectorizer(analyzer='word')
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
    return Counter(dict(zip(vocab, counts)))

def extract_dictionary_native(file_path):
    dictionary = Counter()
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            dictionary.update(line.split())
    return dictionary

def extract_dictionary_paddle(file_path):
    dictionary = defaultdict(int)
    with io.open(file_path, 'r', encoding='utf8') as fin:
        for line in fin:
            for word in line.split():
                dictionary[word] += 1
    return dictionary

start = time.time()
extract_dictionary_sklearn(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_native(infile)
print(time.time() - start)

start = time.time()
extract_dictionary_paddle(infile)
print(time.time() - start)

[out]:

38.306814909
24.8241138458
12.1182529926

Data size (154MB) used in the benchmark above:

$ wc -c /path/to/file
161680851

$ wc -l /path/to/file
2176141

Some things to note:

  • With the sklearn version, there's an overhead of vectorizer creation + numpy manipulation and conversion into a Counter object
  • With the native Counter update version, it seems like Counter.update() is an expensive operation
nat gillin

Instead of decoding the whole byte string read from the URL, I process the binary data. Because bytes.translate expects its second argument to be a byte string, I utf-8 encode punctuation. After removing punctuation, I utf-8 decode the byte string.

The function freq_dist expects an iterable. That's why I've passed data.splitlines().

from urllib2 import urlopen
from collections import Counter
from string import punctuation
from time import time
import sys
from pprint import pprint

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'

data = urlopen(url).read()

def freq_dist(data):
    """
    :param data: file-like object opened in binary mode or
                 sequence of byte strings separated by '\n'
    :type data: an iterable sequence
    """
    #For readability   
    #return Counter(word for line in data
    #    for word in line.translate(
    #    None,bytes(punctuation.encode('utf-8'))).decode('utf-8').split())

    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    return Counter(words)


start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(word_dist.most_common(10))

Output:

elapsed: 0.806480884552

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

It seems a plain dict is more efficient than a Counter object.

def freq_dist(data):
    """
    :param data: A string with sentences separated by '\n'
    :type data: str
    """
    d = {}
    punc = punctuation.encode('utf-8')
    words = (word for line in data for word in line.translate(None, punc).decode('utf-8').split())
    for word in words:
        d[word] = d.get(word, 0) + 1
    return d

start = time()
word_dist = freq_dist(data.splitlines())
print('elapsed: {}'.format(time() - start))
pprint(sorted(word_dist.items(), key=lambda x: (x[1], x[0]), reverse=True)[:10])

Output:

elapsed: 0.642680168152

[(u'de', 11106),
 (u'a', 6742),
 (u'que', 5701),
 (u'la', 4319),
 (u'je', 4260),
 (u'se', 3938),
 (u'\u043d\u0430', 3929),
 (u'na', 3623),
 (u'da', 3534),
 (u'i', 3487)]

To be more memory efficient when processing a huge file, you can pass just the opened URL instead. But then the timing will include the file download time too.

data = urlopen(url)
word_dist = freq_dist(data)
Nizam Mohamed

Skip CountVectorizer and scikit-learn.

The file may be too large to load into memory, but I doubt the Python dictionary will get too large. The easiest option may be to split the large file into 10-20 smaller files and extend your code to loop over them.
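A minimal sketch of that approach, assuming the file has already been split and using a Counter to merge the per-file counts (the function name and chunk filenames are hypothetical):

from collections import Counter

def count_words_in_files(file_paths):
    total = Counter()
    for path in file_paths:  # e.g. the 10-20 smaller files
        with open(path) as f:
            for line in f:
                total.update(line.lower().split())
    return total

# Hypothetical usage:
# counts = count_words_in_files(['chunk_00.txt', 'chunk_01.txt', 'chunk_02.txt'])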

Stephen Grimes

You can try with sklearn:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

data = ['i am student', 'the student suffers a lot']
transformed_data = vectorizer.fit_transform(data)
# Sum each column of the document-term matrix to get per-token counts.
vocab = {a: b for a, b in zip(vectorizer.get_feature_names(), np.ravel(transformed_data.sum(axis=0)))}
print(vocab)

Combining everyone else's views and some of my own :) here is what I have for you:

from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords

text='''Note that if you use RegexpTokenizer option, you lose 
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common

Output

(All ones)

[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]

One can do better than this in terms of efficiency, but if you are not too worried about that, this code is as good as any.
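For comparison, a minimal sketch of the naive regex split mentioned in the sample text above, with no NLTK involved (count_words_regex is just an illustrative name):

import re
from collections import Counter

def count_words_regex(text, top_n=20):
    # Lowercase, then pull out runs of word characters; no contraction handling or stop-word removal.
    words = re.findall(r'\w+', text.lower())
    return Counter(words).most_common(top_n)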

Pradeep Singh