I am building word frequency and relative frequency lists for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of texts in which a word occurs. It strikes me that there are two ways to do this:

First, to compile a word frequency dictionary (as below; I'm not using the NLTK FreqDist because this code actually runs more quickly, but if FreqDist has the above functionality built in and I just didn't know it, I'll take it):

import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

# ftexts is the list of text strings, one per file
freq_dic = {}
for text in ftexts:
    words = tokenizer.tokenize(text)
    for word in words:
        # add one to the word's running total, starting from 0 for a new word
        freq_dic[word] = freq_dic.get(word, 0) + 1

From there, I assume I'll need to write another loop that uses the keys above as keywords:

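# This is just scratch code
doc_dic = {}
for keyword in freq_dic:
    count = 0
    for text in ftexts:
        # count each text at most once, however often the word appears in it
        if keyword in tokenizer.tokenize(text):
            count = count + 1
    doc_dic[keyword] = count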
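
Both tallies could also be collected in a single pass over the texts. The following is a minimal sketch rather than a tuned implementation. It assumes `ftexts` is a list of strings (a short stand-in list is used here), and it swaps the NLTK tokenizer for the standard library's `re.findall(r"\w+", ...)`, which should split on the same pattern. `doc_dic` is a hypothetical name for the per-text tally.

```python
import re
from collections import Counter

# Stand-in data; in the question, ftexts holds the contents of the text files.
ftexts = [
    "the cat sat on the mat",
    "the dog sat",
    "a cat and a dog",
]

freq_dic = Counter()  # total occurrences of each word across all texts
doc_dic = Counter()   # number of texts in which each word occurs (hypothetical name)

for text in ftexts:
    words = re.findall(r"\w+", text)  # assumption: stands in for the NLTK RegexpTokenizer
    freq_dic.update(words)            # every occurrence counts
    doc_dic.update(set(words))        # each distinct word counts once per text

for word in sorted(freq_dic):
    print(word, freq_dic[word], doc_dic[word])
```

Wrapping the token list in `set()` is what keeps a word that is repeated within one text from being counted more than once in `doc_dic`.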

And then I'll find some way to mesh these two dictionaries into a list of tuples or, possibly, a pandas dataframe indexed by word, such that:

word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
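
One way the two dictionaries might be meshed is a pandas DataFrame indexed by word. This is a sketch under assumptions: the two small dictionaries are stand-ins for the real tallies, and the names `doc_dic`, `FREQ`, and `TEXTS` are chosen only for illustration.

```python
import pandas as pd

# Stand-in tallies; the real ones would come from the loops over ftexts.
freq_dic = {"count": 2, "chickens": 3, "need": 1}
doc_dic = {"count": 2, "chickens": 1, "need": 1}

# Building each column from a Series aligns the values on the word index.
report = pd.DataFrame({
    "FREQ": pd.Series(freq_dic),
    "TEXTS": pd.Series(doc_dic),
})
report.index.name = "WORD"

# Most frequent words first.
report = report.sort_values("FREQ", ascending=False)
print(report)
```

Because the columns are aligned on the index, a word missing from one dictionary would show up as `NaN` in that column instead of silently shifting the rows.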

The other thing that occurred to me as I was writing this question was to use scikit-learn's term frequency matrix and then count the rows in which a word occurs. Is that possible?

ADDED TO CLARIFY:

Imagine three sentences: ["I need to keep count of the children.", "If you want to know what the count is, just ask.", "There is nothing here but chickens, chickens, chickens."]

"count" occurs twice, in two different texts; "chickens" occurs three times, but in only one text. What I want is a report that looks like this:

WORD, FREQ, TEXTS
count, 2, 2
chickens, 3, 1
  • Possible duplicate of [How to count the occurrences of a list item?](https://stackoverflow.com/questions/2600191/how-to-count-the-occurrences-of-a-list-item) – handle Apr 12 '18 at 21:37
  • Check out [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) which does exactly what you want. – Vivek Kumar Apr 16 '18 at 15:53
  • Thank you, @VivekKumar. I fired up `sklearn` and I figured out what I needed. – John Laudun Jun 01 '18 at 20:43
