
Hi, so I have two text files. I have to read the first text file, count the frequency of each word, remove duplicates, and create a list of lists with each word and its count in the file.

My second text file contains keywords. I need to count the frequency of these keywords in the first text file and return the result, without using any imports, dicts, or zips.

I am stuck on how to go about this second part. I have the file open and have removed punctuation etc., but I have no clue how to find the frequency. I played around with the idea of `.find()`, but no luck as of yet.

Any suggestions would be appreciated. This is my code at the moment; it seems to find the frequency of the keywords in the keyword file, but not in the first text file.

def calculateFrequenciesTest(aString):

    listKeywords = aString
    listSize = len(listKeywords)
    keywordCountList = []

    while listSize > 0:
        targetWord = listKeywords[0]
        count = 0
        for i in range(0, listSize):
            if targetWord == listKeywords[i]:
                count = count + 1

        wordAndCount = []
        wordAndCount.append(targetWord)
        wordAndCount.append(count)

        keywordCountList.append(wordAndCount)

        for i in range(0, count):
            listKeywords.remove(targetWord)
        listSize = len(listKeywords)

    sortedFrequencyList = readKeywords(keywordCountList)

    return keywordCountList

EDIT: Currently toying around with the idea of reopening my first file again, but this time without turning it into a list. I think my errors are somehow coming from counting the frequency of my list of lists. These are the types of results I am getting:

[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
  • What you can do is go through each of your keywords and, if that keyword exists in your `frequency list`, increment it at that index. – RPT May 21 '17 at 11:54
  • this is basically what I wanted to do, but I tried a few different ways and it wasn't working :s – Jessica Issanchon May 21 '17 at 12:09
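The approach suggested in the comments can be sketched as follows: given a frequency list of `[word, count]` pairs built from the first file, scan it for each keyword and read off the stored count (the sample data below is illustrative, not from the actual files):

```python
# Frequency list built from the first file: [word, count] pairs.
frequency_list = [['the', 66], ['of', 32], ['and', 27]]
keywords = ['the', 'and', 'castle']

# For each keyword, scan the frequency list; a keyword that never
# occurs in the first file gets a count of 0.
keyword_counts = []
for keyword in keywords:
    count = 0
    for pair in frequency_list:
        if pair[0] == keyword:
            count = pair[1]
            break
    keyword_counts.append([keyword, count])

print(keyword_counts)  # [['the', 66], ['and', 27], ['castle', 0]]
```
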

3 Answers


You can try something like this, taking a list of words as an example:

word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
    if word not in frequency_list:
        frequency_list[word] = 1
    else:
        frequency_list[word] += 1
print(frequency_list)

RESULT: {'test': 1, 'world': 1, 'hello': 2}

Since you have put a constraint on dicts, I have used two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.

word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
    if word not in frequency_word:
        frequency_word.append(word)
        frequency_list.append(1)
    else:
        ind = frequency_word.index(word)
        frequency_list[ind] += 1

print(frequency_word)
print(frequency_list)

RESULT : ['hello', 'world', 'test']
         [2, 1, 1]

You can change it or re-factor it as you wish.
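To answer the second part of the question with this structure, you can look up a single keyword's count via `frequency_word.index`; a small sketch building on the two parallel lists above:

```python
# Parallel lists as built above: frequency_word[i] occurred
# frequency_list[i] times in the source text.
frequency_word = ['hello', 'world', 'test']
frequency_list = [2, 1, 1]

def keyword_count(word):
    # index() raises ValueError for a missing word, so guard with `in`
    # and report 0 for keywords that never occur in the source.
    if word in frequency_word:
        return frequency_list[frequency_word.index(word)]
    return 0

print(keyword_count('hello'))   # 2
print(keyword_count('absent'))  # 0
```
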

RPT

I agree with @bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.

From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.

Clarifying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dicts, or zips?

So here's a proposal utilizing built in functionality (no third party) for what it's worth (tested with Python 2):

#!/usr/bin/python

import re           # String matching
import collections  # collections.Counter basically solves your problem


def loadwords(s):
    """Find the words in a long string.

    Words are separated by whitespace. Typical signs are ignored.

    """
    return (s
            .replace(".", " ")
            .replace(",", " ")
            .replace("!", " ")
            .replace("?", " ")
            .lower()).split()


def loadwords_re(s):
    """Find the words in a long string.

    Words are separated by whitespace. Only characters and ' are allowed in strings.

    """
    return (re.sub(r"[^a-z']", " ", s.lower())
            .split())


# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")

# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))

# Count for every word in sourcefile_words, ignoring your keywords
wordcount_all = collections.Counter(sourcefile_words)

# Lookup word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"] # returns 3

# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
                                         for word in sourcefile_words
                                         if word in keywords)

count_and = wordcount_keywords["and"] # Returns 2
all_counted_keywords = wordcount_keywords.keys() # Returns ['a', 'and', 'the', 'of']
Teodor

Here is a solution with no imports. It uses nested linear searches, which are acceptable for a small number of searches over a small input list, but will become unwieldy and slow with larger inputs.

Still, the input here is quite large, and it handles it in reasonable time. I suspect that if your keywords file were larger (mine has only 3 words) the slowdown would start to show.

Here we take an input file, iterate over the lines, remove punctuation, then split on spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together, then iterate over it, creating a new list containing each string and a count. We can do this by incrementing the count as long as the same word appears in the list, and moving to a new entry when a new word is seen.

Now you have your word frequency list and you can search it for the required keyword and retrieve the count.

The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.

This is Python 3 code; comments indicate, where applicable, how to modify it for Python 2.

# use string.punctuation if you are somehow allowed 
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')

words = []
with open('hamlet.txt') as f:
    for line in f:
        if line:
            line = line.translate(translator)
            # py 2 alternative
            #line = line.translate(None, string.punctuation)
            words.extend(line.strip().split())

# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()

thisword = ''
counts = []

# for each word in the list add to the count as long as the 
# word does not change
for w in words:
    if w != thisword:
        counts.append([w, 1])
        thisword = w
    else:
        counts[-1][1] += 1

for c in counts:
    print('%s (%d)' % (c[0], c[1]))

# function to prevent need to break out of nested loop
def findword(clist, word):
    for c in clist:
        if c[0] == word:
            return c[1]
    return 0   

# open keywords file and search for each word in the 
# frequency list.
with open('keywords.txt') as f2:
    for line in f2:
        if line:
            word = line.strip()
            thiscount = findword(counts, word)
            print('keyword %s appear %d times in source' % (word, thiscount))

If you were so inclined, you could modify findword to use a binary search, but it's still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions. It's quicker and less code.
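Since `counts` is built from a sorted word list, it is already ordered alphabetically, so the binary-search variant needs no `bisect` import; a sketch with a small illustrative `counts` list:

```python
def findword_binary(clist, word):
    # clist is sorted by word (it was built from a sorted word list),
    # so we can halve the search range on each comparison.
    lo, hi = 0, len(clist) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if clist[mid][0] == word:
            return clist[mid][1]
        elif clist[mid][0] < word:
            lo = mid + 1
        else:
            hi = mid - 1
    return 0

# Illustrative frequency list, alphabetically sorted as above.
counts = [['and', 3], ['castle', 2], ['the', 10]]
print(findword_binary(counts, 'castle'))  # 2
print(findword_binary(counts, 'ghost'))   # 0
```
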

Paul Rooney