
I am trying to find the frequency of each word in a .txt file and then sort the words by their number of occurrences.

So far, I have completed 90% of the task. What is left is to sort the counts in descending order.

Here is my code:

import re


def frequency_check(lines):
    print("Frequency of words in file")
    words = re.findall(r"\w+", lines)
    item_list = []

    for item in words:
        if item not in item_list:
            item_count = words.count(item)
            print("{} : {} times".format(item, item_count))
            item_list.append(item)


with open("original-3.txt", 'r') as file1:
    lines = file1.read().lower()
    frequency_check(lines)

This is the .txt file on which I am computing the word frequency:

[image of the input file, not reproduced here]

Here's the output I get:

Frequency of words in file
return : 2 times
all : 1 times
non : 1 times
overlapping : 1 times
matches : 3 times
of : 5 times
pattern : 3 times
in : 4 times
string : 2 times
as : 1 times
a : 3 times
list : 3 times
strings : 1 times
the : 6 times
is : 1 times
scanned : 1 times
left : 1 times
to : 1 times
right : 1 times
and : 1 times
are : 3 times
returned : 1 times
order : 1 times
found : 1 times
if : 2 times
one : 2 times
or : 1 times
more : 2 times
groups : 2 times
present : 1 times
this : 1 times
will : 1 times
be : 1 times
tuples : 1 times
has : 1 times
than : 1 times
group : 1 times
empty : 1 times
included : 1 times
result : 1 times
unless : 1 times
they : 1 times
touch : 1 times
beginning : 1 times
another : 1 times
match : 1 times

Process finished with exit code 0

The remaining challenge is to sort these and print them from the highest number of occurrences to the lowest.

PS: I thought about using dictionaries. However, dictionaries are immutable and I can't use the sort method on them.

Any ideas?

Thank you very much

Burakhan Aksoy

2 Answers


I agree with @lolu that you should use a dictionary, but if you still want to use a list, here is a solution:

import re


def frequency_check(lines):
    print("Frequency of words in file")
    words = re.findall(r"\w+", lines)
    unique_words = set(words)
    item_list = []

    for item in unique_words:
        item_count = words.count(item)
        item_list.append((item, item_count))

    item_list.sort(key=lambda t: (t[1], t[0]), reverse=True)
    for item, item_count in item_list:
        print("{} : {} times".format(item, item_count))


with open("original-3.txt", 'r') as file1:
    lines = file1.read().lower()
    frequency_check(lines)
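
One detail about the sort above: because reverse=True applies to the whole (count, word) tuple, words that share a count come out in reverse alphabetical order. If alphabetical order among ties is preferred, negating the count is one option. Here is a minimal sketch on a made-up word list; the sample data and the name sort_by_count are illustrative and not part of the question:

```python
def sort_by_count(words):
    """Return (word, count) pairs, highest count first, ties alphabetical."""
    pairs = [(word, words.count(word)) for word in set(words)]
    # Negating the count sorts it descending while the word stays ascending.
    pairs.sort(key=lambda t: (-t[1], t[0]))
    return pairs


sample = ["the", "of", "the", "a", "of", "the", "in"]  # hypothetical input
for word, count in sort_by_count(sample):
    print("{} : {} times".format(word, count))
```
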

And a much better implementation using collections.Counter:

import re
from collections import Counter


def frequency_check(lines):
    print("Frequency of words in file")
    words = re.findall(r"\w+", lines)
    word_counts = Counter(words)
    for item, item_count in word_counts.most_common():
        print("{} : {} times".format(item, item_count))


with open("original-3.txt", 'r') as file1:
    lines = file1.read().lower()
    frequency_check(lines)
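
To try the Counter approach without the input file, here is a small sketch on an in-memory string. The sample text is made up. most_common also accepts an optional number n to return only the n highest counts:

```python
import re
from collections import Counter

text = "the cat and the dog and the bird"  # made-up sample text
word_counts = Counter(re.findall(r"\w+", text.lower()))

# most_common(n) returns the n highest counts as (word, count) pairs.
top_two = word_counts.most_common(2)
print(top_two)
```
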
Asocia

I still think you should use a dictionary; dictionaries are mutable. However, to get your exact output, you can use the built-in sorted function, which works on lists as well as on dictionaries.

For your current list, as you have it:

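lst = ["order : 1 times", "returned : 3 times"]
new_lst = sorted(lst, key=lambda x: int(x.split(" ")[2]), reverse=True)

Note that the count is at index 2 when you split on spaces the way I did. Converting it with int makes the comparison numeric rather than alphabetical, and reverse=True puts the highest count first.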
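
To see why the count should be compared as a number rather than as text, here is a small sketch on made-up lines (the sample data is hypothetical):

```python
lines = ["order : 3 times", "the : 10 times", "of : 5 times"]  # made-up sample

# Compared as strings, "10" sorts before "3" and "5".
as_text = sorted(lines, key=lambda x: x.split(" ")[2])

# Compared as integers, the counts sort numerically; reverse=True puts the largest first.
as_number = sorted(lines, key=lambda x: int(x.split(" ")[2]), reverse=True)

print(as_text)
print(as_number)
```
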

sorted returns a new list. To sort your existing list in place instead, use the sort method that every list has:

lst.sort(key=lambda x: int(x.split(" ")[2]), reverse=True)

If you choose to switch to a dictionary, where (as in my example) the key is the word and the value is its count, you can use this instead:

xs = {"order": 3, "and": 15}
sorted(xs.items(), key=lambda x: x[1], reverse=True)
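
For completeness, here is a minimal sketch of how such a dictionary could be built from text and then ranked. The helper name count_words and the sample sentence are assumptions for illustration, not part of the question:

```python
import re


def count_words(text):
    """Map each lowercase word in text to its number of occurrences."""
    counts = {}
    for word in re.findall(r"\w+", text.lower()):
        # Dictionaries are mutable, so the count can be updated in place.
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words("To be or not to be")  # made-up sample text
# sorted returns a new list of (word, count) pairs; the dictionary is unchanged.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for word, count in ranked:
    print("{} : {} times".format(word, count))
```
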
lolu