Find the nth most common word and count in python

Question

I am an undergraduate student who is new here and loves programming. I ran into a problem while practicing and would like to ask for help.

Given a string and an integer n, return the nth most common word and its count, ignoring capitalization.

For the word, make sure all the letters are lowercase when you return it!

Hint: The split() function and dictionaries may be useful.

Example:

Input: "apple apple apple blue BlUe call", 2

Output: The list ["blue", 2]

My code is as follows:

from collections import Counter
def nth_most(str_in, n):
    split_it = str_in.split(" ")
    array = []
    for word, count in Counter(split_it).most_common(n):
        list = [word, count]
        array.append(count)
        array.sort()
        if len(array) - n <= len(array) - 1:
            c = array[len(array) - n]
            return [word, c]

The tests fail as follows:

Traceback (most recent call last):
  File "/grade/run/test.py", line 10, in test_one
    self.assertEqual(nth_most('apple apple apple blue blue call', 3), ['call', 1])
  File "/grade/run/bin/nth_most.py", line 10, in nth_most
    c = array[len(array) - n]
IndexError: list index out of range

and also

Traceback (most recent call last):
  File "/grade/run/test.py", line 20, in test_negative
    self.assertEqual(nth_most('awe Awe AWE BLUE BLUE call', 1), ['awe', 3])
AssertionError: Lists differ: ['BLUE', 2] != ['awe', 3]

First differing element 0:
'BLUE'
'awe'

I don't know what's wrong with my code.

Thank you very much for your help!

@Jean-FrançoisFabre, the question is about finding the nth most common word. For the test case mentioned in the question, n=2, and blue occurs twice, hence is the output. — taurus05, Feb 13 '19 at 06:48
@Larry Chen, you may mark an answer that helped you solve your problem. – DirtyBit, Feb 14 '19 at 07:11

Jean-François Fabre · Accepted Answer · 2019-02-13T07:07:18.453

Since you're using Counter, just use it wisely:

import collections

def nth_most(str_in, n):
    counts = collections.Counter(w.lower() for w in str_in.split())
    c = sorted(counts.items(), key=lambda x: x[1])
    return list(c[-n])  # convert to list as it seems to be the expected output

print(nth_most("apple apple apple blue BlUe call",2))

Build the word frequency dictionary, sort items according to values (2nd element of the tuple) and pick the nth last element.

This prints ['blue', 2].

What if two words are tied for the same frequency, for example in first or second position? The solution above then returns only one of them. Instead, sort the occurrence counts, take the nth largest, and run through the counter dict again to collect every word with that count.

def nth_most(str_in, n):
    c = collections.Counter(w.lower() for w in str_in.split())
    nth_occs = sorted(c.values())[-n]
    return [[k,v] for k,v in c.items() if v==nth_occs]

print(nth_most("apple apple apple call blue BlUe call woot",2))

This time it prints:

[['call', 2], ['blue', 2]]
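
One detail worth noting: `sorted(c.values())[-n]` ranks the counts with duplicates included, so for the same input n=3 still selects the count 2. If "nth" is instead meant as the nth distinct count, a minimal sketch could look like the following. The name `nth_most_distinct` and the empty-list result for an out-of-range n are assumptions made for illustration, not part of the original answer.

```python
from collections import Counter

def nth_most_distinct(str_in, n):
    """Return every [word, count] pair whose count is the nth largest distinct count."""
    counts = Counter(w.lower() for w in str_in.split())
    # Deduplicate the counts so that tied words share a single rank.
    distinct = sorted(set(counts.values()), reverse=True)
    if not 1 <= n <= len(distinct):
        return []  # assumption: an out-of-range rank yields an empty list
    target = distinct[n - 1]
    return [[word, count] for word, count in counts.items() if count == target]

text = "apple apple apple call blue BlUe call woot"
print(nth_most_distinct(text, 2))  # [['call', 2], ['blue', 2]]
print(nth_most_distinct(text, 3))  # [['woot', 1]]
```

Which of the two interpretations is wanted depends on how the grader defines "nth" when there are ties.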

Is there any performance benefit to using `c.most_common()` instead of `sorted`? – Sandeep, Feb 13 '19 at 07:13
`most_common` probably does the same thing, but it doesn't output the required answer, so it needs some post-processing to filter out the elements, especially if there are ties – Jean-François Fabre, Feb 13 '19 at 08:02

score 3 · Answer 2 · answered Feb 13 '19 at 07:33

Counter returns the most common elements in order, so you can do something like:

list(Counter(str_in.lower().split()).most_common(n)[-1]) # n is nth most common word
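
As a self-contained sketch, the one-liner can be wrapped in a function with the import it needs and run against the inputs from the question. The function name `nth_most` is taken from the question; the rest is one possible arrangement.

```python
from collections import Counter

def nth_most(str_in, n):
    # Lowercase first so that "blue" and "BlUe" are counted together.
    counts = Counter(str_in.lower().split())
    # most_common(n) returns the n highest-count (word, count) pairs,
    # highest first, so the last of them is the nth most common.
    return list(counts.most_common(n)[-1])

print(nth_most("apple apple apple blue BlUe call", 2))  # ['blue', 2]
print(nth_most("apple apple apple blue blue call", 3))  # ['call', 1]
print(nth_most("awe Awe AWE BLUE BLUE call", 1))        # ['awe', 3]
```

If n is larger than the number of distinct words, this returns the least common word rather than raising, and an empty input string raises an IndexError, so callers may want to check for those cases.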

kederrac

Why not `Counter(s.lower().split()).most_common()[n-1]` ? Also `from collections import Counter`. – dani herrera Feb 13 '19 at 07:45
If you use most_common()[n-1] you get an O(n log n) algorithm; if you use most_common(k) you get an O(n log k) algorithm (check this [link](https://stackoverflow.com/questions/29240807/python-collections-counter-most-common-complexity)) – kederrac Feb 13 '19 at 07:59
As a good practice in Python, do not reinvent the wheel, especially when it comes to the Python standard library – kederrac Feb 13 '19 at 08:11
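
To make the complexity remark concrete, here is a small sketch comparing a full sort with a heap-based selection. As far as I know, CPython implements `most_common(n)` with `heapq.nlargest` when n is given, but treat that as an implementation detail rather than a documented guarantee.

```python
import heapq
from collections import Counter
from operator import itemgetter

counts = Counter("apple apple apple blue blue call".split())

# Full sort of all items, then index: O(m log m) for m distinct words.
by_full_sort = sorted(counts.items(), key=itemgetter(1), reverse=True)[2 - 1]

# Heap-based selection of only the top 2 items.
by_heap = heapq.nlargest(2, counts.items(), key=itemgetter(1))[-1]

print(by_full_sort)               # ('blue', 2)
print(by_heap)                    # ('blue', 2)
print(counts.most_common(2)[-1])  # ('blue', 2)
```

For the short strings in this question the difference is negligible; it only matters when there are many distinct words and n is small.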

DirtyBit · Answer 3 · 2019-02-13T07:15:05.033

def nth_common(lowered_words, check):
    m = []
    for i in lowered_words:
        m.append((i, lowered_words.count(i)))
    for i in set(m):
        # print(i)
        if i[1] == check:  # i[1] is the occurrence count stored in the tuple
            print(i, "found")
    del m[:]  # empty the list (optional here, since m is local to the function)


words = ['apple', 'apple', 'apple', 'blue', 'BLue', 'call', 'cAlL']
lowered_words = [x.lower() for x in words]   # ignoring the uppercase
check = 2   # the check

nth_common(lowered_words, check)

OUTPUT:

('blue', 2) found
('call', 2) found
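
Note that `check` here is an occurrence count to match, not a rank, and that calling `list.count` inside the loop rescans the whole list for every word. A rough equivalent built on `Counter` counts everything in one pass and returns the matches in a fixed order; the name `words_with_count` is hypothetical.

```python
from collections import Counter

def words_with_count(words, check):
    """Return (word, count) pairs whose count equals `check`, ignoring case."""
    counts = Counter(w.lower() for w in words)  # one pass over the words
    return [(word, count) for word, count in counts.items() if count == check]

words = ['apple', 'apple', 'apple', 'blue', 'BLue', 'call', 'cAlL']
print(words_with_count(words, 2))  # [('blue', 2), ('call', 2)]
```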

score 1 · Answer 4 · edited Feb 13 '19 at 07:02

Traceback (most recent call last):
  File "/grade/run/test.py", line 10, in test_one
    self.assertEqual(nth_most('apple apple apple blue blue call', 3), ['call', 1])
  File "/grade/run/bin/nth_most.py", line 10, in nth_most
    c = array[len(array) - n]
IndexError: list index out of range

To avoid this "list index out of range" error, pre-fill the list with zeros:

maxN = 1000 #change according to your max length
array = [ 0 for _ in range( maxN ) ]
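
For context, the error in the traceback can be reproduced by tracing the first loop iteration of the question's code by hand. The sketch below does only that; it is not a fix.

```python
from collections import Counter

words = "apple apple apple blue blue call".split(" ")
n = 3

array = []
first_word, first_count = Counter(words).most_common(n)[0]
array.append(first_count)  # array == [3] after the first iteration

index = len(array) - n     # 1 - 3 == -2
try:
    array[index]
except IndexError as exc:
    print(index, exc)      # -2 list index out of range
```

The guard `len(array) - n <= len(array) - 1` simplifies to `n >= 1`, so it never prevents this access. Padding the list avoids the exception, but the element read may then be one of the padding zeros rather than a real count, so the result is still worth checking against the tests.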

edited Feb 13 '19 at 07:02 · Stephen Rauch

answered Feb 13 '19 at 06:50 · Nehhal Kalnad

just `array = [0] * maxN` – Jean-François Fabre Oct 16 '22 at 10:17

Pradeep Pandey · Answer 5 · 2019-02-13T08:31:20.063

You can even do it without the collections module:

import re

paragraph = 'Nory was a Catholic because her mother was a Catholic, and Nory’s mother was a Catholic because her father was a Catholic, and her father was a Catholic because his mother was a Catholic, or had been'

def nth_common(n, p):
    # Split on runs of non-word characters so punctuation does not stick to words.
    words = re.split(r'\W+', p.lower())
    word_count = {}
    for word in words:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1

    # Sort (word, count) pairs by count, highest first.
    sorted_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)

    return sorted_count[n - 1]

print(nth_common(3, paragraph))

The output is ('catholic', 6).

The word counts, sorted by count, are: [('was', 6), ('a', 6), ('catholic', 6), ('because', 3), ('her', 3), ('mother', 3), ('nory', 2), ('and', 2), ('father', 2), ('s', 1), ('his', 1), ('or', 1), ('had', 1), ('been', 1)]

edited Feb 13 '19 at 08:31

answered Feb 13 '19 at 07:28

This would return `('catholic', 3)`, which is incorrect since the word `catholic` occurs 6 times. The correct output should have been `('mother', 3)` – DirtyBit Feb 13 '19 at 07:31
The earlier version split only on spaces, and some occurrences of "Catholic" are followed by a comma, which is why they were counted as a separate word. Replace p.lower().split() with re.split('\W+', p.lower()) and catholic gets a count of 6. In this example three words are tied at 6, so it returns one of them – Pradeep Pandey Feb 13 '19 at 08:29
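
The tokenization difference discussed in these comments can be seen on a shortened sample (the sentence fragment below is cut down from the answer's paragraph for illustration):

```python
import re

text = "was a Catholic, and her father was a Catholic because"

by_space = text.lower().split()
by_regex = re.split(r"\W+", text.lower())

print(by_space.count("catholic"))  # 1  ("catholic," is a different token)
print(by_regex.count("catholic"))  # 2
```

One caveat: `re.split` can produce empty strings at the start or end of the list when the text begins or ends with punctuation, so those may need filtering out before counting.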
