Most common n words in a text

Question

I am currently learning to work with NLP. One of the problems I am facing is finding the n most common words in a text. Consider the following:

text=['Lion Monkey Elephant Weed','Tiger Elephant Lion Water Grass','Lion Weed Markov Elephant Monkey Fine','Guard Elephant Weed Fortune Wolf']

Suppose n = 2. I am not looking for the most common bigrams. I am looking for the 2-word combinations that occur together most often in the text. For example, the output for the above should be:

'Lion' & 'Elephant': 3
'Elephant' & 'Weed': 3
'Lion' & 'Monkey': 2
'Elephant' & 'Monkey': 2

and so on.

Could anyone suggest a suitable way to tackle this?

I don't know much about NLP, but basket-analysis could be something that might do the trick. You can consider each sentence a basket and each word an item — CutePoison, Aug 14 '20 at 09:15
Two reference threads, [this one](https://stackoverflow.com/questions/4634787) and [the other one](https://stackoverflow.com/questions/19145332). — bad_coder, Aug 14 '20 at 09:20
Looping and list slicing would be time-consuming when working with large volumes, right? — DareDevilNoob, Aug 14 '20 at 09:39
Does this answer your question? [Co-occurrence Matrix from list of words in Python](https://stackoverflow.com/questions/42814452/co-occurrence-matrix-from-list-of-words-in-python) — Riccardo Bucco, Aug 14 '20 at 10:14

score 1 · Answer 1 · answered Aug 14 '20 at 10:10


It was tricky, but I solved it for you. I count the spaces in each element to tell how many words it has: an element with 3 words must contain 2 spaces. With the check below, only elements with exactly 2 words (1 space) are printed.

l = ["hello world", "good night world", "good morning sunshine", "wassap babe"]
for elem in l:
    if elem.count(" ") == 1:
        print(elem)

output

hello world
wassap babe


My bad, I did not understand your request correctly. Well, you can use string.count() and format the output to get your desired result. – Aug 14 '20 at 10:17

sabacherli · Accepted Answer · 2020-08-14T14:44:17.760

I would suggest using Counter and combinations as follows.

from collections import Counter
from itertools import combinations, chain

text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass', 'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']


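def count_combinations(text, n_words, n_most_common=None):
    # Collect every n-word combination found in each sentence.
    count = []
    for t in text:
        words = t.split()
        combos = combinations(words, n_words)
        # Sort the words inside each combination so that word order does not matter.
        count.append([" & ".join(sorted(c)) for c in combos])
    # Flatten, count, and keep the n_most_common entries (all of them if None).
    return dict(Counter(sorted(chain.from_iterable(count))).most_common(n_most_common))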

count_combinations(text, 2)
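
One edge case worth noting: if a word is repeated within a single sentence, combinations will count the same pair more than once for that sentence. The following is a minimal sketch of a variant that counts each combination at most once per sentence, assuming that is the behaviour you want. The name count_cooccurrences is a hypothetical helper for illustration, and it returns a Counter keyed by tuples rather than by joined strings.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences, n_words=2):
    """Count how many sentences contain each n-word combination.

    Hypothetical helper: assumes words are separated by whitespace and
    that a combination should count at most once per sentence.
    """
    counts = Counter()
    for sentence in sentences:
        # set() drops repeated words; sorted() makes each combination
        # independent of the order in which the words appear.
        unique_words = sorted(set(sentence.split()))
        counts.update(combinations(unique_words, n_words))
    return counts


text = ['Lion Monkey Elephant Weed', 'Tiger Elephant Lion Water Grass',
        'Lion Weed Markov Elephant Monkey Fine', 'Guard Elephant Weed Fortune Wolf']

# The two most frequent pairs and their counts.
print(count_cooccurrences(text, 2).most_common(2))
```

For the sample text, the pairs ('Elephant', 'Lion') and ('Elephant', 'Weed') each appear in 3 sentences, which matches the counts asked for in the question.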
