
I have a Pandas DataFrame with one text column. I want to count which phrases are the most common in this column. For example, from the text below you can see that phrases like "a very good movie", "last night", etc. appear many times. I think there is a way of defining n-grams, for example saying that a phrase is between 3 and 5 words long, but I do not know how to do that.

import pandas as pd


text = ['this is a very good movie that we watched last night',
        'i have watched a very good movie last night',
        'i love this song, its amazing',
        'what should we do if he asks for it',
        'movie last night was amazing',
        'a very nice song was played',
        'i would like to se a good show',
        'a good show was on tv last night']

df = pd.DataFrame({"text":text})
print(df)

So my goal is to rank the phrases (3-5 words) that appear many times.

taga

1 Answer


First split the text in a list comprehension and flatten it to vals, then create the n-grams, pass them to a Series, and finally use Series.value_counts:

from nltk import ngrams

# flatten all rows into one list of words
vals = [y for x in df['text'] for y in x.split()]

n = [3,4,5]
# build 3-, 4- and 5-grams over the flattened word list and count them
a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
(a, good, show)                      2
(movie, last, night)                 2
(a, very, good)                      2
(last, night, i)                     2
(a, very, good, movie)               2
                                    ..
(should, we, do)                     1
(a, very, nice, song, was)           1
(asks, for, it, movie, last)         1
(this, song,, its, amazing, what)    1
(i, have, watched, a)                1
Length: 171, dtype: int64

Or, if the tuples should be joined by spaces:

n = [3,4,5]
a = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
last night i                  2
a good show                   2
a very good movie             2
very good movie               2
movie last night              2
                             ..
its amazing what should       1
watched last night i have     1
to se a                       1
very good movie last night    1
a very nice song was          1
Length: 171, dtype: int64
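
Since value_counts already returns the counts sorted in descending order, the ranking asked for in the question can be read straight off the top of the Series; a small usage sketch (top_phrases is just an illustrative name and keeping 10 phrases is an arbitrary choice):

# the Series is already sorted by count, so the head holds the most common phrases
top_phrases = a.head(10)
print(top_phrases)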

Another idea with Counter:

from nltk import ngrams
from collections import Counter

vals = [y for x in df['text'] for y in x.split()]
c = Counter([' '.join(y) for x in [3,4,5] for y in ngrams(vals, x)])

df1 = pd.DataFrame({'ngrams': list(c.keys()),
                   'count': list(c.values())})
print (df1)
                   ngrams  count
0               this is a      1
1               is a very      1
2             a very good      2
3         very good movie      2
4         good movie that      1
..                    ...    ...
166  show a good show was      1
167    a good show was on      1
168   good show was on tv      1
169   show was on tv last      1
170  was on tv last night      1

[171 rows x 2 columns]
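
If a ranked DataFrame is preferred, the Counter result built above can be sorted by its count column; a minimal sketch along those lines, reusing df1 and c from the code above:

# sort descending so the most frequent n-grams come first
df1 = df1.sort_values('count', ascending=False).reset_index(drop=True)
print(df1.head(10))

# Counter can also return the top phrases directly
print(c.most_common(10))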
jezrael
  • Can you please explain this in words: `a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()`? – taga Feb 03 '20 at 11:31
  • @taga - Sure, it is a [flattened list](https://stackoverflow.com/a/952952): I create the n-grams three times (once for each n) and the output is a single list of tuples. – jezrael Feb 03 '20 at 11:38
  • @taga - if you use a nested list comprehension without flattening, like `c = [[y for y in ngrams(vals, x)] for x in n]`, you get a list of lists of tuples and the solution fails, because a flat list of tuples is needed – jezrael Feb 03 '20 at 11:41
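
To make the point from these comments concrete, here is a small sketch (not part of the original answer; flat and nested are illustrative names) contrasting the flattened comprehension with the nested one:

from nltk import ngrams

vals = [y for x in df['text'] for y in x.split()]
n = [3, 4, 5]

# flattened: one list of tuples, which pd.Series(...).value_counts() can count
flat = [y for x in n for y in ngrams(vals, x)]
print(type(flat[0]))   # <class 'tuple'>

# nested: a list of three lists of tuples (one sub-list per n)
nested = [[y for y in ngrams(vals, x)] for x in n]
print(len(nested))     # 3
# pd.Series(nested).value_counts() would raise TypeError: unhashable type: 'list'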