
I have a DataFrame with a list of products and their respective reviews:

+-----------+------------------------------------------------+
| product   | review                                         |
+-----------+------------------------------------------------+
| product_a | It's good for a casual lunch                   |
+-----------+------------------------------------------------+
| product_b | Avery is one of the most knowledgable baristas |
+-----------+------------------------------------------------+
| product_c | The tour guide told us the secrets             |
+-----------+------------------------------------------------+

How can I get all the unique words in the data frame?

I made a function:

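from collections import Counter

def count_words(text):
    try:
        # Lowercase first so 'The' and 'the' are counted together
        text = text.lower()
        words = text.split()
        word_counts = Counter(words)
    except AttributeError:
        # Non-string values (e.g. NaN) have no .lower()
        word_counts = {'': 0}
    return word_counts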

I applied the function to the DataFrame, but that only gives me the word counts for each row:

reviews['words_count'] = reviews['review'].apply(count_words)
Luis Ramon Ramirez Rodriguez

1 Answer


Starting with this:

dfx
               review
0      United Kingdom
1  The United Kingdom
2     Dublin, Ireland
3    Mardan, Pakistan

To get all unique words in the "review" column:

 list(dfx['review'].str.split(' ', expand=True).stack().unique())

   ['United', 'Kingdom', 'The', 'Dublin,', 'Ireland', 'Mardan,', 'Pakistan']
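Note that splitting on a single space keeps punctuation attached ('Dublin,') and is case-sensitive. If that matters, one possible variation is to lowercase first and extract tokens with a regex. This is only a sketch: the pattern `[\w']+` is my assumption about what counts as a word (it keeps apostrophes, so "It's" stays one token), and the frame is rebuilt here just so the snippet runs on its own.

```python
import pandas as pd

# Same sample frame as above, rebuilt so the snippet is self-contained.
dfx = pd.DataFrame({'review': ['United Kingdom', 'The United Kingdom',
                               'Dublin, Ireland', 'Mardan, Pakistan']})

# Lowercase, then pull out runs of word characters/apostrophes,
# so 'Dublin,' becomes 'dublin'. The regex is an assumption, adjust as needed.
tokens = (dfx['review']
          .str.lower()
          .str.findall(r"[\w']+")   # one list of tokens per row
          .explode()                # one token per row
          .dropna())                # drop rows that were NaN or had no tokens

unique_words = sorted(tokens.unique())
print(unique_words)
# ['dublin', 'ireland', 'kingdom', 'mardan', 'pakistan', 'the', 'united']
```

`Series.explode()` needs pandas 0.25 or later; on older versions the `split(..., expand=True).stack()` approach still applies.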

To get the count of each word in the "review" column:

dfx['review'].str.split(' ', expand=True).stack().value_counts()


United      2
Kingdom     2
Mardan,     1
The         1
Ireland     1
Dublin,     1
Pakistan    1
dtype: int64
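If you would rather stay with `collections.Counter`, as in the question, the per-row counts can be folded into one running total instead of being stored per row. A minimal sketch, assuming whitespace-separated, lowercased words; `total_word_counts` is a hypothetical helper name, and the sample strings are the three reviews from the question's table.

```python
from collections import Counter

def total_word_counts(texts):
    """Count lowercase, whitespace-separated words across all texts."""
    total = Counter()
    for text in texts:
        # Skip non-string entries such as NaN; they contribute no words.
        if isinstance(text, str):
            total.update(text.lower().split())
    return total

# Sample reviews copied from the question's table.
reviews = ["It's good for a casual lunch",
           "Avery is one of the most knowledgable baristas",
           "The tour guide told us the secrets"]

totals = total_word_counts(reviews)
print(totals.most_common(1))  # [('the', 3)]
print(len(totals))            # 19 unique words
```

The function only iterates over its argument, so passing a column such as `reviews['review']` from the question's DataFrame should work the same way, and `list(totals)` gives the unique words.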
Merlin