I am trying to break down the text column of a dataframe and get the top words per row/document. I already have the top words for the dataframe as a whole; in this example they are "machine" and "learning", each with a count of 8. However, I'm unsure how to get the top words per document instead of for the whole dataframe.

Below are the results for the top words for the dataframe as a whole:

machine 8

learning 8

important 2

think 1

significant 1

import pandas as pd
y = ['machine learning. i think machine learning rather significant machine learning',
     'most important aspect is machine learning. machine learning very important essential',
     'i believe machine learning great, machine learning machine learning']
x = ['a', 'b', 'c']
practice = pd.DataFrame(data=y, index=x, columns=['text'])

What I am expecting is another column, next to the text column, that gives the count of the top word in each row. For example, for the word 'machine' the dataframe should look like:

a / … / 3

b / … / 2

c / … / 3

  • 1
    Have you looked at any of the many NLP questions out there that deal with this general topic? For example, [this](https://stackoverflow.com/questions/18936957/count-distinct-words-from-a-pandas-data-frame), [this](https://stackoverflow.com/questions/47597738/word-count-of-single-column-in-pandas-dataframe?noredirect=1&lq=1), or [this](https://cmdlinetips.com/2018/02/how-to-get-frequency-counts-of-a-column-in-pandas-dataframe/)? In short _what have you researched already and why was it insufficient?_ – mayosten Oct 08 '19 at 22:05

1 Answer

You can do this with Counter from the collections module, counting the words in each row separately:

import pandas as pd
from collections import Counter

y = ['machine learning. i think machine learning rather significant machine learning',
     'most important aspect is machine learning. machine learning very important essential',
     'i believe machine learning great, machine learning machine learning']
x = ['a', 'b', 'c']
practice = pd.DataFrame(data=y, index=x, columns=['text'])

word_frequency = []

for line in practice["text"]:
    words = line.split()              # split the row on whitespace into a list of words
    words_counter = Counter(words)    # count the occurrences of each word in this row
    top_count = words_counter.most_common(1)[0][1]    # count of the most frequent word in this row
    word_frequency.append(top_count)  # collect one count per row

practice["Word Frequency"] = word_frequency    # add the counts as a new column in the dataframe
print(practice)
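One thing to watch with the loop above: line.split() splits only on whitespace, so 'learning.' and 'learning' (or 'great,' and 'great') are counted as different words. Below is a minimal sketch, not part of the original answer, that tokenizes with a regular expression instead and keeps the top word itself alongside its count. The helper name top_word_and_count and the column names top_word and top_count are made up for illustration, and the regex assumes that lowercase letters and apostrophes are all that make up a word.

```python
import re
from collections import Counter

import pandas as pd

y = ['machine learning. i think machine learning rather significant machine learning',
     'most important aspect is machine learning. machine learning very important essential',
     'i believe machine learning great, machine learning machine learning']
practice = pd.DataFrame(data=y, index=['a', 'b', 'c'], columns=['text'])


def top_word_and_count(text):
    """Hypothetical helper: return (word, count) for the most frequent word."""
    # Assumption: a "word" is a run of letters/apostrophes; punctuation is dropped.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return (None, 0)
    return Counter(words).most_common(1)[0]


results = [top_word_and_count(text) for text in practice["text"]]
practice["top_word"] = [word for word, _ in results]
practice["top_count"] = [count for _, count in results]
print(practice[["top_word", "top_count"]])
```

When several words tie for first place, most_common returns the one that appears first in the row (Python 3.7+). No stop words are removed here, so common words such as "is" or "i" can win in other texts.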

For more details, please refer to the Counter documentation: https://docs.python.org/2/library/collections.html#collections.Counter
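If the goal is instead the per-row count of one specific word, as in the 'machine' example in the question, Counter is not strictly needed. The following is a rough sketch under that assumption, using pandas' Series.str.count with a word-boundary regex. The variable names word and pattern and the column name machine_count are illustrative only.

```python
import re

import pandas as pd

y = ['machine learning. i think machine learning rather significant machine learning',
     'most important aspect is machine learning. machine learning very important essential',
     'i believe machine learning great, machine learning machine learning']
practice = pd.DataFrame(data=y, index=['a', 'b', 'c'], columns=['text'])

word = "machine"  # the word to count in each row
# \b anchors keep the match to whole words; re.escape guards against regex metacharacters.
pattern = r"\b" + re.escape(word) + r"\b"

# Count the non-overlapping matches of the pattern in each row, ignoring case.
practice[word + "_count"] = practice["text"].str.count(pattern, flags=re.IGNORECASE)
print(practice)
```

For the sample data this gives 3, 2 and 3 for rows a, b and c, which matches the layout sketched in the question.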

RamWill