I have the following sample dataframe:
No   category  problem_definition
175  2521      ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211  1438      ['galley', 'work', 'table', 'stuck']
912  2698      ['cloth', 'stuck']
572  2521      ['stuck', 'coffee']
The problem_definition field has already been tokenized, with stop words removed.
I want to create a frequency distribution that outputs another Pandas dataframe:
1) with the frequency of each word in problem_definition
2) with the frequency of each word in problem_definition, grouped by the category field
Sample desired output below for case 1):
text count
coffee 2
maker 1
brewing 1
properly 1
2 1
420 3
stuck 3
galley 1
work 1
table 1
cloth 1
Sample desired output below for case 2):
category text count
2521 coffee 2
2521 maker 1
2521 brewing 1
2521 properly 1
2521 2 1
2521 420 3
2521 stuck 1
1438 galley 1
1438 work 1
1438 table 1
1438 stuck 1
2698 cloth 1
2698 stuck 1
I tried the following code to accomplish 1):
from nltk.probability import FreqDist
import pandas as pd
fdist = FreqDist(df['problem_definition'])
This raises the following error:
TypeError: unhashable type: 'list'
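The error most likely occurs because FreqDist tries to hash each element of the column, and each element here is a whole list rather than a single word. A minimal sketch of case 1), assuming every cell holds a real Python list (not a string that merely looks like one), is to flatten the lists into one stream of tokens before counting. The sample dataframe is rebuilt by hand from the table above, and the names word_counts and freq_df are only illustrative. collections.Counter is used so the sketch runs without nltk; since FreqDist subclasses Counter, it should accept the same flattened input.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data rebuilt from the question; assumes each cell is a real Python list.
df = pd.DataFrame({
    "No": [175, 211, 912, 572],
    "category": [2521, 1438, 2698, 2521],
    "problem_definition": [
        ["coffee", "maker", "brewing", "properly", "2", "420", "420", "420"],
        ["galley", "work", "table", "stuck"],
        ["cloth", "stuck"],
        ["stuck", "coffee"],
    ],
})

# Flatten the per-row lists into a single stream of tokens, then count them.
# Counting individual strings avoids hashing a list, which is what fails above.
word_counts = Counter(chain.from_iterable(df["problem_definition"]))

# Convert the counts into a two-column dataframe (text, count).
freq_df = pd.DataFrame(list(word_counts.items()), columns=["text", "count"])
print(freq_df)
```

The rows come out in order of first appearance; freq_df.sort_values("count", ascending=False) would order them by frequency instead.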
I have no idea how to approach case 2).
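For case 2), one possible approach (a sketch, not necessarily the best way) is to give each token its own row with DataFrame.explode, which needs pandas 0.25 or later, and then count rows per (category, word) pair. It assumes the cells are real Python lists; the sample data is rebuilt by hand from the table above, and the names exploded and by_category are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; assumes each cell is a real Python list.
df = pd.DataFrame({
    "No": [175, 211, 912, 572],
    "category": [2521, 1438, 2698, 2521],
    "problem_definition": [
        ["coffee", "maker", "brewing", "properly", "2", "420", "420", "420"],
        ["galley", "work", "table", "stuck"],
        ["cloth", "stuck"],
        ["stuck", "coffee"],
    ],
})

# One row per token; the category value is repeated for every token in a list.
exploded = df.explode("problem_definition")

# Count rows per (category, word) pair and flatten the result into columns.
# sort=False keeps groups in order of first appearance instead of sorting keys.
by_category = (
    exploded.groupby(["category", "problem_definition"], sort=False)
    .size()
    .reset_index(name="count")
    .rename(columns={"problem_definition": "text"})
)
print(by_category)
```

Dropping "category" from the groupby keys would give the overall counts from case 1) with the same pattern.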