I need to create a document-term matrix from a large number of texts. After I have created it (one word per column), I will need to standardize each word's frequency and eventually sum them. Still, I am stuck at the beginning:
Suppose that my example is:

speech = [['7/10/2016', 'cat', 'dog', 'I have a speech to be stemmed here'],
          ['6/10/2016', 'dog', 'mouse', 'Here is another text']]
import pandas as pd

df = pd.DataFrame.from_records(
    ((r[0], r[1], r[2], r[3]) for r in speech),
    columns=["Date", "Name", "Surname", "Speech"])
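Printing df gives, roughly:

        Date Name Surname                              Speech
0  7/10/2016  cat     dog  I have a speech to be stemmed here
1  6/10/2016  dog   mouse                Here is another text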
Here, I have a DataFrame with one speech per row, in the "Speech" column. I need to first stem the text contained in "Speech", and then create the dtm. I know how to stem the data when I have a list of lists (see the sketch below), but I am not able to handle DataFrames.
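For reference, this is roughly what I do for a plain list of lists, using NLTK's PorterStemmer; the last line is only my guess at the pandas equivalent (stem_text is a helper name I made up):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# stem every word of a raw text string
def stem_text(text):
    return " ".join(stemmer.stem(word) for word in text.split())

# this is what I already know how to do: loop over the list of lists
stemmed = [row[:3] + [stem_text(row[3])] for row in speech]

# is .apply the right way to do the same on the DataFrame column?
df["Speech"] = df["Speech"].apply(stem_text)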
Finally, could you give me some clue on how to standardize the columns of the dtm and sum them, so as to get the aggregate standardized frequency of the words in each text?
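To make that last part concrete, here is my rough guess at the whole pipeline, assuming sklearn's CountVectorizer is the right tool for the dtm and that "standardize" means z-scoring each word column (please correct me if not):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# build the document-term matrix from the stemmed speeches
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df["Speech"])
dtm = pd.DataFrame(counts.toarray(),
                   columns=vectorizer.get_feature_names_out())
# (older sklearn versions use get_feature_names() instead)

# z-score each word column ...
standardized = (dtm - dtm.mean()) / dtm.std()

# ... then sum across words to get one aggregate score per document
aggregate = standardized.sum(axis=1)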