I need to create a document-term matrix from a large number of texts. After I have created it (one word per column), I will need to standardize each word's frequency and eventually sum them. Still, I am stuck at the beginning:
Suppose that my example is:

speech = [['7/10/2016', 'cat', 'dog', 'I have a speech to be stemmed here'],
          ['6/10/2016', 'dog', 'mouse', 'Here is another text']]
import pandas as pd

df = pd.DataFrame.from_records(
    ((r[0], r[1], r[2], r[3]) for r in speech),
    columns=["Date", "Name", "Surname", "Speech"])
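Printing df gives, roughly:

        Date Name Surname                              Speech
0  7/10/2016  cat     dog  I have a speech to be stemmed here
1  6/10/2016  dog   mouse                Here is another text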
Here, I have a DataFrame with one speech per row, in the "Speech" column. I need to first stem the text contained in "Speech", and then create the dtm. I know how to stem the data when I have a list of lists (see the sketch below), but I am not able to handle DataFrames.
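For reference, this is roughly what I do for a plain list of lists, using NLTK's PorterStemmer; the last line is only my guess at the pandas equivalent (stem_text is a helper name I made up):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# stem every word of a raw text string
def stem_text(text):
    return " ".join(stemmer.stem(word) for word in text.split())

# this is what I already know how to do: loop over the list of lists
stemmed = [row[:3] + [stem_text(row[3])] for row in speech]

# is .apply the right way to do the same on the DataFrame column?
df["Speech"] = df["Speech"].apply(stem_text)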
Finally, could you give me some clue on how to standardize the columns of the dtm and sum them, so as to get the aggregate standardized frequency of the words in each text?
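To make that last part concrete, here is my rough guess at the whole pipeline, assuming sklearn's CountVectorizer is the right tool for the dtm and that "standardize" means z-scoring each word column (please correct me if not):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# build the document-term matrix from the stemmed speeches
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df["Speech"])
dtm = pd.DataFrame(counts.toarray(),
                   columns=vectorizer.get_feature_names_out())
# (older sklearn versions use get_feature_names() instead)

# z-score each word column ...
standardized = (dtm - dtm.mean()) / dtm.std()

# ... then sum across words to get one aggregate score per document
aggregate = standardized.sum(axis=1)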