
I need to create a document-term matrix from a large number of texts. After I have created it (one word per column), I will need to standardize all the word frequencies and eventually sum them. Still, I am stuck at the beginning:

Suppose that my example is:

    speech = [['7/10/2016', 'cat', 'dog', 'I have a speech to be stemmed here'],
              ['6/10/2016', 'dog', 'mouse', 'Here is another text']]

    import pandas as pd

    df = pd.DataFrame.from_records(
        ((r[0], r[1], r[2], r[3]) for r in speech),
        columns=["Date", "Name", "Surname", "Speech"])

Here, I have a DataFrame with one speech per row in the "Speech" column. I need to first stem the text contained in "Speech", and then to create the DTM. I know how to stem the data when I have a list of lists, but I am not able to handle DataFrames.
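A minimal sketch of both steps, assuming pandas is available. The `naive_stem` helper is only a stand-in for a real stemmer (in practice you would use something like NLTK's `PorterStemmer`); the DTM is built by counting tokens per row:

```python
import pandas as pd
from collections import Counter

def naive_stem(word):
    # Placeholder stemmer: strips a few common English suffixes.
    # Replace with e.g. nltk.stem.PorterStemmer().stem in real code.
    for suffix in ("ing", "med", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

speech = [['7/10/2016', 'cat', 'dog', 'I have a speech to be stemmed here'],
          ['6/10/2016', 'dog', 'mouse', 'Here is another text']]
df = pd.DataFrame(speech, columns=["Date", "Name", "Surname", "Speech"])

# Stem each speech: lowercase, split on whitespace, stem every token.
df["Stemmed"] = df["Speech"].str.lower().str.split().apply(
    lambda tokens: [naive_stem(t) for t in tokens])

# Document-term matrix: one row per speech, one column per stemmed word.
dtm = pd.DataFrame([Counter(tokens) for tokens in df["Stemmed"]]).fillna(0)
print(dtm)
```

The same DTM could also be built with `sklearn.feature_extraction.text.CountVectorizer` by passing a custom analyzer that stems its tokens.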

Finally, could you give me a clue on how to standardize the columns and sum them (to get the aggregate standardized frequency of words in a text)?
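One way to read "standardize" here is a per-column z-score; a sketch with a toy DTM (the word counts are made up for illustration):

```python
import pandas as pd

# Toy document-term matrix: rows are documents, columns are stemmed words.
dtm = pd.DataFrame({"cat": [2, 0, 1], "dog": [1, 1, 1], "mouse": [0, 3, 0]})

# Standardize each column to zero mean and unit (sample) standard deviation.
standardized = (dtm - dtm.mean()) / dtm.std()

# A constant column ("dog") has std 0, so the division produces NaN;
# drop such columns, since they carry no information after standardization.
standardized = standardized.dropna(axis=1, how="all")

# Aggregate standardized frequency per document: sum across word columns.
scores = standardized.sum(axis=1)
print(scores)
```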

  • You should provide us with an example input and a desired output. Also, you should show us what you tried. [See this](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Steven G Oct 07 '16 at 18:00
  • I provided an example input. The desired output is a document term matrix, not with entire words as columns, but as stemmed words. Is it clearer now? –  Oct 07 '16 at 18:07
  • Why did you create a new user account for this question, only a few hours after [your other question](http://stackoverflow.com/questions/39915552)? – lenz Oct 07 '16 at 18:31
