I have a set of texts contained in a list, which I loaded from a CSV file:
texts=['this is text1', 'this would be text2', 'here we have text3']
I would like to create a document-term matrix from the stemmed words. I have already stemmed the texts, so I now have:
[['text1'], ['would', 'text2'], ['text3']]
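Just for context, the stemming step itself is not the problem; something along these lines (sketched here with NLTK's PorterStemmer and stop-word list, which may not match exactly what I ran) produces token lists of this kind:

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

# drop stop words and stem whatever is left, one token list per text
stemmed = [[stemmer.stem(tok) for tok in text.lower().split() if tok not in stop]
           for text in texts]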
What I would like to do is build a DTM that counts all the stemmed terms (I then need to perform some operations on its rows).
For the unstemmed texts, I am able to build a DTM for short texts using the function fn_tdm_df reported here. What would be more practical for me, though, is a DTM of the stemmed words. To be clearer, this is the output I get from applying fn_tdm_df:
   be  have  here   is  text1  text2  text3  this   we  would
0  1.0   1.0   1.0  1.0    1.0    1.0    1.0     1  1.0    1.0
1  0.0   0.0   0.0  0.0    0.0    0.0    0.0     1  0.0    0.0
First, I do not know why I get only two rows instead of three. Second, my desired output would be something like:
   text1  would  text2  text3
0      1      0      0      0
1      0      1      1      0
2      0      0      0      1
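To make the target concrete: on a toy example I can get exactly this kind of DataFrame by counting the stemmed tokens with one Counter per document, but I doubt this dense, row-by-row approach scales to my real data:

from collections import Counter
import pandas as pd

stemmed = [['text1'], ['would', 'text2'], ['text3']]  # the stemmed lists shown above

# one row per document, one column per stemmed term, missing terms filled with 0
dtm = pd.DataFrame([Counter(doc) for doc in stemmed]).fillna(0).astype(int)
print(dtm)

This prints the 3 x 4 matrix above, but it will not work for the size of my corpus.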
I am sorry, but I am really stuck on producing this output. I also tried to export the stemmed texts and re-import them in R, but the encoding does not come through correctly. Given the huge amount of data, I would probably need to work with DataFrames. What would you suggest?
----- UPDATE
Using CountVectorizer, I am not fully satisfied, as I do not get a tractable matrix on which I can easily normalize and sum rows/columns.
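To give an idea, these are the kinds of row/column operations I need to run, sketched here on a toy DataFrame shaped like my desired output:

import pandas as pd

# toy document-term matrix: documents on the rows, stemmed terms on the columns
dtm = pd.DataFrame([[1, 0, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 0, 1]],
                   columns=['text1', 'would', 'text2', 'text3'])

term_totals = dtm.sum(axis=0)              # total count of each term over all documents
doc_lengths = dtm.sum(axis=1)              # number of stemmed terms in each document
normalized = dtm.div(doc_lengths, axis=0)  # row-normalized relative frequencies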
Here is the code I am using, but it freezes Python (the dataset is too large). How can I run it efficiently?
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)  # X is a sparse document-term matrix
# X.A and X.toarray() both densify X, which is what exhausts memory on a large corpus
print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string())
df = pd.DataFrame(X.toarray().transpose(), index=vect.get_feature_names())
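One thing I considered is keeping the matrix sparse instead of densifying it, roughly like this (using pandas' sparse DataFrame constructor), but I am not sure whether it supports the normalization and row/column sums I need, or whether there is a better approach:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)  # stays sparse

# build a sparse-backed DataFrame without ever calling .toarray()
# (get_feature_names() is get_feature_names_out() in newer scikit-learn)
df_sparse = pd.DataFrame.sparse.from_spmatrix(X, columns=vect.get_feature_names())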