
I have a set of texts contained in a list, loaded from a CSV file:

texts=['this is text1', 'this would be text2', 'here we have text3']

and I would like to create a document-term matrix (DTM) using the stemmed words. I have already stemmed the texts, which gives:

[['text1'], ['would', 'text2'], ['text3']]

What I would like to do is to create a DTM that counts all the stemmed terms (then I would need to do some operations on the rows).

As for the unstemmed texts, I am able to build the DTM for short texts using the function fn_tdm_df reported here. What would be more practical for me, though, is a DTM of the stemmed words. Just to be clearer, this is the output I get from applying fn_tdm_df:

  be  have  here   is  text1  text2  text3  this   we  would
0  1.0   1.0   1.0  1.0    1.0    1.0    1.0     1  1.0    1.0
1  0.0   0.0   0.0  0.0    0.0    0.0    0.0     1  0.0    0.0

First, I do not know why I get only two rows instead of three. Second, my desired output would be something like:

  text1  would  text2  text3
0   1      0      0      0
1   0      1      1      0
2   0      0      0      1

I am sorry, but I am really stuck on this output. I also tried exporting the stemmed texts and re-importing them in R, but they do not encode correctly. Given the huge amount of data, I will probably need to work with DataFrames. What would you suggest?

----- UPDATE

Using CountVectorizer I am not fully satisfied, as I do not get a tractable matrix in which I can easily normalize and sum rows/columns.

Here is the code I am using, but it hangs Python (the dataset is too large). How can I run it efficiently?

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)
print(pd.DataFrame(X.A, columns=vect.get_feature_names()).to_string())
df = pd.DataFrame(X.toarray().transpose(), index=vect.get_feature_names())
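
A sketch of one way to avoid the hang, assuming the blow-up comes from `X.A` / `X.toarray()` materializing a dense matrix for a large vocabulary: row and column sums, and row normalization, all work directly on the sparse matrix returned by `fit_transform`, so the dense conversion can be skipped entirely.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ['this is text1', 'this would be text2', 'here we have text3']

vect = CountVectorizer()
X = vect.fit_transform(texts)   # scipy.sparse CSR matrix, never densified

# Sums stay cheap on the sparse representation.
row_sums = np.asarray(X.sum(axis=1)).ravel()   # tokens per document
col_sums = np.asarray(X.sum(axis=0)).ravel()   # corpus frequency per term

# Row-normalize without densifying: scale each row by 1 / its sum.
X_norm = X.multiply(1.0 / row_sums[:, None]).tocsr()
```

Only slices that are actually needed should be converted with `toarray()`; the full dense DataFrame is what exhausts memory on a large corpus.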

1 Answer


Why don't you use sklearn? The CountVectorizer() method converts a collection of text documents to a matrix of token counts. What's more, it gives a sparse representation of the counts using scipy.

You can either give your raw entries to the method or preprocess them as you have done (stemming + stop-word removal).

Check this out: CountVectorizer()

  • That is a good idea, indeed. I tried it, but it gave me a "confusing" output; that is, I would like to have a tractable matrix in which I can sum columns and create new ones. That is the code I am implementing following your hint (see the edited post). – dnquixote Oct 08 '16 at 15:13
  • the `fit_transform()` method returns an `array`. You can transform it to a `DataFrame` using pandas. Then you'll be able to do what ever you want with your data. – MMF Oct 08 '16 at 15:29
  • That is fine, indeed. However, Python is blocking it while running my code (above), that coincides with what you suggested. Any hint on how to make it run smoothly? – dnquixote Oct 08 '16 at 16:44
  • Where does it block ? Did you try to reduce the number of words you have ? Use `stop_words = 'english' ` as a parameter of your `CountVectorizer`. See if it does not prune the dataset enough first. – MMF Oct 08 '16 at 17:04
  • I stem and remove punctuation before that. It blocks when computing df, in the last line. I do the stemming before the lines of code posted. How can I reduce a bit the matrix? – dnquixote Oct 08 '16 at 17:07
  • Use `stop_words='english'` in your `CountVectorizer`. It takes off stop words encountered in the English language. It seems that you did not take them off – MMF Oct 08 '16 at 17:09
  • I am sorry I do not have enough reputation to chat, but it does not work at all. There is no punctuation, and it blocks when doing last line. – dnquixote Oct 08 '16 at 17:13
  • Use this instead `vect = CountVectorizer(min_df=0., max_df=1.0, stop_words='english')` – MMF Oct 08 '16 at 17:15
  • Unfortunately it crashes. I had to restart my computer. – dnquixote Oct 08 '16 at 17:49