
I have a dataset of ~100,000 documents, each with a unique doc_id and four columns containing text (like below).

[screenshot of the original dataset]

I want to vectorize each of the four text columns individually and then combine all of those features back into one large dataset for building a prediction model. I approached the vectorization of each text feature with code like the following:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words("english")

# one CountVectorizer per text column, each building its own vocabulary
subject_transformer = CountVectorizer(stop_words=stopwords)
subject_vectorized = subject_transformer.fit_transform(full_docs['subject'])

body_transformer = CountVectorizer(stop_words=stopwords)
body_vectorized = body_transformer.fit_transform(full_docs['body'])

excerpt_transformer = CountVectorizer(stop_words=stopwords)
excerpt_vectorized = excerpt_transformer.fit_transform(full_docs['excerpt'])

regex_transformer = CountVectorizer(stop_words=stopwords)
regex_vectorized = regex_transformer.fit_transform(full_docs['regex'])

Each vectorization yields a sparse matrix like below, where the first value in each row is the document index, the second is the term index (one index per unique word in that column's vocabulary), and the last value is the actual count.

[screenshot of one of the sparse matrices]
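
For reference, this is roughly how I'm inspecting one of these matrices (subject shown; the others are analogous):

print(subject_vectorized.shape)                       # (number of documents, size of subject vocabulary)
print(subject_vectorized[:5])                         # the (doc index, term index)  count triples shown above
print(len(subject_transformer.get_feature_names()))   # vocabulary size, matches shape[1]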

What I want to do is the following:

  1. Convert each sparse matrix to a full dataframe of dimensions n x p, where n = number of documents and p = number of words in that column's vocabulary
  2. Merge each of these matrices/dataframes back together for the purpose of building a model for prediction

I initially tried the following:

regex_vectorized_df = pd.DataFrame(regex_vectorized.toarray())

Then I could merge the four individual dataframes back together. This doesn't work because toarray() is too memory-intensive on this many documents. What is the best way to merge these four sparse matrices into one dataset with one row per document?
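
To make the memory problem concrete, the full dense version of what I was attempting looks roughly like this (this is the part that is too memory-intensive on the full ~100,000 documents):

import pandas as pd

subject_df = pd.DataFrame(subject_vectorized.toarray())
body_df = pd.DataFrame(body_vectorized.toarray())
excerpt_df = pd.DataFrame(excerpt_vectorized.toarray())
regex_df = pd.DataFrame(regex_vectorized.toarray())

# concatenate column-wise so there is one row per document
combined_df = pd.concat([subject_df, body_df, excerpt_df, regex_df], axis=1)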

Derek Jedamski
  • Share the .head() of your dataframes and a bit of your code so that we can better visualize your issue – Zeugma Oct 20 '16 at 18:58
  • Thanks - I just updated to include some more information regarding the original dataset and the code I am using. – Derek Jedamski Oct 20 '16 at 20:51
  • You don't need to combine the vectorizers; you can combine the features before training the classifiers: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html – alvas Oct 21 '16 at 01:58
  • Possible duplicate of [Encode and assemble multiple features in PySpark](http://stackoverflow.com/questions/32982425/encode-and-assemble-multiple-features-in-pyspark) –  Oct 27 '16 at 16:56

0 Answers