Python blocks when creating large DTM with CountVectorizer

Question

Following my previous question, I wrote code that creates a DTM (document-term matrix). I then need to do some calculations across the columns and rows of the DTM. However, Python hangs when executing the last line, and it is practically impossible to run the code (the whole PC freezes). How can I make the process smoother?

Here is the code I am running (of course, the real texts is much larger):

texts = ['text1', 'text4', 'text2', 'text3']  # each text has already been stemmed and had its punctuation removed

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
import itertools 

merged = list(itertools.chain.from_iterable(texts))

vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(texts)
# This is the line that hangs: X.toarray() builds the full dense matrix in memory
df = pd.DataFrame(X.toarray().transpose(), index = vect.get_feature_names())

The instruction probably needs too much memory. Open a memory monitor before running it. Unfortunately this can be a real problem: when computing large matrices (especially multidimensional ones), it is not trivial to free the space used by already-processed data, because the interpreter does not know which cells of the matrix are no longer needed. You could run the computation in the cloud on a machine with a large amount of memory. Also, the number of cells in a multidimensional matrix, and hence the time to compute it, grows exponentially with the number of dimensions, so many dimensions typically produce a huge number of cells. - sergzach, Oct 10 '16 at 12:17
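
To make the memory point concrete, here is a minimal sketch (not a tested fix for the original corpus) that compares the size the dense array would have with the size of the sparse matrix that CountVectorizer already returns, and then computes row and column totals without calling toarray(). The three toy documents and the variable names (dense_bytes, sparse_bytes, term_totals, doc_lengths, cooc) are assumptions made for illustration; the real texts would take their place.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real corpus (assumption: texts is a list of strings).
texts = ["apple banana apple", "banana cherry", "apple cherry cherry"]

vect = CountVectorizer()
X = vect.fit_transform(texts)   # SciPy sparse matrix, shape (n_docs, n_terms)
vocab = vect.vocabulary_        # dict mapping each term to its column index

# Bytes the dense array would need if X.toarray() were called.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
# Bytes actually held by the sparse (CSR) representation.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print("dense:", dense_bytes, "bytes; sparse:", sparse_bytes, "bytes")

# Column and row totals computed directly on the sparse matrix.
term_totals = np.asarray(X.sum(axis=0)).ravel()  # total count of each term
doc_lengths = np.asarray(X.sum(axis=1)).ravel()  # number of counted tokens per document

# A term-by-term product also stays sparse.
cooc = X.T @ X                                   # shape (n_terms, n_terms)
```

On a corpus this small the sparse form brings no saving; the comparison only becomes interesting when the vocabulary is large and most counts are zero. Whether this is enough for the calculations in the question depends on what those calculations are, which the question does not say.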


0 Answers