Trying to avoid .toarray() when loading data into an SVC model in scikit-learn

Question

I'm trying to plug a bunch of data (sentiment-tagged tweets) into an SVM using scikit-learn. I've been using CountVectorizer to build a sparse array of word counts, and it's all working fine with smallish data sets (~5000 tweets). However, when I try to use a larger corpus (ideally 150,000 tweets, but I'm currently exploring with 15,000), .toarray(), which converts a sparse format to a denser format, immediately starts taking up immense amounts of memory (30k tweets hit over 50gb before the MemoryError.

So my question is -- is there a way to feed LinearSVC() or a different manifestation of SVM a sparse matrix? Am I necessarily required to use a dense matrix? It doesn't seem like a different vectorizer would help fix this problem (as this problem seems to be solved by: MemoryError in toarray when using DictVectorizer of Scikit Learn). Is a different model the solution? It seems like all of the scikit-learn models require a dense array representation at some point, unless I've been looking in the wrong places.

cv = CountVectorizer(analyzer=str.split)
clf = svm.LinearSVC()

X = cv.fit_transform(data)
trainArray = X[:breakpt].toarray()
testArray = X[breakpt:].toarray()

clf.fit(trainArray, label)
guesses = clf.predict(testArray)

score 2 · Accepted Answer · answered Dec 16 '14 at 21:50

LinearSVC.fit and its predict method can both handle a sparse matrix as the first argument, so just removing the toarray calls from your code should work.

All estimators that take sparse inputs are documented as doing so. E.g., the docstring for LinearSVC states:

Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    Training vector, where n_samples in the number of samples and
    n_features is the number of features.

wow - I can't believe I missed that, thanks for pointing it out — Razi Shaban, Dec 17 '14 at 00:50

Trying to avoid .toarray() when loading data into an SVC model in scikit-learn

1 Answers1