
I have the following snippet running to train a model for text classification. I have optimized it quite a bit and it runs pretty smoothly; however, it still uses a lot of RAM. Our dataset is huge (13 million documents + 18 million words in the vocabulary), but the point in execution at which the error is thrown is, in my opinion, very odd. The script:

import numpy
import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

encoder = LabelEncoder()
y = encoder.fit_transform(categories)
classes = list(range(0, len(encoder.classes_)))

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             binary=True,
                             dtype=numpy.int8)

classifier = SGDClassifier(loss='modified_huber',
                           n_jobs=-1,
                           average=True,
                           random_state=1)

tokenpath = modelpath.joinpath("tokens")
for i in range(0, len(batches)):
    token_matrix = joblib.load(
        tokenpath.joinpath("{}.pickle".format(i)))
    batchsize = len(token_matrix)
    classifier.partial_fit(
        vectorizer.transform(token_matrix),
        y[i * batchsize:(i + 1) * batchsize],
        classes=classes
    )

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))

I got the MemoryError at this line:

joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))

At this point in execution, training is finished and the classifier has already been dumped. It should be collected by the garbage collector if more memory is needed. On top of that, why would joblib allocate so much memory if it is not even compressing the data?

I do not have deep knowledge of the inner workings of the Python garbage collector. Should I force gc.collect() or use 'del' statements to free the objects that are no longer needed?

Update:

I have tried using the HashingVectorizer and, even though it greatly reduces memory usage, the vectorizing is much slower, which makes it a poor alternative.

I have to pickle the vectorizer so that I can later use it in the classification process to generate the sparse matrix that is fed to the classifier. Here is my classification code:

extracted_features = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(features.extractor)(d, extraction_options) for d in documents)

probabilities = classifier.predict_proba(
    vectorizer.transform(extracted_features))

predictions = category_encoder.inverse_transform(
    probabilities.argmax(axis=1))

trust = probabilities.max(axis=1)
Fabio Picchi
  • Could you use `HashingVectorizer` instead? What type is your `vocabulary`? Why do you need to pickle the vectorizer in the first place? – krassowski Mar 29 '18 at 12:30
  • @krassowski I updated my question to include further details on the classification process. Also, the vocabulary is a set of strings containing all the features extracted from the documents – Fabio Picchi Mar 29 '18 at 14:14

1 Answer


If you are providing your own custom vocabulary to the CountVectorizer, it should not be a problem to recreate it later on, during classification. Since you provide a set of strings instead of a mapping, you probably want to use the parsed vocabulary, which you can access with:

parsed_vocabulary = vectorizer.vocabulary_
joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))

and then load it and use it to re-create the CountVectorizer:

vectorizer = CountVectorizer(
    vocabulary=parsed_vocabulary,
    binary=True,
    dtype=numpy.int8
)

Note that you do not need to use joblib here; the standard pickle should perform just as well. You might also get better results with any of the available alternatives, PyTables being worth a mention.
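For illustration, a minimal sketch using the standard library pickle module instead (the vocabulary.pickle name just mirrors the joblib call above):

import pickle

# Save the plain {token: column_index} dict with the standard library pickle
with open(modelpath.joinpath('vocabulary.pickle'), 'wb') as f:
    pickle.dump(parsed_vocabulary, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later, during classification, load it back before recreating the vectorizer
with open(modelpath.joinpath('vocabulary.pickle'), 'rb') as f:
    parsed_vocabulary = pickle.load(f)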

If that also uses too much memory, you should try using the original vocabulary to recreate the vectorizer; currently, when provided with a set of strings as the vocabulary, vectorizers just convert the set to a sorted list, so you shouldn't need to worry about reproducibility (although I would double-check that before using it in production). Or you could just convert the set to a list on your own, as sketched below.
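A minimal sketch of rebuilding the vectorizer from the raw vocabulary, assuming vocabulary is your original set of feature strings:

# Convert the original set to a sorted list so the column order is deterministic
vocabulary_list = sorted(vocabulary)

vectorizer = CountVectorizer(
    vocabulary=vocabulary_list,
    binary=True,
    dtype=numpy.int8
)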

To sum up: because you do not fit() the vectorizer, the whole added value of using CountVectorizer is its transform() method; since all the data it needs is the vocabulary (and the parameters), you can reduce memory consumption by pickling just your vocabulary, either processed or not.

As you asked for an answer drawing on official sources, I would like to point you to https://github.com/scikit-learn/scikit-learn/issues/3844, where an owner and a contributor of scikit-learn mention recreating a CountVectorizer, albeit for other purposes. You may have better luck reporting your problem in the linked repo, but make sure to include a dataset that reproduces the excessive memory usage.

And finally, you may just use HashingVectorizer, as mentioned earlier in a comment.
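For completeness, a rough sketch of that route; HashingVectorizer is stateless, so there is nothing to pickle at all (the parameters below are illustrative, not tuned):

from sklearn.feature_extraction.text import HashingVectorizer

hashing_vectorizer = HashingVectorizer(
    n_features=2 ** 22,    # illustrative; trades hash collisions for memory
    binary=True,
    alternate_sign=False   # keep features non-negative, closer to binary counts
)

# No fitting and no vocabulary_ attribute: only the parameters matter
X = hashing_vectorizer.transform(token_matrix)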

PS: regarding the use of gc.collect(): I would give it a go in this case; as for the technical details, you will find many questions on SO tackling this issue.
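A minimal sketch of how that could look in your training script, dropping the reference to the classifier before dumping the vectorizer:

import gc

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))

# Drop the last reference so the classifier becomes collectable,
# then force a collection before the next large dump
del classifier
gc.collect()

joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))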

krassowski
  • krassowski, thank you for the detailed answer. I am sorry I drifted away from the main topic of my question, leading you to give alternatives to the method I used. I really value your tips and will probably use them, but what I really wanted to know was whether or not the classifier would be collected by the gc if I called gc.collect() right before dumping the vectorizer. Also, I don't understand how joblib.dump could increase memory usage to the point that the process got killed. Shouldn't the gc be invoked if there is no free memory left? If it was invoked, why couldn't it free memory? – Fabio Picchi Mar 29 '18 at 21:00
  • Any Python object is collected when gc.collect() is called if there are no references to it kept, so using del would be needed. If you want the garbage collector to run just before running out of memory, you are better off invoking it manually than hoping it will work by itself; gc in Python runs in cycles, not when memory is full. More precisely, it starts when a certain number of memory allocation/deallocation operations have been performed; see more in the docs or https://stackoverflow.com/a/22440880/6646912 – krassowski Mar 29 '18 at 21:38