
I am using sklearn's RandomForestClassifier to predict a set of classes. I have over 26,000 classes, so the size of the fitted classifier exceeds 30 GB. I am running it on Linux with 64 GB of RAM and 20 GB of storage.

I am trying to pickle my model with joblib, but it is not working, presumably because I don't have enough secondary storage. Is there any way this could be done? Maybe some kind of compression technique, or something else?

DumbCoder

2 Answers


You could try gzipping the pickle:

import gzip, io, pickle

# io.BytesIO, not StringIO: gzip produces binary data
compressed_pickle = io.BytesIO()
with gzip.GzipFile(fileobj=compressed_pickle, mode='wb') as f:
    f.write(pickle.dumps(classifier))

Then you can write the compressed_pickle to a file.
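A minimal sketch of that step (the filename matches the read-back snippet below):

# write the compressed bytes out to disk
with open('rf_classifier.pickle', 'wb') as out:
    out.write(compressed_pickle.getvalue())  # getvalue() returns the buffer's bytes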

To read it back:

import pickle, zlib

with open('rf_classifier.pickle', 'rb') as f:
    compressed_pickle = f.read()
# 16 + zlib.MAX_WBITS tells zlib to expect and skip the gzip header
rf_classifier = pickle.loads(zlib.decompress(compressed_pickle, 16 + zlib.MAX_WBITS))

EDIT

It appears that pickle protocols older than version 4 have a hard limit of 4 GiB on the size of serialized bytes objects. Protocol version 4, available since Python 3.4, does not have this limit; just specify the protocol version:

pickle.dumps(obj, protocol=4)

For older versions of Python, please refer to this answer: _pickle in python3 doesn't work for large data saving
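Since the question already uses joblib: joblib.dump can also compress on the fly while writing to disk, which avoids building the whole compressed blob in memory first (a minimal sketch; the filename is illustrative):

import joblib

# compress=3 is a speed/size trade-off (0-9); protocol=4 avoids the 4 GiB limit
joblib.dump(classifier, 'rf_classifier.joblib.gz', compress=3, protocol=4)
rf_classifier = joblib.load('rf_classifier.joblib.gz')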

Abhinav Upadhyay
  • I tried this, but I am getting the same error: OverflowError: cannot serialize a bytes object larger than 4 GiB – DumbCoder Sep 24 '18 at 14:12
  • @Shiv It seems Python versions prior to 3.4 had a hard-coded limit of 4 GB on pickled objects. If you are using a later version of Python, specify protocol=4 in the call to dump. For older versions of Python, I linked another answer in my answer. – Abhinav Upadhyay Sep 25 '18 at 11:02

A possible workaround is to dump the individual trees into a folder:

import pickle

path = '/folder/tree_{}'

# serialize each tree of the forest into its own file
for i, tree in enumerate(model.estimators_):
    with open(path.format(i), 'wb') as f:
        pickle.dump(tree, f, protocol=4)  # protocol 4 in case a single tree exceeds 4 GiB

In sklearn's implementation of Random Forest, the attribute "estimators_" is a list containing the individual fitted trees, which is why each tree can be serialized into the folder on its own.

To generate predictions, you can average the trees' class probabilities, which is what RandomForestClassifier itself does internally (averaging the hard predict labels would not be meaningful for class labels):

import pickle
import numpy as np

path = '/folder/tree_{}'

# load the trees
trees = []
for i in range(num_trees):  # num_trees = number of files written above
    with open(path.format(i), 'rb') as f:
        trees.append(pickle.load(f))

# generate predictions: each tree returns an (n_samples, n_classes) matrix
predictions = np.asarray([tree.predict_proba(X) for tree in trees])

# average over trees, as in a RF; argmax gives the index of the
# predicted class in the forest's classes_ array
y_proba = predictions.mean(axis=0)
y_pred = y_proba.argmax(axis=1)
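If holding all the trees in memory at once is also a problem, the same average can be computed with a running sum, loading one tree at a time (a sketch under the same assumptions as above: num_trees files written by the earlier loop, and a feature matrix X):

import pickle

path = '/folder/tree_{}'
proba_sum = None

# keep only one tree in memory at a time and accumulate its probabilities
for i in range(num_trees):
    with open(path.format(i), 'rb') as f:
        tree = pickle.load(f)
    proba = tree.predict_proba(X)
    proba_sum = proba if proba_sum is None else proba_sum + proba

y_proba = proba_sum / num_trees  # identical to averaging all trees at once
y_pred = y_proba.argmax(axis=1)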