
I am using sklearn's RandomForestClassifier to predict a set of classes. I have over 26,000 classes, so the size of the fitted classifier exceeds 30 GB. I am running it on Linux with 64 GB of RAM and 20 GB of storage.

I am trying to pickle my model with joblib, but it is not working, presumably because I don't have enough secondary storage. Is there any way this could be done? Maybe some kind of compression technique, or something else?

DumbCoder

2 Answers


You could try gzipping the pickle:

import gzip, io, pickle

# io.BytesIO, not StringIO: gzip produces binary data
compressed_pickle = io.BytesIO()
with gzip.GzipFile(fileobj=compressed_pickle, mode='wb') as f:
    f.write(pickle.dumps(classifier))

Then you can write the compressed_pickle to a file.
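A minimal sketch of that step (the filename matches the read-back snippet below):

# write the compressed bytes out to disk
with open('rf_classifier.pickle', 'wb') as out:
    out.write(compressed_pickle.getvalue())  # getvalue() returns the buffer's bytes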

To read it back:

import pickle, zlib

with open('rf_classifier.pickle', 'rb') as f:
    compressed_pickle = f.read()
# 16 + zlib.MAX_WBITS tells zlib to expect and skip the gzip header
rf_classifier = pickle.loads(zlib.decompress(compressed_pickle, 16 + zlib.MAX_WBITS))

EDIT

It appears that pickle protocols older than version 4 have a hard limit of 4 GiB on the size of serialized bytes objects. Protocol version 4, available since Python 3.4, does not have this limit; just specify the protocol version:

pickle.dumps(obj, protocol=4)

For older versions of Python, please refer to this answer: _pickle in python3 doesn't work for large data saving
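Since the question already uses joblib: joblib.dump can also compress on the fly while writing to disk, which avoids building the whole compressed blob in memory first (a minimal sketch; the filename is illustrative):

import joblib

# compress=3 is a speed/size trade-off (0-9); protocol=4 avoids the 4 GiB limit
joblib.dump(classifier, 'rf_classifier.joblib.gz', compress=3, protocol=4)
rf_classifier = joblib.load('rf_classifier.joblib.gz')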

Abhinav Upadhyay
  • I tried this, but I am getting the same error: OverflowError: cannot serialize a bytes object larger than 4 GiB – DumbCoder Sep 24 '18 at 14:12
  • @Shiv It seems Python versions prior to 3.4 had a hard-coded limit of 4 GB on pickled objects. If you are using a later version of Python, specify protocol=4 in the call to dump. For older versions of Python, I linked another answer in my answer. – Abhinav Upadhyay Sep 25 '18 at 11:02

A possible workaround is to dump the individual trees into a folder:

import pickle

path = '/folder/tree_{}'

# serialize each tree of the forest into its own file
for i, tree in enumerate(model.estimators_):
    with open(path.format(i), 'wb') as f:
        pickle.dump(tree, f, protocol=4)  # protocol 4 in case a single tree exceeds 4 GiB

In sklearn's implementation of Random Forest, the attribute "estimators_" is a list containing the individual fitted trees, which is why each tree can be serialized into the folder on its own.

To generate predictions, you can average the trees' class probabilities, which is what RandomForestClassifier itself does internally (averaging the hard predict labels would not be meaningful for class labels):

import pickle
import numpy as np

path = '/folder/tree_{}'

# load the trees
trees = []
for i in range(num_trees):  # num_trees = number of files written above
    with open(path.format(i), 'rb') as f:
        trees.append(pickle.load(f))

# generate predictions: each tree returns an (n_samples, n_classes) matrix
predictions = np.asarray([tree.predict_proba(X) for tree in trees])

# average over trees, as in a RF; argmax gives the index of the
# predicted class in the forest's classes_ array
y_proba = predictions.mean(axis=0)
y_pred = y_proba.argmax(axis=1)
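If holding all the trees in memory at once is also a problem, the same average can be computed with a running sum, loading one tree at a time (a sketch under the same assumptions as above: num_trees files written by the earlier loop, and a feature matrix X):

import pickle

path = '/folder/tree_{}'
proba_sum = None

# keep only one tree in memory at a time and accumulate its probabilities
for i in range(num_trees):
    with open(path.format(i), 'rb') as f:
        tree = pickle.load(f)
    proba = tree.predict_proba(X)
    proba_sum = proba if proba_sum is None else proba_sum + proba

y_proba = proba_sum / num_trees  # identical to averaging all trees at once
y_pred = y_proba.argmax(axis=1)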