I am experimenting with multiple classifiers, and I want all of them saved and easily accessible to me during testing. At present, when I use LinearSVC, the trained model is 5 MB or less. When I use SVC, the model grows to more than 400 MB, which takes almost one minute to load into memory. I am fine with LinearSVC, but I would also like to experiment with RBF kernels. I cannot understand the huge difference between the two sizes. Can anyone explain why this happens (if it is explainable; otherwise point me to a probable bug), and maybe propose a way to reduce the size of the SVC model, or a way to use an RBF kernel without SVC? Thank you all.
Example
Taken from the tutorials page, with pickling added.
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
import cPickle as pickle
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
# pickle each fitted model to its own file (binary mode is safest for pickles)
with open('svcpick', 'wb') as out:
    pickle.dump(svc, out)
with open('rbfsvcpick', 'wb') as out:
    pickle.dump(rbf_svc, out)
with open('linsvcpick', 'wb') as out:
    pickle.dump(lin_svc, out)
print 'SVC(Linear):',os.path.getsize('./svcpick'),' B'
print 'SVC(RBF):',os.path.getsize('./rbfsvcpick'),' B'
print 'LinearSVC:',os.path.getsize('./linsvcpick'),' B'
Output:
SVC(Linear): 11481 B
SVC(RBF): 12087 B
LinearSVC: 1188 B
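To see where the bytes go, it may help to look at the arrays each fitted model keeps, since those are what ends up in the pickle. Below is a minimal sketch, assuming Python 3 (so `pickle` rather than `cPickle`). `pickled_size` is a hypothetical helper written for this illustration, not part of scikit-learn. It relies on `SVC` exposing its stored support vectors as `support_vectors_`, and on `LinearSVC` exposing its weights as `coef_`.

```python
import pickle

from sklearn import datasets, svm


def pickled_size(model, protocol=pickle.HIGHEST_PROTOCOL):
    """Return the size in bytes of the in-memory pickle of `model`."""
    return len(pickle.dumps(model, protocol=protocol))


iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X, y)
lin_svc = svm.LinearSVC(C=1.0).fit(X, y)

# SVC keeps a copy of every training sample that became a support vector.
print('SVC(RBF) support_vectors_ shape:', rbf_svc.support_vectors_.shape)
# LinearSVC keeps one weight row per class, independent of the sample count.
print('LinearSVC coef_ shape:', lin_svc.coef_.shape)

# Compare the text-based protocol 0 with the highest binary protocol.
for name, model in [('SVC(RBF)', rbf_svc), ('LinearSVC', lin_svc)]:
    print(name,
          'protocol 0:', pickled_size(model, protocol=0), 'B,',
          'highest protocol:', pickled_size(model), 'B')
```

If the first dimension of `support_vectors_` is close to the number of training samples, the stored support vectors are a likely candidate for most of the file size. The protocol comparison is only there to check how much of the size comes from the pickle encoding itself.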
Another example for multilabel classification
Again taken (partly) from tutorials
import os
import numpy as np
from sklearn import svm, datasets
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
import cPickle as pickle
# import some data to play with
X, Y = make_multilabel_classification(n_classes=10, n_labels=1,
                                      allow_unlabeled=True,
                                      random_state=1)
msvc = OneVsRestClassifier(svm.SVC(kernel='linear')).fit(X, Y)
mrbf_svc = OneVsRestClassifier(svm.SVC(kernel='rbf')).fit(X, Y)
mlin_svc = OneVsRestClassifier(svm.LinearSVC()).fit(X, Y)
# pickle each fitted model to its own file (binary mode is safest for pickles)
with open('msvcpick', 'wb') as out:
    pickle.dump(msvc, out)
with open('mrbfsvcpick', 'wb') as out:
    pickle.dump(mrbf_svc, out)
with open('mlinsvcpick', 'wb') as out:
    pickle.dump(mlin_svc, out)
print 'mSVC(Linear):',os.path.getsize('./msvcpick'),' B'
print 'mSVC(RBF):',os.path.getsize('./mrbfsvcpick'),' B'
print 'mLinearSVC:',os.path.getsize('./mlinsvcpick'),' B'
Output:
mSVC(Linear): 126539 B
mSVC(RBF): 561532 B
mLinearSVC: 9782 B
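In the multilabel case, `OneVsRestClassifier` fits one binary estimator per label and keeps them in `estimators_`, so each `SVC` carries its own set of support vectors. The sketch below, again assuming Python 3, counts how many rows are stored in total. It is a diagnostic under those assumptions, not a statement about what the numbers will be on other data.

```python
from sklearn import svm
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_classes=10, n_labels=1,
                                      allow_unlabeled=True,
                                      random_state=1)

mrbf_svc = OneVsRestClassifier(svm.SVC(kernel='rbf')).fit(X, Y)

total_rows = 0
for label, est in enumerate(mrbf_svc.estimators_):
    # Some per-label estimators may not be an SVC (for example when a label
    # never occurs in Y); treat those as storing zero support vectors.
    sv = getattr(est, 'support_vectors_', None)
    n_rows = 0 if sv is None else sv.shape[0]
    total_rows += n_rows
    print('label', label, 'support vectors:', n_rows)

print('total stored rows:', total_rows, 'x', X.shape[1], 'features')
```

Because the same training sample can be a support vector for several labels, the total can exceed the number of training samples.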
In my implementation I use multilabel classification with more than 2 classes, which is why I changed the default number of classes to 10. One can see the difference in size. In my implementation mLinearSVC is larger than 1 MB, not 10 KB as shown above, because of the high-dimensional data I have to process (256 features per sample).