13

I have been blowing my brains out over the past 2-3 weeks on this problem. I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.

I have around 4.5 million text documents as training data and around 1 million as test data. The labels are around 35K.

I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer which didn't scale at all, now I am using HashVectorizer which is better but not that scalable given the number of documents that I have.

vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))

SKlearn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label I found LinearSVC & SGDClassifier only to be working correctly. Acc to my benchmarks SGD outperforms LinearSVC both in memory & time. So, I have something like this

clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)

But this suffers from some serious issues:

  1. OneVsRest does not have a partial_fit method which makes it impossible for out-of-core learning. Are there any alternatives for that?
  2. HashingVectorizer/Tfidf both work on a single core and don't have any n_jobs parameter. It's taking too much time to hash the documents. Any alternatives/suggestions? Also is the value of n_features correct?
  3. I tested on 1 million documents. The Hashing takes 15 minutes and when it comes to clf.fit(X, y), I receive a MemoryError because OvR internally uses LabelBinarizer and it tries to allocate a matrix of dimensions (y x classes) which is fairly impossible to allocate. What should I do?
  4. Any other libraries out there which have reliable & scalable multi-label algorithms? I know of genism & mahout but both of them don't have anything for multi-label situations?
Gaurav Kumar
  • 986
  • 13
  • 18
  • 2
    Just a remark when you say "HashVectorizer which is better but not that scalable": `HashVectorizer` is perfectly scalable: if you throw twice as much computational resource you will process data twice faster (you can partition the data and run the processing in parallel thanks to it statelessness and bounded memory usage). This is the exact definition of scalability. I agree that `HashVectorizer` could probably be more optimized to work faster on the same computational resources but this has nothing to do with the scalability problem. – ogrisel Sep 08 '13 at 16:03
  • Thanks for the clarification. I do agree that HV is really advantageous over Tfidf, I wasn't sure on the data partitioning part. Now I did a small POC to partition data and run the HV on the parts separately and then combine the results later. What I meant initially was that the work on the algorithm part is a great achievement but still it can be made more scalable like you suggested to partition & run in parallel. (After I've done, I will submit a PR so that HV also has a n_jobs parameter) – Gaurav Kumar Sep 09 '13 at 08:55
  • Unfortunately in the current implementation of joblib used in scikit-learn we use multiprocessing hence the input data has to be copied to be sent over to the subprocesses. So such a n_jobs parameter would add a significant overhead and might not be beneficial at all. If you really have large dataset it's better to handle many parallel out-of-core loops that deal with the data access (disk, DB, network...) themselves an avoid any memory copy. However such boiler plate code will probably never be included in scikit-learn as too project specific / frameworkish. – ogrisel Sep 09 '13 at 09:39

4 Answers4

8

I would do the multi-label part by hand. The OneVsRestClassifier treats them as independent problems anyhow. You can just create the n_labels many classifiers and then call partial_fit on them. You can't use a pipeline if you only want to hash once (which I would advise), though. Not sure about speeding up hashing vectorizer. You gotta ask @Larsmans and @ogrisel for that ;)

Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.

Andreas Mueller
  • 27,470
  • 8
  • 62
  • 74
  • I am not surprised ;) – Andreas Mueller Sep 08 '13 at 15:03
  • Thanks, if I were to code OvR by hand, which estimator would you recommend for this problem? Also, say, I fire up 35K estimators (n_labels) and individually fit them on the training data. How would I compute the labels from these? Those estimators with the individual predict_proba > 0.5 will have their labels associated with that sample. Will this approach work? (sorry, I'm just 3 weeks old at ML & sklearn) – Gaurav Kumar Sep 08 '13 at 15:21
  • 3
    You can should try to train independent instances of `SGDClassifier` and `PassiveAggressiveClassifier` and maybe `MultinomialNB` as binary classifiers (one for each label). Then you can rank the top predictions based on the values of `predict_proba` or `decision_function` and take the top 5 labels (or less if they predict below 0.5 proba or negative decision function). You can also train a second regression model that takes the probas of the binary classification models and predict the expected number of positive labels (the value of k in top k) to retain for each instance. – ogrisel Sep 08 '13 at 16:10
  • +1 for linear models (why would you use multinomial instead of Bernoulli olivier?). I would really first try the thresholding and see how that works. If the labels are very unbalanced, you might need to adjust class weights. Btw, 35k is quite a lot. You might run into memory trouble. Keep in mind you need to store n_labels * n_features coefficients. – Andreas Mueller Sep 08 '13 at 19:44
  • Thanks a lot for all your valuable suggestions. I am currently building a custom multi-label wrapper by hand over `SGDClassifier`. I'm using `decision_function` since they have only 1 float value while `predict_proba` has 2 values- one for 0 and one for 1 class. I will report my progress soon, or problems if I run into any. – Gaurav Kumar Sep 09 '13 at 09:02
  • @AndreasMueller: MultinomialNB is the default option for text classification. BernoulliNB is better for problems with boolean variables, or for very short texts. – Fred Foo Sep 09 '13 at 09:29
8
  1. The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
  2. Multi-core support in scikit-learn is work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't come round to it yet.
  3. If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
  4. The trick in (1.) applies to any classification algorithm.

As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • Thanks, I will try to implement OvR by hand and will try to circumvent scalability issues. I forgot to mention that the length of each document is very small (200 words or so). So, I figured that 1024 features should be enough because 2^18 were giving me a lot of memory problems. I even went to the extent of firing up an AWS instance of 30 GB RAM but that didn't work either. – Gaurav Kumar Sep 08 '13 at 15:22
  • 2
    If you have 35K binary classifiers with 2 ** 18 features you will need 73GB just to store the aggregate model. It might be possible to sparsify the models once the weights are learned to spare memory at prediction time but AFAIK this is not implemented yet in scikit-learn. You can implement the `decision_function` manually with `safe_sparse_dot` to do that. – ogrisel Sep 08 '13 at 16:14
  • 1
    To train models that have many zero weights hence which would lead to improved memory usage once the `coef_` attribute is stored as `scipy.sparse` matrix, you should use `SGDClassifier` with `penalty="elasticnet"` or `"l1"`. – ogrisel Sep 08 '13 at 16:16
  • 2
    @ogrisel: linear classifiers have a `sparsify` method that converts the `coef_` into a sparse matrix format (CSR). – Fred Foo Sep 08 '13 at 19:06
  • Great, I wasn't sure. Then this is the way to go Gaurav: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.sparsify – ogrisel Sep 08 '13 at 19:41
  • Thanks, I'm currently using `SGDClassifier`. Do I use `penalty='l2'` and call `sparsify()` OR using `penalty='l1'` only is enough for having sparse coefficient matrices? – Gaurav Kumar Sep 09 '13 at 08:59
  • @GauravKumar: you need both. The thing is that you can't train while the coefficients are sparse, so you sparsify after training (e.g. before serializing). – Fred Foo Sep 09 '13 at 09:28
1

My argument for scalability is that instead of using OneVsRest which is just a simplest of simplest baselines, you should use a more advanced ensemble of problem-transformation methods. In my paper I provide a scheme for dividing label space into subspaces and transforming the subproblems into multi-class single-label classifications using Label Powerset. To try this, just use the following code that utilizes a multi-label library built on top of scikit-learn - scikit-multilearn:

from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset

from sklearn.linear_model import SGDClassifier

# base multi-class classifier SGD
base_classifier = SGDClassifier(loss='log', penalty='l2', n_jobs=-1)

# problem transformation from multi-label to single-label multi-class
transformation_classifier = LabelPowerset(base_classifier)

# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True) 

# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)

clf.fit(x_train, y_train)
prediction = clf.predict(x_test)
niedakh
  • 2,819
  • 2
  • 20
  • 19
0

The partial_fit() method was recently added to sklearn so hopefully it should be available in the upcoming release (it's in the master branch already).

The size of your problem makes it attractive to tackling it with neural networks. Have a look at magpie, it should give much better results than linear classifiers.

stpk
  • 2,015
  • 1
  • 16
  • 23