10

I got LinearSVC working against a training set and a test set using the load_files method, and now I am trying to get it working in a multiprocessor environment.

How can I get multiprocessing to work with LinearSVC().fit() and LinearSVC().predict()? I am not really familiar with scikit-learn's data types yet.

I am also thinking about splitting the samples into multiple arrays, but I am not familiar with numpy arrays or scikit-learn data structures.

That would make it easier to feed into multiprocessing.Pool(): split the samples into chunks, train on them, and combine the trained sets back together later. Would that work?

EDIT: Here is my scenario:

Let's say we have 1 million files in the training sample set. When we want to distribute the processing of TfidfVectorizer across several processors, we have to split those samples (in my case there will only be two categories, so let's say 500,000 samples each to train on). My server has 24 cores and 48 GB of RAM, so I want to split each topic into 1000000 / 24 chunks and run TfidfVectorizer on them. I would do the same for the testing sample set, as well as for SVC.fit() and decide(). Does it make sense?
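Roughly, this is the kind of chunking I have in mind (the directory name and helper are hypothetical, and I am not sure the per-chunk results can be recombined afterwards, which is what I am asking):

```python
from multiprocessing import Pool

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

def vectorize_chunk(docs):
    # Each worker fits its own TfidfVectorizer on its chunk only,
    # so every chunk ends up with a different vocabulary (columns).
    return TfidfVectorizer().fit_transform(docs)

if __name__ == '__main__':
    data = load_files('training_samples/')  # hypothetical directory
    n_chunks = 24
    size = len(data.data) // n_chunks
    chunks = [data.data[i:i + size]
              for i in range(0, len(data.data), size)]
    with Pool(n_chunks) as pool:
        matrices = pool.map(vectorize_chunk, chunks)
    # The resulting matrices have incompatible columns, which is
    # exactly the recombination problem I am asking about.
```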

Thanks.

PS: Please do not close this.

Phyo Arkar Lwin
  • Correct me if I'm wrong, but an SVM usually doesn't take long to make a decision. It might make more sense to perform the decoding for different samples in parallel than to parallelize the decoding for one sample. – Qnan Oct 25 '12 at 12:16
  • What if I am going to do that on 21 million documents? Would it take long? – Phyo Arkar Lwin Oct 25 '12 at 12:32
  • I am thinking about different samples too. Is it possible to re-combine the different samples after splitting them for each process? – Phyo Arkar Lwin Oct 25 '12 at 12:39
  • I don't think I get your question. The samples *are* independent. Why do you have to re-combine something? – Qnan Oct 25 '12 at 13:37
  • I have described my scenario in the EDIT to the question above: split 1 million training files into 1000000 / 24 chunks and run TfidfVectorizer on each in parallel, and do the same for the testing set, SVC.fit() and decide(). Does it make sense? – Phyo Arkar Lwin Oct 25 '12 at 14:49
  • So for those splits of samples, I need to recombine them at the end of multiprocessing to get the training sets back. – Phyo Arkar Lwin Oct 25 '12 at 14:50
  • I see. You mentioned only testing before, which is why I was surprised. Once the model is trained, the decision can be made for each sample in the testing set independently, so that parallelizes well. Training is a different thing, however: parallelizing SVM training is by no means trivial, and to my knowledge scikit-learn doesn't implement it. – Qnan Oct 25 '12 at 15:59
  • `TfidfVectorizer` is not parallelizable because of the central vocabulary. We would either need a shared vocabulary (e.g. using a redis server on the cluster) or implement a `HashVectorizer`, which does not exist yet. – ogrisel Oct 26 '12 at 09:20
  • What is the status of the hashing vectorizer? I would also like to be able to use joblib.Parallel for vectorization. – John Thompson Dec 27 '12 at 22:20
  • I see some pull requests on GitHub for 0.14 regarding parallelism. I haven't had a chance to test them yet because we are already in development on 0.13. – Phyo Arkar Lwin Apr 03 '13 at 17:53

2 Answers

13

I think using SGDClassifier instead of LinearSVC for this kind of data would be a good idea, as it is much faster. For the vectorization, I suggest you look into the hash transformer PR.

For the multiprocessing: you can distribute the data chunks across the cores, do partial_fit on each, get the weight vectors, average them, redistribute them to the estimators, and do another partial_fit pass.
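A minimal sketch of one round of that scheme on a synthetic dense dataset (the chunk-and-average loop is a heuristic I'm describing, not a built-in scikit-learn feature):

```python
import numpy as np
from multiprocessing import Pool

from sklearn.linear_model import SGDClassifier

def fit_on_chunk(args):
    # Train a fresh estimator on one chunk of the data.
    X_chunk, y_chunk, classes = args
    clf = SGDClassifier(loss='hinge')
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
    return clf.coef_, clf.intercept_

if __name__ == '__main__':
    rng = np.random.RandomState(0)
    X = rng.randn(10000, 20)
    y = (X[:, 0] > 0).astype(int)               # toy binary labels
    classes = np.unique(y)
    chunks = [(X[i::4], y[i::4], classes) for i in range(4)]
    with Pool(4) as pool:
        results = pool.map(fit_on_chunk, chunks)
    # Average the weight vectors, then redistribute and repeat.
    coef = np.mean([c for c, _ in results], axis=0)
    intercept = np.mean([b for _, b in results], axis=0)
```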

Doing parallel gradient descent is an area of active research, so there is no ready-made solution there.

How many classes does your data have, btw? For each class, a separate binary classifier will be trained (automatically). If you have nearly as many classes as cores, it might be better and much easier to just do one class per core, by specifying n_jobs in SGDClassifier.
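For example, a toy sketch of the one-class-per-core route (the dataset here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# One-vs-all multiclass: one binary classifier per class;
# n_jobs fits them in parallel across cores via joblib.
X, y = make_classification(n_samples=5000, n_features=50,
                           n_informative=10, n_classes=4)
clf = SGDClassifier(n_jobs=4).fit(X, y)
```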

Andreas Mueller
11

For linear models (LinearSVC, SGDClassifier, Perceptron...) you can chunk your data, train independent models on each chunk, and build an aggregate linear model (e.g. SGDClassifier) by sticking the averaged values of coef_ and intercept_ into it as attributes. The predict methods of LinearSVC, SGDClassifier and Perceptron all compute the same function (a linear prediction using a dot product with an intercept_ threshold, with one-vs-all multiclass support), so the specific model class you use for holding the averaged coefficients is not important.
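For instance, a toy sketch of that averaging trick (it assumes every chunk sees all the classes and the same feature space; the dataset and 4-way chunking are synthetic):

```python
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=8000, random_state=0)

# Train one independent model per chunk of the data.
models = [SGDClassifier().fit(X[i::4], y[i::4]) for i in range(4)]

# Build the aggregate model by averaging coef_ and intercept_.
agg = SGDClassifier()
agg.coef_ = np.mean([m.coef_ for m in models], axis=0)
agg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
agg.classes_ = models[0].classes_   # assumes identical class sets

print(agg.predict(X[:5]))           # predicts with the averaged weights
```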

However, as said previously, the tricky part is parallelizing the feature extraction, and the current scikit-learn (version 0.12) does not provide any easy way to do this.

Edit: scikit-learn 0.13+ now has a hashing vectorizer (HashingVectorizer) that is stateless.
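Since it is stateless, each worker can vectorize its own chunk of documents independently and the resulting matrices share the same columns, so the rows can simply be stacked back together. A sketch (the documents and 2-way split are toy placeholders):

```python
from multiprocessing import Pool

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no vocabulary to share between workers, so matrices
# produced on different chunks have compatible columns.
vectorizer = HashingVectorizer(n_features=2 ** 20)

def vectorize(docs):
    return vectorizer.transform(docs)

if __name__ == '__main__':
    documents = ["first doc", "second doc", "third doc", "fourth doc"]
    chunks = [documents[0::2], documents[1::2]]   # toy 2-way split
    with Pool(2) as pool:
        parts = pool.map(vectorize, chunks)
    X = sp.vstack(parts)   # rows stack back into one training matrix
```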

ogrisel
  • Thanks man, I will test it out. So TfidfVectorizer is not parallelizable yet, right? Feature extraction takes the most time in our tests. – Phyo Arkar Lwin Oct 26 '12 at 14:59
  • Yes, this is a known limitation of scikit-learn. Efficient parallelizable text feature extraction using a hashing vectorizer is high on my personal priority list, but I have got even higher priorities right now :) – ogrisel Oct 27 '12 at 16:23
  • I see. If I want to contribute, where should I start? – Phyo Arkar Lwin Oct 31 '12 at 15:14
  • If you want to contribute a hashing text vectorizer you should first get familiar with the existing `CountVectorizer` implementation by reading its source code and the source code of related files. Then read the following paper [Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009](http://arxiv.org/pdf/0902.2206.pdf), then have a look at this [pull request on a hashing transformer](https://github.com/scikit-learn/scikit-learn/pull/909) that is closely related but not a hashing text vectorizer. – ogrisel Oct 31 '12 at 15:24
  • Then read the [contributors guide of scikit-learn](http://scikit-learn.org/dev/developers/index.html#contributing). – ogrisel Oct 31 '12 at 15:25
  • Thanks a lot, I will look into it. Actually I was also looking into CountVectorizer and saw some places where multiprocessing could work. I am already thinking about putting the standard Python multiprocessing.Pool() on some loops, such as the _word_ngrams() and _char_wb_ngrams() methods, without even using a hashing vectorizer. – Phyo Arkar Lwin Nov 01 '12 at 16:31
  • Please use `joblib.Parallel` rather than multiprocessing loops directly (see other usages in the scikit-learn source code for examples). AFAIK, we did try to parallelize such inner loops, but the overhead makes it not worthwhile at that level. – ogrisel Nov 02 '12 at 20:23
  • I see. I do not have experience with joblib.Parallel and am not sure about its performance (and stability). – Phyo Arkar Lwin Nov 06 '12 at 15:01
  • I will look into it. FYI, I have a question about extracting features from each file (I want to show the top 10 terms of each file in the test data set): http://stackoverflow.com/q/13181409/200044 – Phyo Arkar Lwin Nov 06 '12 at 15:17