Is it possible to parallelize python-crfsuite (https://github.com/tpeng/python-crfsuite)? I think CRF++ supports parallelization, so I guess there must be some hooks to enable parallelization with CRFsuite as well.

Franck Dernoncourt
Joe Cheri Ross

1 Answer

No, it is currently not possible; parallel training is not implemented. There is some work on it in the bug tracker, though. You can still run cross-validation in parallel (i.e. train multiple models in parallel).
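A minimal sketch of that workaround, using only the standard library: each train/test split is an independent job, so one model can be trained per process. `train_one_fold` here is a hypothetical stand-in, not python-crfsuite API; in real code it would build a `pycrfsuite.Trainer`, append the fold's training sequences, and call its `train` method.

```python
from concurrent.futures import ProcessPoolExecutor


def train_one_fold(fold):
    """Hypothetical stand-in for training and scoring one model on
    one train/test split. Replace the body with a real pycrfsuite
    training run and evaluation."""
    train_split, test_split = fold
    # ... fit a model on train_split, evaluate on test_split ...
    return sum(test_split) / len(test_split)  # placeholder "score"


if __name__ == "__main__":
    # Each fold runs in its own process, so CPython's GIL does not
    # prevent the folds from training on separate cores.
    folds = [([1, 2, 3], [4, 5]), ([2, 3, 4], [5, 6]), ([3, 4, 5], [6, 7])]
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_one_fold, folds))
    print(scores)
```

Because training happens inside worker processes, only the fold data and the returned score cross the process boundary.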

Mikhail Korobov
  • You can use the model_selection module (GridSearch) to optimize hyper-parameters in parallel. However, for practical-size data (nothing large, really) it gets serialized, since the model becomes large and pickling/unpickling the model with its associated data to the workers takes most of the time. So unless the data is very small (tiny, really), multiprocessing becomes essentially serial. – Kai Oct 16 '17 at 16:26
  • I am using this with conll-2002 data, which is not big at all: 14987 sentences for training, with baseline features only. When I run top I see 8 Python processes (since I have 8 CPU cores) running round-robin, one at a time while the other 7 sleep... no parallelization at all. If you add more features, it gets even slower. – Kai Oct 16 '17 at 16:47
  • Kai: you can extract features after starting a process; this is the right thing to do for cross-validation anyway. In that case only the input sentences would be serialized. – Mikhail Korobov Oct 16 '17 at 16:57
  • I meant conll-2003 data. – Kai Oct 16 '17 at 17:27
  • Mikhail: wouldn't you still have to provide the data to each process? That is what takes the time: copying the data and sending it to each spawned process. In my case, "feature extraction" is just a lookup in a dictionary, so I would still have to send a copy of the whole dictionary (a few MB in size) to each process. – Kai Oct 17 '17 at 12:32
  • Let's say you have a script which loads data, extracts features, and runs training/prediction on a particular train/test split. You start 8 of these scripts for different train/test splits and the cross-validation work is parallelized. If your code using multiprocessing is slower than that (and unable to make use of 8 cores), then you're not sending data to the processes at the right time; it is not a problem with the approach, but a problem with the implementation. I can see how sklearn's interfaces make it easy to write such suboptimal code, though. It may be better to start a separate SO question for it. – Mikhail Korobov Oct 17 '17 at 15:54
  • Mikhail, I am actually trying to speed up grid search on crfsuite for hyper-parameter optimization. It's taking forever. – Kai Oct 17 '17 at 18:35