Is it possible to parallelize python-crfsuite (https://github.com/tpeng/python-crfsuite)? I think CRF++ supports parallelization, so I guess there must be some hooks to enable parallelization with CRFsuite as well.

Franck Dernoncourt
Joe Cheri Ross

1 Answer

No, it is currently not possible; parallel training is not implemented. There is some work on it in the bug tracker, though. You can still run cross-validation in parallel (i.e. train multiple models in parallel).
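A minimal sketch of that workaround, using only the standard library: each train/test split is an independent job, so one model can be trained per process. `train_one_fold` here is a hypothetical stand-in, not python-crfsuite API; in real code it would build a `pycrfsuite.Trainer`, append the fold's training sequences, and call its `train` method.

```python
from concurrent.futures import ProcessPoolExecutor


def train_one_fold(fold):
    """Hypothetical stand-in for training and scoring one model on
    one train/test split. Replace the body with a real pycrfsuite
    training run and evaluation."""
    train_split, test_split = fold
    # ... fit a model on train_split, evaluate on test_split ...
    return sum(test_split) / len(test_split)  # placeholder "score"


if __name__ == "__main__":
    # Each fold runs in its own process, so CPython's GIL does not
    # prevent the folds from training on separate cores.
    folds = [([1, 2, 3], [4, 5]), ([2, 3, 4], [5, 6]), ([3, 4, 5], [6, 7])]
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_one_fold, folds))
    print(scores)
```

Because training happens inside worker processes, only the fold data and the returned score cross the process boundary.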

Mikhail Korobov
  • You can use the model_selection module (GridSearch) to optimize hyper-parameters in parallel. However, for practical-size data (nothing large, really) it gets serialized, since the model becomes large and pickling/unpickling the model with its associated data to the workers takes most of the time. So unless the data is very small (tiny, really), multiprocessing becomes essentially serial. – Kai Oct 16 '17 at 16:26
  • I am using this with conll-2002 data, which is not big at all: 14987 sentences for training, with baseline features only. When I run top I see 8 Python processes (since I have 8 CPU cores) running round-robin, one at a time while the other 7 sleep... no parallelization at all. If you add more features, it gets even slower. – Kai Oct 16 '17 at 16:47
  • Kai: you can extract features after starting a process; this is the right thing to do for cross-validation anyway. In that case only the input sentences would be serialized. – Mikhail Korobov Oct 16 '17 at 16:57
  • I meant conll-2003 data. – Kai Oct 16 '17 at 17:27
  • Mikhail: wouldn't you still have to provide the data to each process? That is what takes the time: copying the data and sending it to each spawned process. In my case, "feature extraction" is just a lookup in a dictionary, so I would still have to send a copy of the whole dictionary (a few MB in size) to each process. – Kai Oct 17 '17 at 12:32
  • Let's say you have a script which loads data, extracts features, and runs training/prediction on a particular train/test split. You start 8 of these scripts for different train/test splits and the cross-validation work is parallelized. If your code using multiprocessing is slower than that (and unable to make use of 8 cores), then you're not sending data to the processes at the right time; it is not a problem with the approach, but a problem with the implementation. I can see how sklearn's interfaces make it easy to write such suboptimal code, though. It may be better to start a separate SO question for it. – Mikhail Korobov Oct 17 '17 at 15:54
  • Mikhail, I am actually trying to speed up grid search on crfsuite for hyper-parameter optimization. It's taking forever. – Kai Oct 17 '17 at 18:35