
I have recently started learning deep learning, and a friend recommended Caffe to me. After installing it with OpenBLAS, I followed the MNIST tutorial in the documentation. But I later found that it was extremely slow and only one CPU core was working.

The problem is that the servers in my lab don't have GPUs, so I have to use CPUs instead.

I Googled this and found some pages like this one. I tried `export OPENBLAS_NUM_THREADS=8` and `export OMP_NUM_THREADS=8`, but Caffe still used only one core.
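Concretely, the environment I set before launching Caffe looked like this (8 was just my guess at a sensible thread count; adjust it for your machine):

```shell
# Ask OpenBLAS and OpenMP for 8 threads (assumes an 8-core machine).
export OPENBLAS_NUM_THREADS=8
export OMP_NUM_THREADS=8

# Sanity-check that the variables are set before launching caffe.
echo "$OPENBLAS_NUM_THREADS $OMP_NUM_THREADS"
```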

How can I make Caffe use multiple CPU cores?

Many thanks.

magic282
  • Did you build openblas to use threads? – Jeff Hammond May 12 '15 at 18:34
  • @Jeff I just `make` and `make install`. I found a [page](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded), but it does not say anything about building it to use threads. How can I build it to use threads? – magic282 May 13 '15 at 00:54
  • 1
    Read the docs. It's pretty clear. – Jeff Hammond May 13 '15 at 00:56
  • @Jeff I have to say I couldn't find any mention of compiling OpenBLAS with thread-related parameters. – magic282 May 13 '15 at 02:21
  • USE_OPENMP=1 is noted in https://github.com/xianyi/OpenBLAS/blob/develop/README.md. That's how I always build for threaded usage. – Jeff Hammond May 13 '15 at 02:28
  • @Jeff Sadly, Caffe still uses one CPU after I rebuilt OpenBLAS with `USE_OPENMP=1` and then rebuilt Caffe. :( – magic282 May 13 '15 at 04:50
  • What's your system config? – Jeff Hammond May 13 '15 at 05:01
  • @Jeff Centos 6.5, 24 core CPU. Did you mean these? – magic282 May 13 '15 at 05:14
  • How do you know Caffe is only using one core? The temporal resolution of `top` may not be enough to catch `dgemm` in action. Have you run `gprof` to see if increasing `OMP_NUM_THREADS` affects wall time? – Jeff Hammond May 13 '15 at 13:38
  • @Jeff I uninstalled and reinstalled the whole thing and now it works. But even though I can use all the CPU cores, it's still super slow. Well, the good news is that my boss bought a TITAN X for the lab. LOL – magic282 May 14 '15 at 11:49
  • There is an OpenMP version of Caffe on GitHub that is competitive with the GPU port for some workloads. You might try to find it. Threading GEMM isn't always the best way to make DNN go faster... – Jeff Hammond May 14 '15 at 14:26
  • @Jeff please, post link to OpenMP caffe fork. – mrgloom Nov 23 '15 at 11:28
  • @mrgloom See https://github.com/intelcaffe. – Jeff Hammond Nov 28 '15 at 00:54
  • @Jeff what is special about this fork? I can't see any references to openmp. – mrgloom Nov 30 '15 at 08:41
  • @mrgloom Sorry, they are in the process of reworking the OpenMP stuff. There is https://github.com/intelcaffe/caffe-old/tree/openmp but I suspect you want to watch e.g. https://github.com/intelcaffe/caffe/commits/openmp-conv-relu. – Jeff Hammond Nov 30 '15 at 13:28

3 Answers


@Karthik. That also works for me. One interesting discovery I made was that using 4 threads reduces the forward/backward pass time in the `caffe time` benchmark by a factor of 2. However, increasing the thread count to 8 or even 24 results in f/b times worse than what I get with OPENBLAS_NUM_THREADS=4. Here are the times for a few thread counts (tested on the NetworkInNetwork model):

Threads   f/b time (ms)
1         223
2         150
4         113
8         125
12        144

For comparison, on a Titan X GPU the f/b pass took 1.87 ms.
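For anyone wanting to reproduce these numbers: they came from Caffe's built-in `caffe time` benchmark, invoked roughly like this (the model path and iteration count are illustrative, not the exact ones I used):

```shell
# Pin the BLAS thread count, then benchmark forward/backward passes.
export OPENBLAS_NUM_THREADS=4
caffe time -model models/nin/train_val.prototxt -iterations 50
```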

GieBur

While building OpenBLAS, you have to set the flag `USE_OPENMP=1` to enable OpenMP support. Next, set Caffe to use OpenBLAS in its Makefile.config. At runtime, export the number of threads you want by setting `OMP_NUM_THREADS=n`, where n is the desired thread count.
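A sketch of the whole sequence, assuming OpenBLAS is checked out in `./OpenBLAS` and Caffe's Makefile.config already points at OpenBLAS (`BLAS := open`); the paths and thread count are illustrative:

```shell
# 1. Build and install OpenBLAS with OpenMP-threaded kernels.
cd OpenBLAS
make USE_OPENMP=1
sudo make install

# 2. Rebuild Caffe so it links against this OpenBLAS.
cd ../caffe
make clean
make all

# 3. At runtime, choose how many threads the BLAS may use.
export OMP_NUM_THREADS=8
```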

dipendra009

I found that this method works:

When you build Caffe, use `make all -j8` and `make pycaffe -j8` to build with 8 cores.

Also, make sure `OPENBLAS_NUM_THREADS=8` is set.

This question has a full script for the same.

Karthik Hegde