
I have recently started learning deep learning, and a friend recommended Caffe to me. After installing it with OpenBLAS, I followed the MNIST tutorial in the documentation. But I later found that it was extremely slow and only one CPU core was working.

The problem is that the servers in my lab don't have GPUs, so I have to use CPUs instead.

I Googled this and found some pages like this one. I tried `export OPENBLAS_NUM_THREADS=8` and `export OMP_NUM_THREADS=8`, but Caffe still used only one core.
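Concretely, the environment I set before launching Caffe looked like this (8 was just my guess at a sensible thread count; adjust it for your machine):

```shell
# Ask OpenBLAS and OpenMP for 8 threads (assumes an 8-core machine).
export OPENBLAS_NUM_THREADS=8
export OMP_NUM_THREADS=8

# Sanity-check that the variables are set before launching caffe.
echo "$OPENBLAS_NUM_THREADS $OMP_NUM_THREADS"
```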

How can I make Caffe use multiple CPU cores?

Many thanks.

magic282
  • Did you build openblas to use threads? – Jeff Hammond May 12 '15 at 18:34
  • @Jeff I just `make` and `make install`. I found a [page](https://github.com/xianyi/OpenBLAS/wiki/faq#multi-threaded), but it does not say anything about building it to use threads. How can I build it to use threads? – magic282 May 13 '15 at 00:54
  • 1
    Read the docs. It's pretty clear. – Jeff Hammond May 13 '15 at 00:56
  • @Jeff I have to say I couldn't find any mention of compiling OpenBLAS with thread-related parameters. – magic282 May 13 '15 at 02:21
  • USE_OPENMP=1 is noted in https://github.com/xianyi/OpenBLAS/blob/develop/README.md. That's how I always build for threaded usage. – Jeff Hammond May 13 '15 at 02:28
  • @Jeff Sadly, Caffe still uses one CPU after I rebuilt OpenBLAS with `USE_OPENMP=1` and then rebuilt Caffe. :( – magic282 May 13 '15 at 04:50
  • What's your system config? – Jeff Hammond May 13 '15 at 05:01
  • @Jeff Centos 6.5, 24 core CPU. Did you mean these? – magic282 May 13 '15 at 05:14
  • How do you know Caffe is only using one core? The temporal resolution of `top` may not be enough to catch `dgemm` in action. Have you run `gprof` to see if increasing `OMP_NUM_THREADS` affects wall time? – Jeff Hammond May 13 '15 at 13:38
  • @Jeff I uninstalled and reinstalled the whole thing and now it works. But even though I can use all the CPU cores, it's still super slow. Well, the good news is that my boss bought a TITAN X for the lab. LOL – magic282 May 14 '15 at 11:49
  • There is an OpenMP version of Caffe on GitHub that is competitive with the GPU port for some workloads. You might try to find it. Threading GEMM isn't always the best way to make DNN go faster... – Jeff Hammond May 14 '15 at 14:26
  • @Jeff please, post link to OpenMP caffe fork. – mrgloom Nov 23 '15 at 11:28
  • @mrgloom See https://github.com/intelcaffe. – Jeff Hammond Nov 28 '15 at 00:54
  • @Jeff what is special about this fork? I can't see any references to openmp. – mrgloom Nov 30 '15 at 08:41
  • @mrgloom Sorry, they are in the process of reworking the OpenMP stuff. There is https://github.com/intelcaffe/caffe-old/tree/openmp but I suspect you want to watch e.g. https://github.com/intelcaffe/caffe/commits/openmp-conv-relu. – Jeff Hammond Nov 30 '15 at 13:28

3 Answers


@Karthik. That also works for me. One interesting discovery I made was that using 4 threads reduces the forward/backward pass time in the `caffe time` benchmark by a factor of 2. However, increasing the thread count to 8 or even 24 results in f/b times worse than what I get with OPENBLAS_NUM_THREADS=4. Here are the times for a few thread counts (tested on the NetworkInNetwork model):

Threads   f/b time (ms)
1         223
2         150
4         113
8         125
12        144

For comparison, on a Titan X GPU the f/b pass took 1.87 ms.
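For anyone wanting to reproduce these numbers: they came from Caffe's built-in `caffe time` benchmark, invoked roughly like this (the model path and iteration count are illustrative, not the exact ones I used):

```shell
# Pin the BLAS thread count, then benchmark forward/backward passes.
export OPENBLAS_NUM_THREADS=4
caffe time -model models/nin/train_val.prototxt -iterations 50
```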

GieBur

While building OpenBLAS, you have to set the flag `USE_OPENMP=1` to enable OpenMP support. Next, set Caffe to use OpenBLAS in its Makefile.config. At runtime, export the number of threads you want by setting `OMP_NUM_THREADS=n`, where n is the desired thread count.
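A sketch of the whole sequence, assuming OpenBLAS is checked out in `./OpenBLAS` and Caffe's Makefile.config already points at OpenBLAS (`BLAS := open`); the paths and thread count are illustrative:

```shell
# 1. Build and install OpenBLAS with OpenMP-threaded kernels.
cd OpenBLAS
make USE_OPENMP=1
sudo make install

# 2. Rebuild Caffe so it links against this OpenBLAS.
cd ../caffe
make clean
make all

# 3. At runtime, choose how many threads the BLAS may use.
export OMP_NUM_THREADS=8
```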

dipendra009

I found that this method works:

When you build Caffe, use `make all -j8` and `make pycaffe -j8` to build with 8 cores.

Also, make sure `OPENBLAS_NUM_THREADS=8` is set.

This question has a full script for the same.

Karthik Hegde