4

When defining a network in Caffe/Caffe2, can you place some of the nodes on the CPU and others on the GPU? If so, how?

(If your answer pertains to a specific version of Caffe, please specify which.)

MWB

4 Answers

1

No, it's not possible. If you look at the solver.prototxt file you'll notice that you may specify the mode as either CPU or GPU, but not both. The reason for this execution structure is efficiency. The data generated by each layer of a CNN can run to megabytes. If you keep part of the network on the CPU and part of it on the GPU, you'll need to transfer huge chunks of data back and forth between the devices. This adds a huge overhead that completely undoes the advantage given by the GPU, so it is more efficient to train the entire network on the GPU rather than on a CPU-GPU combination.

Also note that the GPU is connected to the CPU via a PCIe interface, which is significantly slower than the internal CPU bus, so data transfer between the devices is really expensive. That's one of the reasons why larger batch sizes are preferred for training CNNs: a batch of images can be sent to the GPU at once, avoiding repetitive memory reads and writes.
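For reference, the switch in question is a single global flag in solver.prototxt; there is no per-layer device field. A minimal sketch (the net file name here is illustrative):

net: "train_val.prototxt"   # points to the network definition
solver_mode: GPU            # the entire net runs in one mode: CPU or GPU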

Harsh Wardhan
  • Thanks for the first 2 sentences, but I think the rest is incorrect. (1) PCIe bandwidth is comparable to RAM bandwidth *if you set it up right* (2) It's actually common to keep word2vec parameters on the CPU while doing the rest on the GPU. – MWB Jun 05 '17 at 13:08
  • .. because GPU RAM is limited. – MWB Jun 05 '17 at 13:11
  • I think there's a bit of misunderstanding here. I meant to say that CPU internal buses are quicker than PCIe, though this piece of info may not be exactly relevant here. Coming to your 2nd point - GPUs are good with SIMD operations but CPUs are better with operations involving branching/decision making. word2vec involves decision making, so it is more efficient on the CPU. But it is a preprocessing step that can be done on the CPU in the initial phase; that does not imply you can shuttle data back and forth all the time. Preprocessing is not part of the network in the true sense. – Harsh Wardhan Jun 05 '17 at 13:16
  • Also note that I was talking from the perspective of images. But even in the case of images, the preprocessing is usually done on the CPU itself. BTW, GPU RAM these days is as large as CPU RAM if you're using a higher-end card like the Titan X. – Harsh Wardhan Jun 05 '17 at 13:19
1

This might actually be possible in Caffe2, but I have never tested it. In Caffe2, every blob and operator has a device assigned to it, and an operator runs on the device assigned to it. However, you would then need to take care of initialization and communication manually, because data_parallel_model in Caffe2 is only equipped for multi-GPU setups.
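For illustration, a minimal, untested sketch of what per-operator placement looks like at the proto level (the operator type and blob names here are made up; the identifiers mirror those used in the DeviceScope example further down this page):

from caffe2.proto import caffe2_pb2
from caffe2.python import workspace

# Untested sketch: every Caffe2 OperatorDef carries a device_option,
# so a device can be assigned operator by operator.
op = caffe2_pb2.OperatorDef()
op.type = "Relu"        # hypothetical operator
op.input.append("X")    # hypothetical input blob
op.output.append("Y")
op.device_option.device_type = workspace.GpuDeviceType  # GPU; use caffe2_pb2.CPU for CPU
op.device_option.device_id = 0                          # which GPU to run on

Pooya Davoodi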

1

Generally speaking, the answer is NO: you cannot configure the device for each layer independently, for the reasons Pooya Davoodi and Harsh Wardhan described.

However, if you look at specific layers, you might sometimes get the behavior you are looking for. For instance, if your solver is configured to run on the GPU but your net contains a layer that has no GPU implementation, then that layer will run on the CPU (with all the overhead described in Harsh Wardhan's answer).
One such layer is a "Python" layer: it runs only on the CPU, and you could place your word2vec implementation there.
Alternatively, you may write your own layers without a GPU implementation, ensuring they run only on the CPU.
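As a sketch (untested; the module and class names are made up), such a "Python" layer is declared in the net prototxt and implemented as a Python class:

layer {
  name: "my_py_layer"
  type: "Python"
  bottom: "data"
  top: "out"
  python_param {
    module: "my_module"   # hypothetical module on your PYTHONPATH
    layer: "MyLayer"      # hypothetical class, defined below
  }
}

# my_module.py
import caffe

class MyLayer(caffe.Layer):
    def setup(self, bottom, top):
        pass  # parse self.param_str here if the layer needs parameters

    def reshape(self, bottom, top):
        top[0].reshape(*bottom[0].data.shape)

    def forward(self, bottom, top):
        # CPU-only computation (e.g. a word2vec lookup) goes here
        top[0].data[...] = bottom[0].data

    def backward(self, top, propagate_down, bottom):
        pass  # no gradient in this sketch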


BTW, are you using caffe2? Are you okay with their PATENTS clause?!
UPDATE: it seems like fb decided to soften caffe2's license. Well done!

Shai
  • **"Are you okay with their PATENTS clause?!"** -- looks identical to Apache (everything Google releases), but IANAL. – MWB Jun 06 '17 at 06:19
  • @MaxB (1) IANAL either, but as far as I understand it is more restrictive and includes ANY patent dispute, even for patents not related to caffe2. (2) It is quite disturbing considering caffe1 does not have this restriction at all, and that caffe strength comes mainly from freely distributed *models* developed by the community and not FB. – Shai Jun 06 '17 at 06:23
0

Use core.DeviceScope with the relevant DeviceOption (CPU / GPU) device_type before creating the required node and its blobs.

A simple example:

from caffe2.python import workspace, model_helper
from caffe2.proto import caffe2_pb2
from caffe2.python import core
import numpy as np

m = model_helper.ModelHelper(name="my first net")
data = np.random.rand(16, 100).astype(np.float32)
gpu_device_id = 1   # id of the GPU to use
cpu_device_id = -1  # the CPU has no real device id
with core.DeviceScope(core.DeviceOption(workspace.GpuDeviceType, gpu_device_id)):
    with core.DeviceScope(core.DeviceOption(caffe2_pb2.CPU, cpu_device_id)):
        # Feed the input blob and create the parameters inside the CPU scope
        workspace.FeedBlob("data", data)
        weight = m.param_init_net.XavierFill([], 'fc_w', shape=[10, 100])
        bias = m.param_init_net.ConstantFill([], 'fc_b', shape=[10, ])
        # Create the CPU node
        fc_1 = m.net.FC(["data", "fc_w", "fc_b"], "fc1")
    # Create the GPU nodes (back in the outer, GPU, scope)
    pred = m.net.Sigmoid(fc_1, "pred")
    softmax, loss = m.net.SoftmaxWithLoss([pred, "label"], ["softmax", "loss"])

print(m.net.Proto())
  • Don't forget to do the same before feeding an op's blobs; otherwise, your op won't be able to reach them.

The output is:

name: "my first net"
op {
  input: "data"
  input: "fc_w"
  input: "fc_b"
  output: "fc1"
  name: ""
  type: "FC"
  device_option {
    device_type: 0
    device_id: -1
  }
}
op {
  input: "fc1"
  output: "pred"
  name: ""
  type: "Sigmoid"
  device_option {
    device_type: 1
    device_id: 1
  }
}
op {
  input: "pred"
  input: "label"
  output: "softmax"
  output: "loss"
  name: ""
  type: "SoftmaxWithLoss"
  device_option {
    device_type: 1
    device_id: 1
  }
}
external_input: "data"
external_input: "fc_w"
external_input: "fc_b"
external_input: "label"