I'm having trouble running inference on a model in Docker when the host has several cores. The model is exported via the PyTorch 1.0 ONNX exporter:
torch.onnx.export(pytorch_net, dummyseq, ONNX_MODEL_PATH)
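For reference, the export step end to end looks roughly like this; the embedding layer is just a toy stand-in for my real network, and the input shape is a placeholder:

import torch
import torch.nn as nn

# Toy stand-in for the real pytorch_net, only to make the snippet self-contained
pytorch_net = nn.Embedding(1000, 16)
pytorch_net.eval()

# Example input of the kind the model expects (shape/dtype are placeholders)
dummyseq = torch.tensor([[5, 100]], dtype=torch.long)

ONNX_MODEL_PATH = 'model.onnx'
torch.onnx.export(pytorch_net, dummyseq, ONNX_MODEL_PATH)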
Starting the model server (wrapped in Flask) with a single core yields acceptable performance (cpuset pins the process to specific CPUs):

docker run --rm -p 8081:8080 --cpus 0.5 --cpuset-cpus 0 my_container
Response from ab -c 1 -n 1000 http://0.0.0.0:8081/predict\?itemids\=5,100:
Percentage of the requests served within a certain time (ms)
50% 5
66% 5
75% 5
80% 5
90% 7
95% 46
98% 48
99% 49
But pinning it to four cores gives completely different stats for the same ab call:

docker run --rm -p 8081:8080 --cpus 0.5 --cpuset-cpus 0,1,2,3 my_container
Percentage of the requests served within a certain time (ms)
50% 9
66% 12
75% 14
80% 18
90% 62
95% 66
98% 69
99% 69
100% 77 (longest request)
Model inference is done like this; apart from this issue, it seems to work as expected. (This runs in a completely separate environment from the model export, of course.)
from caffe2.python import workspace
from caffe2.python.onnx.backend import Caffe2Backend as c2
from onnx import ModelProto


class Model:
    def __init__(self, onnx_file_path):
        self.predictor = Model.create_caffe2_predictor(onnx_file_path)

    @staticmethod
    def create_caffe2_predictor(onnx_file_path):
        # Parse the serialized ONNX graph and convert it to Caffe2 init/predict nets
        with open(onnx_file_path, 'rb') as onnx_model:
            onnx_model_proto = ModelProto()
            onnx_model_proto.ParseFromString(onnx_model.read())
        init_net, predict_net = c2.onnx_graph_to_caffe2_net(onnx_model_proto)
        predictor = workspace.Predictor(init_net, predict_net)
        return predictor

    def predict(self, numpy_array):
        # '0' is the name of the input blob in the exported graph
        return self.predictor.run({'0': numpy_array})
** wrapper flask app which calls Model.predict() on calls to /predict **
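For completeness, the wrapper is structured roughly like this (the model path, input parsing, and response formatting here are placeholders rather than my exact code):

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = Model('/path/to/model.onnx')  # placeholder path

@app.route('/predict')
def predict():
    # e.g. /predict?itemids=5,100 -> 1 x N int64 array (dtype is an assumption)
    itemids = [int(i) for i in request.args['itemids'].split(',')]
    outputs = model.predict(np.array([itemids], dtype=np.int64))
    return jsonify([o.tolist() for o in outputs])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)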
OMP_NUM_THREADS=1 is also set in the container environment. It had some effect, but it does not resolve the underlying issue.
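A quick check I run inside the container to confirm the variable is actually visible to the serving process:

import os

# Should print '1' given the container environment described above
print(os.environ.get('OMP_NUM_THREADS'))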
The benchmark stats shown here were run on a local machine with 8 hyperthreads, so I shouldn't be saturating the machine and skewing the test. The same pattern shows up in my Kubernetes environment, where I'm also seeing a large amount of CFS (Completely Fair Scheduler) throttling.
Since I'm running in Kubernetes, there's no way for me to control how many CPUs the host exposes, and doing some sort of pinning there seems a bit hacky as well.
Is there any way to pin caffe2 model inference to a single processor? Am I doing something obviously wrong here? Is the caffe2.Predictor object not suited to this task?
Any help appreciated.
EDIT:
I've added the simplest possible reproducible example I can think of here, with a Docker container and run script included: https://github.com/NegatioN/Caffe2Struggles