I'm having trouble running inference on a model in Docker when the host has several cores. The model is exported via the PyTorch 1.0 ONNX exporter:
torch.onnx.export(pytorch_net, dummyseq, ONNX_MODEL_PATH)
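For reference, the export step end to end looks roughly like this; the embedding layer is just a toy stand-in for my real network, and the input shape is a placeholder:

import torch
import torch.nn as nn

# Toy stand-in for the real pytorch_net, only to make the snippet self-contained
pytorch_net = nn.Embedding(1000, 16)
pytorch_net.eval()

# Example input of the kind the model expects (shape/dtype are placeholders)
dummyseq = torch.tensor([[5, 100]], dtype=torch.long)

ONNX_MODEL_PATH = 'model.onnx'
torch.onnx.export(pytorch_net, dummyseq, ONNX_MODEL_PATH)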
Starting the model server (wrapped in Flask) with a single core yields acceptable performance (cpuset pins the process to specific CPUs):

docker run --rm -p 8081:8080 --cpus 0.5 --cpuset-cpus 0 my_container
Response from ab -c 1 -n 1000 http://0.0.0.0:8081/predict\?itemids\=5,100:
Percentage of the requests served within a certain time (ms)
50% 5
66% 5
75% 5
80% 5
90% 7
95% 46
98% 48
99% 49
But pinning it to four cores gives completely different stats for the same ab call:

docker run --rm -p 8081:8080 --cpus 0.5 --cpuset-cpus 0,1,2,3 my_container
Percentage of the requests served within a certain time (ms)
50% 9
66% 12
75% 14
80% 18
90% 62
95% 66
98% 69
99% 69
100% 77 (longest request)
Model inference is done like this; apart from this issue, it seems to work as expected. (This runs in a completely separate environment from the model export, of course.)
from caffe2.python import workspace
from caffe2.python.onnx.backend import Caffe2Backend as c2
from onnx import ModelProto


class Model:
    def __init__(self, onnx_file_path):
        self.predictor = Model.create_caffe2_predictor(onnx_file_path)

    @staticmethod
    def create_caffe2_predictor(onnx_file_path):
        # Parse the serialized ONNX graph and convert it to Caffe2 init/predict nets
        with open(onnx_file_path, 'rb') as onnx_model:
            onnx_model_proto = ModelProto()
            onnx_model_proto.ParseFromString(onnx_model.read())
        init_net, predict_net = c2.onnx_graph_to_caffe2_net(onnx_model_proto)
        predictor = workspace.Predictor(init_net, predict_net)
        return predictor

    def predict(self, numpy_array):
        # '0' is the name of the input blob in the exported graph
        return self.predictor.run({'0': numpy_array})
** wrapper flask app which calls Model.predict() on calls to /predict **
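For completeness, the wrapper is structured roughly like this (the model path, input parsing, and response formatting here are placeholders rather than my exact code):

import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = Model('/path/to/model.onnx')  # placeholder path

@app.route('/predict')
def predict():
    # e.g. /predict?itemids=5,100 -> 1 x N int64 array (dtype is an assumption)
    itemids = [int(i) for i in request.args['itemids'].split(',')]
    outputs = model.predict(np.array([itemids], dtype=np.int64))
    return jsonify([o.tolist() for o in outputs])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)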
OMP_NUM_THREADS=1 is also set in the container environment. It had some effect, but it does not resolve the underlying issue.
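A quick check I run inside the container to confirm the variable is actually visible to the serving process:

import os

# Should print '1' given the container environment described above
print(os.environ.get('OMP_NUM_THREADS'))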
The benchmark stats shown here were run on a local machine with 8 hyperthreads, so I shouldn't be saturating the machine and skewing the test. The same pattern shows up in my Kubernetes environment, where I'm also seeing a large amount of CFS (Completely Fair Scheduler) throttling.
Since I'm running in Kubernetes, there's no way for me to control how many CPUs the host exposes, and doing some sort of pinning there seems a bit hacky as well.
Is there any way to pin caffe2 model inference to a single processor? Am I doing something obviously wrong here? Is the caffe2.Predictor object not suited to this task?
Any help appreciated.
EDIT:
I've added the simplest possible reproducible example I can think of here, with a Docker container and run script included: https://github.com/NegatioN/Caffe2Struggles