I'm trying to benchmark Windows ML against other backends and I'm seeing a weird bimodal distribution of inference times (see plot). This is with the CPU backend on the ARM64 architecture; on ARM there is no bimodal distribution.
I don't have a good intuition for why there are two modes in the distribution of inference times. There doesn't seem to be any temporal correlation: I run the network once per second, and it switches between the "slow" and "fast" modes seemingly at random.
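For reference, the measurement loop is essentially the following simplified sketch (the model path, input name, and tensor shape are placeholders for my actual network):

```cpp
#include <winrt/Windows.Foundation.h>
#include <winrt/Windows.Foundation.Collections.h>
#include <winrt/Windows.AI.MachineLearning.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

using namespace winrt;
using namespace winrt::Windows::AI::MachineLearning;

int main()
{
    init_apartment();

    // Placeholder model; the real network is an ONNX model loaded the same way.
    auto model = LearningModel::LoadFromFilePath(L"model.onnx");
    LearningModelDevice device(LearningModelDeviceKind::Cpu);   // CPU backend
    LearningModelSession session(model, device);

    // Dummy input; shape and input name must match the model.
    std::vector<float> data(1 * 3 * 224 * 224, 0.5f);
    auto input = TensorFloat::CreateFromArray({ 1, 3, 224, 224 }, data);

    for (int i = 0; i < 100; ++i)
    {
        LearningModelBinding binding(session);
        binding.Bind(L"input", input);

        auto t0 = std::chrono::high_resolution_clock::now();
        session.Evaluate(binding, L"");                         // synchronous evaluation
        auto t1 = std::chrono::high_resolution_clock::now();

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("run %3d: %.2f ms\n", i, ms);

        std::this_thread::sleep_for(std::chrono::seconds(1));   // one evaluation per second
    }
}
```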
One guess is that Windows ML sometimes decides to use two threads and sometimes only one, possibly depending on estimated device load. However, unlike with TensorFlow Lite or Caffe2, I haven't found a way to control the number of threads Windows ML uses (see the TensorFlow Lite sketch at the end of this post for the kind of control I mean). So the question is:
Is there a way to control the number of threads Windows ML uses for evaluation in CPU mode, or is it guaranteed to use only one thread for computation in all cases?
Other pointers to what could cause this weird behavior are also welcome.
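For comparison, this is the kind of per-interpreter thread control I mean, roughly as it looks in TensorFlow Lite (the model path is a placeholder):

```cpp
#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main()
{
    // Placeholder model path.
    auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");

    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // This is the knob I'm missing in Windows ML: pin inference to one thread.
    interpreter->SetNumThreads(1);

    interpreter->AllocateTensors();
    interpreter->Invoke();
}
```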