Welcome to the War-on-Latency ( the art of shaving it off )
The experience you have described above is a vivid example of how accumulated latencies can devastate any chance of keeping a control-loop tight enough to indeed control something meaningfully stable, as in a MAN-to-MACHINE-INTERFACE system we wish to keep:
User's-motion | CAM-capture | IMG-processing | GUI-show | User's-visual-cortex-scene-capture | User's decision+action | loop
( Figure: a real-world case, where OpenCV profiling was used to "sense" how much time we actually spend in the respective acquisition-storage-transformation-postprocessing-GUI phases of the pipeline )

What latency-causing steps do we work with?
Forgive, for a moment, a raw sketch of where we accumulate each of the particular latency-related costs:
```
CAM-\____/                python code GIL-awaiting     ~ 100 [ms] chopping
|::|                      python code calling a cv2.<function>()
|::|        ______________________________________-----!!!!!!!-----------
|::|        ^                                 2x       NNNNN!!!!!!! MOVES DATA!
|::|        |                           per-call       NNNNN!!!!!!! 1. THERE
|::|        |                               COST       NNNNN!!!!!!! 2. BACK
|::|        |                     TTTT-openCV::MAT into python numpy.array
|::|        |                     //// forMAT TRANSFORMER    TRANSFORMATIONS
USBx        |                     ////                       TRANSFORMATIONS
|::|        |                     ////                       TRANSFORMATIONS
|::|        |                     ////                       TRANSFORMATIONS
|::|        |                     ////                       TRANSFORMATIONS
|::|        |                     ////                       TRANSFORMATIONS
H/W  oooo  _v____TTTT             in-RAM openCV::MAT storage TRANSFORMATIONS
 / \ oooo  ------                 openCV::MAT object-mapper
 \ / xxxx
  O/S---    °°°° xxxx
 driver """"     _____ xxxx
     \\\\        ^     xxxx ..... openCV {signed|unsigned}-{size}-{N-channels}
_____\\\\________|_____++++ ______________________________________________
openCV I/O       ^     PPPP       PROCESSING
as   F           |      ....      PROCESSING
  A              |       ...      PROCESSING
   S             |        ..      PROCESSING
    T            |         .      PROCESSING
     as          |          PPPP  PROCESSING
  possible ______v__________PPPP  openCV::MAT NATIVE-object PROCESSING
```
What latencies do we / can we fight ( here ) against?
Hardware latencies could be shaved off, yet changing already acquired hardware could turn expensive
Shaving software latencies off already latency-optimised toolboxes is possible, yet gets harder & harder
Design inefficiencies are the final & most common place where latencies can get shaved off
OpenCV ?
There is not much to do here. The problem is with the OpenCV-Python binding details:
... So when you call a function, say res = equalizeHist(img1,img2) in Python, you pass two numpy arrays and you expect another numpy array as the output. So these numpy arrays are converted to cv::Mat and then calls the equalizeHist() function in C++. Final result, res will be converted back into a Numpy array. So in short, almost all operations are done in C++ which gives us almost same speed as that of C++.
This works fine "outside" a control-loop, but not in our case, where both of the two transport costs, the transformation costs and any RAM-allocation costs for new or interim data storage worsen our control-loop TAT.

So avoid any and all calls of OpenCV-native functions from the Python-( behind the bindings' latency extra-miles )-side, no matter how tempting or sweet these may look at first sight. HUNDREDS-of-[ms] are a rather bitter cost of ignoring this advice.
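To see this overhead for yourself, time a single OpenCV call made from the Python side. Below is a minimal sketch, assuming a stand-in GRAY8 frame and using cv2.equalizeHist() as the probe - it only illustrates the order of magnitude of the per-call cost, it is not a rigorous benchmark:

```python
# A raw sketch only - the frame size and the probed function are assumptions,
# it merely "senses" the per-call numpy <-> cv::Mat round-trip + processing cost
import time
import numpy as np
import cv2

frame = np.random.randint( 0, 256, ( 480, 640 ), dtype = np.uint8 )   # stand-in GRAY8 frame

N  = 1000
t0 = time.perf_counter()
for _ in range( N ):
    res = cv2.equalizeHist( frame )    # each call moves data THERE ( numpy -> cv::Mat ) and BACK
t1 = time.perf_counter()

print( "avg per-call cost ~ {0:.1f} [us]".format( 1E6 * ( t1 - t0 ) / N ) )
```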
Python ?
Yes, Python. Using the Python interpreter introduces latency per se, plus it blocks truly concurrent processing, no matter how many cores our hardware operates on ( while recent Py3 releases try a lot to lower these costs below the interpreter level ).

We can test & squeeze the max out of the ( still unavoidable, in 2022 ) GIL-lock interleaving - check sys.getswitchinterval() and test increasing this amount to get less interleaved python-side processing ( the tweak depends on your other python-application ambitions - GUI, distributed-computing tasks, python network-I/O workloads, python-HW-I/O-s, if applicable, etc ), as the sketch below shows.
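A minimal sketch of that check-&-tweak step follows - the 0.05 [s] test value is only an illustrative assumption to be validated against your own mix of GUI / I/O threads:

```python
# A raw sketch - the new value is an assumption, tune it against your own application
import sys

print( "current GIL switch-interval ~", sys.getswitchinterval(), "[s]" )   # CPython default ~ 0.005 [s]

sys.setswitchinterval( 0.05 )   # fewer, longer uninterrupted slices -> less python-side interleaving
```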
RAM-memory-I/O costs ?
Our next major enemy. Using the least-sufficient image-DATA-format that MediaPipe can still work with is the way forward in this segment.
Avoidable losses
All other ( our ) sins belong to this segment. Avoid any image-DATA-format transformations ( see above, the cost may easily grow into HUNDREDS of THOUSANDS of [us], just for converting an already acquired-&-formatted-&-stored numpy.array into just another colourmap ).
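To put a number on such a sin, here is a minimal sketch, assuming a stand-in Full-HD BGR frame, that times one single, seemingly innocent, colourmap conversion:

```python
# A raw sketch - frame size & conversion are assumptions, it only "senses" one transformation's cost
import time
import numpy as np
import cv2

frame = np.random.randint( 0, 256, ( 1080, 1920, 3 ), dtype = np.uint8 )   # stand-in BGR frame

t0   = time.perf_counter()
gray = cv2.cvtColor( frame, cv2.COLOR_BGR2GRAY )                           # one extra transformation
t1   = time.perf_counter()

print( "one cvtColor() costs ~ {0:.0f} [us]".format( 1E6 * ( t1 - t0 ) ) )
```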
MediaPipe lists the enumerated formats it can work with:
```
// ImageFormat
SRGB:     sRGB, interleaved: one byte for R, then one byte for G,
          then one byte for B for each pixel.
SRGBA:    sRGBA, interleaved: one byte for R, one byte for G,
          one byte for B, one byte for alpha or unused.
SBGRA:    sBGRA, interleaved: one byte for B, one byte for G,
          one byte for R, one byte for alpha or unused.
GRAY8:    Grayscale, one byte per pixel.
GRAY16:   Grayscale, one uint16 per pixel.
SRGB48:   sRGB, interleaved, each component is a uint16.
SRGBA64:  sRGBA, interleaved, each component is a uint16.
VEC32F1:  One float per pixel.
VEC32F2:  Two floats per pixel.
```
So, choose the MVF -- the minimum viable format -- for the gesture-recognition to still work, and downscale the amount of pixels as much as possible ( a 400x600-GRAY8 would be my hot candidate ).
Pre-configure ( not missing the cv.CAP_PROP_FOURCC details ) the native-side OpenCV::VideoCapture processing to do no more than just plainly store this MVF in a RAW format on the native side of the Acquisition-&-Pre-processing chain, so that no other post-process formatting takes place.
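A minimal sketch of such pre-configuration follows - the backend choice, the "GREY" FOURCC and the 600x400 sizing are assumptions that have to be matched against what your actual camera & driver can deliver:

```python
# A raw sketch - device index, backend, FOURCC and sizing are assumptions to adapt
import cv2

cap = cv2.VideoCapture( 0, cv2.CAP_V4L2 )                                  # backend is an assumption

cap.set( cv2.CAP_PROP_FOURCC,       cv2.VideoWriter_fourcc( *"GREY" ) )    # ask the driver for RAW GRAY8
cap.set( cv2.CAP_PROP_FRAME_WIDTH,  600 )                                  # downscale at the source
cap.set( cv2.CAP_PROP_FRAME_HEIGHT, 400 )
cap.set( cv2.CAP_PROP_CONVERT_RGB,  0 )                                    # no native-side BGR post-processing

ok, frame = cap.read()                                                     # ought to arrive already as GRAY8
```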
If indeed forced to ever touch the python-side numpy.array object, prefer to use vectorised & striding-tricks powered operations over .view()-s or .data-buffers, so as to avoid any unwanted add-on latency costs that would increase the control-loop TAT.
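A minimal sketch, with assumed shapes, of such zero-copy, vectorised handling of an already stored frame:

```python
# A raw sketch - shapes are assumptions; views & in-place ops avoid new interim allocations
import numpy as np

frame  = np.zeros( ( 400, 600 ), dtype = np.uint8 )   # stand-in, already acquired GRAY8 frame

roi    = frame[ 100:300, 200:400 ]                     # a view - no data gets copied
as_u16 = frame.view( np.uint16 )                       # a re-interpreting view - again no copy

np.multiply( roi, 2, out = roi )                       # vectorised & in-place - no new allocation
```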
Options?
eliminate any python-side calls ( as these cost you ~ 2x the costs - the data-I/O there & back plus the transformations ) by precisely configuring the native-side OpenCV processing to match the needed MediaPipe data-format
minimise, better completely avoid, any blocking; if the control-loop is still too skewed, try using distributed-processing, moving the raw data into another process ( not necessarily a Python interpreter ) on localhost or within a sub-ms LAN domain ( further tips available here ) - a raw sketch follows below this list
try to fit the hot-DATA RAM-footprints to match your CPU-Cache Hierarchy cache-lines' sizing & associativity details ( see this )
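A raw sketch of the distributed-processing option, using the standard-library multiprocessing.shared_memory to hand a raw GRAY8 frame over to another localhost process without re-copying or pickling it ( the block name & sizing are assumptions ):

```python
# A raw sketch - producer side; name & sizes are assumptions, cleanup ( close/unlink ) omitted
import numpy as np
from multiprocessing import shared_memory

H, W  = 400, 600
shm   = shared_memory.SharedMemory( name = "raw_frame", create = True, size = H * W )

frame = np.ndarray( ( H, W ), dtype = np.uint8, buffer = shm.buf )   # producer writes frames in here

# the consumer process attaches zero-copy:
#   shm_c   = shared_memory.SharedMemory( name = "raw_frame" )
#   frame_c = np.ndarray( ( H, W ), dtype = np.uint8, buffer = shm_c.buf )
```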