
I'm building an application that uses the webcam to control video games (kind of like a Kinect). It uses the webcam (cv2.VideoCapture(0)), AI pose estimation (MediaPipe), and custom logic to pipe inputs into the Dolphin emulator.

The issue is the latency. I used my phone's high-speed camera to record myself snapping and found a latency of around 32 frames (~133 ms) between my hand and the frame on screen. This is before any additional code: just a loop with a video read and cv2.imshow (the loop itself takes about 15 ms).

Is there any way to decrease this latency?

I'm already grabbing the frame in a separate Thread, setting CAP_PROP_BUFFERSIZE to 0, and lowering the CAP_PROP_FRAME_HEIGHT and CAP_PROP_FRAME_WIDTH, but I still get ~133ms of latency. Is there anything else I can be doing?

Here's my code below:

from threading import Thread, Condition

import cv2


class WebcamStream:
    def __init__(self, src=0):
        self.stopped = False

        self.stream = cv2.VideoCapture(src)
        self.stream.set(cv2.CAP_PROP_BUFFERSIZE, 0)
        self.stream.set(cv2.CAP_PROP_FRAME_HEIGHT, 400)
        self.stream.set(cv2.CAP_PROP_FRAME_WIDTH, 600)

        # prime the stream so read() has a frame before the thread starts
        (self.grabbed, self.frame) = self.stream.read()

        self.hasNew = self.grabbed
        self.condition = Condition()

    def start(self):
        # daemon thread, so the grabber can't keep the process alive on exit
        Thread(target=self.update, args=(), daemon=True).start()
        return self

    def update(self):
        while True:
            if self.stopped:
                return

            (grabbed, frame) = self.stream.read()
            # publish the frame under the lock, so read() never sees a torn update
            with self.condition:
                (self.grabbed, self.frame) = (grabbed, frame)
                self.hasNew = True
                self.condition.notify_all()

    def read(self):
        # wait under the lock; a bare check-then-wait outside the lock can
        # miss a notification and block for a full extra frame
        with self.condition:
            while not self.hasNew:
                self.condition.wait()
            self.hasNew = False
            return self.frame

    def stop(self):
        self.stopped = True

The application needs to run as close to real time as possible, so any reduction in latency, no matter how small, would be great. Currently, between the webcam latency (~133 ms), the pose estimation and logic (~25 ms), and the actual time it takes to move into the correct pose, it racks up to about 350-400 ms of latency. Definitely not ideal when I'm trying to play a game.

EDIT: Here's the code I used to test the latency (running the code on my laptop, recording my hand and the screen with my phone, and counting the frame difference when snapping):

if __name__ == "__main__":
    cap = WebcamStream().start()
    while True:
        frame = cap.read()
        cv2.imshow('frame', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.stop()
    cv2.destroyAllWindows()
Yonah Karp
  • I think 133 ms is a pretty common latency for a USB webcam. You can look into the PlayStation Eye (PS3 Eye camera), which was designed for lower latency; with some tricks you can get it to run on a PC. Other than that, an industrial camera may be required – Kev1n91 Jan 05 '22 at 17:30
  • Can you show your loop? Are you using imshow with a waitKey > 1? You could decouple frame reading and imshow by threading, to empty the frame buffers as fast as possible and always display only the newest frame, no matter how many have been read. However, depending on what your camera does (e.g. encoding frames to JPG or H.264 before sending them to the PC), there might already be high latency from the camera and the transfer. What kind of webcam do you use? – Micka Jan 05 '22 at 19:13
  • @ChristophRackwitz Hi Christoph, you can actually run Python on a phone... I use **Pythonista** and **a-shell** on iPhone and iPad. – Mark Setchell Jan 05 '22 at 20:59
  • @ChristophRackwitz I'm running the code on my Mac. Using my phone's video to record my computer and hand and count the frame difference when snapping – Yonah Karp Jan 05 '22 at 21:03
  • @Micka Edited to add the loop. I'm using my MacBook's built-in FaceTime HD camera – Yonah Karp Jan 05 '22 at 21:04
  • @ChristophRackwitz I believe you misunderstand what is happening. The phone (iPhone) is not running any Python code at all, nor is its camera accessed by the code. The camera was just used to test the latency as described above – Yonah Karp Jan 05 '22 at 21:37
  • 0.133s is 2 frames at 15 fps or 4 frames at 30 fps. not good but also not excessive. -- again, which camera is the subject of this discussion? you say "built in Facetime HD camera"... is that it? exact model please. I'm guessing there have been dozens of different cameras throughout the years in dozens of apple products that call themselves "macbook". what latency do you get when you **don't** use custom code, but VLC or ffmpeg or anything proven? – Christoph Rackwitz Jan 05 '22 at 23:25

1 Answer


Welcome to the War-on-Latency ( shaving-off )

The experience you have described above is a bright example of how accumulated latencies can devastate any chance of keeping a control-loop tight enough to indeed control something in a meaningfully stable fashion, as in the MAN-to-MACHINE-INTERFACE system we wish to keep:

User's-motion | CAM-capture | IMG-processing | GUI-show | User's-visual-cortex-scene-capture | User's decision+action | loop

A real-world situation, where OpenCV profiling was used to "sense" how much time we spend in the respective actual phases of the acquisition-storage-transformation-postprocessing-GUI pipeline ( zoom in as needed ):

[ figure: timing profile of the acquisition-storage-transformation-postprocessing-GUI pipeline phases ]


What latency-causing steps do we work with?

Forgive, for a moment, a raw sketch of where we accumulate each of the particular latency-related costs:


   CAM \____/                                     python code GIL-awaiting ~ 100 [ms] chopping
        |::|                                      python code calling a cv2.<function>()
        |::|   __________________________________________-----!!!!!!!-----------
        |::|    ^     2x                                 NNNNN!!!!!!!MOVES DATA!
        |::|    | per-call                               NNNNN!!!!!!!    1.THERE
        |::|    |     COST                               NNNNN!!!!!!!    2.BACK
        |::|    |           TTTT-openCV::MAT into python numpy.array
        |::|    |          ////       forMAT TRANSFORMER TRANSFORMATIONS
        USBx    |         ////                           TRANSFORMATIONS
        |::|    |        ////                            TRANSFORMATIONS
        |::|    |       ////                             TRANSFORMATIONS
        |::|    |      ////                              TRANSFORMATIONS
        |::|    |     ////                               TRANSFORMATIONS
    H/W oooo   _v____TTTT in-RAM openCV::MAT storage     TRANSFORMATIONS
       /    \        oooo ------ openCV::MAT object-mapper
       \    /        xxxx
 O/S--- °°°°         xxxx
 driver """" _____   xxxx
         \\\\    ^   xxxx ...... openCV {signed|unsigned}-{size}-{N-channels}
 _________\\\\___|___++++ __________________________________________
 openCV I/O      ^   PPPP                                 PROCESSING
            as F |   ....                                 PROCESSING
               A |   ...                                  PROCESSING
               S |   ..                                   PROCESSING
               T |   .                                    PROCESSING
            as   |   PPPP                                 PROCESSING
      possible___v___PPPP _____ openCV::MAT NATIVE-object PROCESSING


What latencies do we / can we fight ( here ) against?

Hardware could help with latencies, yet changing already-acquired hardware can turn expensive

Shaving software latencies off already latency-optimised toolboxes is possible, yet it gets harder & harder

Design inefficiencies are the final & most common place where latencies can get shaved off


OpenCV ?
There is not much to do here. The problem is with the OpenCV-Python binding details:

... So when you call a function, say res = equalizeHist(img1,img2) in Python, you pass two numpy arrays and you expect another numpy array as the output. So these numpy arrays are converted to cv::Mat and then calls the equalizeHist() function in C++. Final result, res will be converted back into a Numpy array. So in short, almost all operations are done in C++ which gives us almost same speed as that of C++.

This works fine "outside" a control-loop, but not in our case, where both of the two transport costs, the transformation costs, and any RAM-allocation costs for new or interim data storage result in worsening our control-loop TAT.

So avoid any and all calls of OpenCV-native functions from the Python-side ( behind the bindings' latency extra-miles ) inside the control-loop, no matter how tempting or sweet these may look at first sight.

HUNDREDS-of-[ms] are a rather bitter cost of ignoring this advice.
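
As a quick way to "sense" this cost, one can time the quoted equalizeHist() round-trip from the Python side. A minimal sketch, where the random 400x600 image is just a stand-in assumption for an already acquired frame, not your camera data:

import time
import cv2
import numpy as np

# stand-in for an already acquired 400x600 BGR frame ( an assumption for the sketch )
frame = np.random.randint(0, 256, (400, 600, 3), dtype=np.uint8)
gray  = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

t0 = time.perf_counter()
for _ in range(100):
    # each call pays: numpy -> cv::Mat, C++ work, cv::Mat -> numpy, plus a fresh RAM allocation
    res = cv2.equalizeHist(gray)
t1 = time.perf_counter()
print(f"per-call cost ~{(t1 - t0) / 100 * 1e6:,.0f} [us]")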


Python ?
Yes, Python. Using the Python interpreter introduces latency per se, plus it adds problems with concurrency-avoided processing, no matter how many cores our hardware operates on ( while recent Py3 tries a lot to lower these costs below the interpreter-level software ).

We can test & squeeze the max out of the ( still unavoidable, in 2022 ) GIL-lock interleaving: check sys.getswitchinterval() and test increasing this amount for having less interleaved python-side processing ( the tweaking depends on your python-application's other ambitions ( GUI, tasks, python network-I/O workloads, python-HW-I/O-s, if applicable, etc ) )
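
A minimal probe of this knob ( the 0.010 [s] value below is just an illustrative assumption; tune it against your own workload ):

import sys

print(sys.getswitchinterval())    # CPython's default is 0.005 [s]
sys.setswitchinterval(0.010)      # fewer forced thread switches -> less GIL-interleaving chopping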


RAM-memory-I/O costs ?
Our next major enemy. Using the least-sufficient image-DATA-format that MediaPipe can work with is the way forward in this segment.


Avoidable losses
All other ( our ) sins belong to this segment. Avoid any and all image-DATA-format transformations ( see above: the cost can easily grow into HUNDREDS of THOUSANDS of [us], just for converting an already acquired-&-formatted-&-stored numpy.array into just another colourmap )

MediaPipe
lists the enumerated formats it can work with:

 // ImageFormat

  SRGB: sRGB, interleaved:   one byte for R,
                        then one byte for G,
                        then one byte for B for each pixel.

  SRGBA: sRGBA, interleaved: one byte for R,
                             one byte for G,
                             one byte for B,
                             one byte for alpha or unused.

  SBGRA: sBGRA, interleaved: one byte for B,
                             one byte for G,
                             one byte for R,
                             one byte for alpha or unused.

  GRAY8:        Grayscale,   one byte per pixel.

  GRAY16:       Grayscale,   one uint16 per pixel.

  SRGB48:  sRGB, interleaved,  each component is a uint16.

  SRGBA64: sRGBA, interleaved, each component is a uint16.

  VEC32F1:                   One float per pixel.

  VEC32F2:                   Two floats per pixel.

So, choose the MVF -- the minimum viable format -- that still lets the gesture-recognition work, and downscale the number of pixels as much as possible ( 400x600-GRAY8 would be my hot candidate )

Pre-configure ( not missing the cv.CAP_PROP_FOURCC details ) the native-side OpenCV::VideoCapture processing to do no more than plain storing of this MVF in a RAW format on the native side of the Acquisition-&-Pre-processing chain, so that no other post-process formatting takes place.
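
A minimal sketch of such a pre-configuration ( the 'YUYV' FOURCC is an assumption; cameras & drivers are free to ignore any of these requests, so always verify what you actually received ):

import cv2

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*'YUYV'))  # request a RAW-ish stream, driver permitting
cap.set(cv2.CAP_PROP_FRAME_WIDTH,  600)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 400)
cap.set(cv2.CAP_PROP_BUFFERSIZE,   1)                          # keep at most one queued, stale frame
cap.set(cv2.CAP_PROP_CONVERT_RGB,  0)                          # skip the native BGR conversion, if honoured

ok, raw = cap.read()
print(ok, None if not ok else (raw.shape, raw.dtype))          # verify what the driver actually delivered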

If indeed forced to ever touch the python-side numpy.array object, prefer vectorised & striding-tricks-powered operations working over .view()-s or .data-buffers, so as to avoid any unwanted add-on latency costs increasing the control-loop TAT.
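
For illustration, a zero-copy way to pull a rough single-channel, downscaled picture out of an already stored BGR frame; both steps below produce views, not copies ( using the green channel as a luma stand-in is an assumption for the sketch ):

import numpy as np

frame = np.zeros((400, 600, 3), dtype=np.uint8)   # stand-in for an already acquired BGR frame

green = frame[:, :, 1]            # a strided view into the same RAM, no pixel data moved
small = green[::2, ::2]           # 2x2 decimation, still just a view

assert small.base is not None     # both slices share frame's buffer; no copy was made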


Options?

  • eliminate any python-side calls ( as these cost you the data-I/O + transformation costs twice: once THERE, once BACK ) by precisely configuring the native-side OpenCV processing to match the needed MediaPipe data-format

  • minimise, or better avoid, any blocking; if the control-loop is still too skewed, try moving the raw-data into another process ( not necessarily a Python-interpreter ) on localhost or within a sub-ms LAN domain ( further tips available here ); a shared-memory sketch of such a hand-off follows after this list

  • try to fit the hot-DATA RAM-footprints to match your CPU-Cache Hierarchy's cache-line sizing & associativity details ( see this )
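
And the shared-memory hand-off mentioned above, as a minimal sketch ( assumes Python 3.8+ and a fixed 400x600 GRAY8 frame geometry; real code needs synchronisation around the slot, omitted here for brevity ):

import numpy as np
from multiprocessing import shared_memory

SHAPE, DTYPE = (400, 600), np.uint8                      # assumed GRAY8 geometry

# producer side: one shared frame-slot, published into with zero pickling
shm  = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)))
slot = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
slot[:] = 127                                            # stand-in for np.copyto(slot, frame)

# consumer side ( normally another process ): attach by name, read in place
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm2.buf)
print(view.mean())

shm2.close()
shm.close()
shm.unlink()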

user3666197