Determine skeleton joints with a webcam (not Kinect)

Question

I'm trying to determine skeleton joints (or at the very least to be able to track a single palm) using a regular webcam. I've looked all over the web and can't seem to find a way to do so.

Every example I've found is using Kinect. I want to use a single webcam.

There's no need for me to calculate the depth of the joints - I just need to be able to recognize their X, Y position in the frame. Which is why I'm using a webcam, not a Kinect.

So far I've looked at:

OpenCV (the "skeleton" functionality in it is a process of simplifying graphical models, but it's not a detection and/or skeletonization of a human body).
OpenNI (with NiTE) - the only way to get the joints is to use the Kinect device, so this doesn't work with a webcam.

I'm looking for a C/C++ library (but at this point would look at any other language), preferably open source (but, again, will consider any license) that can do the following:

Given an image (a frame from a webcam) calculate the X, Y positions of the visible joints
[Optional] Given a video capture stream call back into my code with events for joints' positions
Doesn't have to be super accurate, but would prefer it to be very fast (sub-0.1 sec processing time per frame)

Would really appreciate it if someone can help me out with this. I've been stuck on this for a few days now with no clear path to proceed.

UPDATE

2 years later a solution was found: http://dlib.net/imaging.html#shape_predictor

This is really difficult with a single webcam, even more so in real time. Hence the Kinect. To only track a single palm you should be able to modify this real time tracker to do the job: http://www4.comp.polyu.edu.hk/~cslzhang/CT/CT.htm. IT works really well and their C++ code uses OpenCV. — Bull, Jun 15 '13 at 14:55
It would help if you would give a little bit more context, so we have an idea why it should absolutely not involve Kinect (and maybe suggest a viable alternative within the bounds of this context) — Grimace of Despair, Jun 24 '13 at 15:27
Since your using an infrared camera I imagine you have infrared LEDs somewhere? — Menelaos, Jul 02 '13 at 18:54
Hi, I just want to ask if you've been able to proceed with this. Currently I am also looking at skeletonization but can't use OpenNI or any other NI libraries targeted for Kinect use. Currently we've been able to proceed with our project using image processing and analysis based on data collected but I'd rather have skeleton tracking moving forward. — IBG, Jan 30 '14 at 09:14
So far... no :( The only thing that even came close (based on claims) was XTR3D, but they failed to deliver. Failed so miserably... Their code wouldn't even launch, and tech support was not only less than useful but turned out to be extremely rude and dishonest. Personally I vowed to never deal with that company again. — YePhIcK, Feb 02 '14 at 21:14
@YePhick Hi, I work at Extreme Reality as an Algorithms engineer, we have noticed your comment and we are sorry for your bad experiance. Please feel free to download our SDK for multiple platforms here (http://www.xtr3d.com/developers/sdk-download/) and contact support@xtr3d.com for any issue that may occur. We would love to help you out. Yonatan — Yonatan Simson, Dec 03 '15 at 08:18
@YonatanSimson thank you for your attention. I suppose it *has* been almost 2 years since then and the horrible aftertaste has dulled down a bit. I'll give it a go :) — YePhIcK, Dec 03 '15 at 16:31
Downloaded, installed, tried to compile the C++ sample (CConsoleSample) - **failed** for both Debug and Release (using MSVC 2015), uninstalled, **manually cleaned up** the clutter left behind. Vowed to never deal with XTR3D again. Thanks, but no thanks. — YePhIcK, Dec 03 '15 at 16:45
Currently our SDK doesn’t support vs2015, but nevertheless when building after the default installation of vs2015 I got an error - fatal error RC1015: cannot open include file 'afxres.h'. A quick Google search told me I had to install MFC for C++ (Programming Languages -> Visual C++ -> Microsoft Foundation Classes for C++), which I did, and the sample compiled without any more problems and ran. — Yonatan Simson, Dec 10 '15 at 15:59
I'm glad it worked for you. I have MFC installed (with the sources, too) and it didn't work for me. And considering the amount of time I have already wasted in the past I'm not going to take anything less than an effort-less process. I'm sorry to be such a pain but I'm trying to be as polite and as cooperative here as I can and avoiding the detailed account of the full range of frustration I have experienced when dealing with the XTR3D in the past. — YePhIcK, Dec 10 '15 at 20:38

Matěj Šmíd · Answer 1 · 2013-06-25T15:00:32.647

19

To track a hand using a single camera without depth information is a serious task and topic of ongoing scientific work. I can supply you a bunch of interesting and/or highly cited scientific papers on the topic:

M. de La Gorce, D. J. Fleet, and N. Paragios, “Model-Based 3D Hand Pose Estimation from Monocular Video.,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, Feb. 2011.
R. Wang and J. Popović, “Real-time hand-tracking with a color glove,” ACM Transactions on Graphics (TOG), 2009.
B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla, “Model-based hand tracking using a hierarchical Bayesian filter.,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 9, pp. 1372–84, Sep. 2006.
J. M. Rehg and T. Kanade, “Model-based tracking of self-occluding articulated objects,” in Proceedings of IEEE International Conference on Computer Vision, 1995, pp. 612–617.

Hand tracking literature survey in the 2nd chapter:

T. de Campos, “3D Visual Tracking of Articulated Objects and Hands,” 2006.

Unfortunately I don't know about some freely available hand tracking library.

edited Jun 25 '13 at 15:00

answered Jun 18 '13 at 14:21

Matěj Šmíd

766
4
11

2

I do not require a depth information - only the pixel position (or a center) of an object in camera's view. – YePhIcK Jun 18 '13 at 15:35
To track an articulated 3D object including position of its joints is to my knowledge usually done by recovering the complete 3D position and orientation. Simply you get also the depth even when you don't need it. – Matěj Šmíd Jun 25 '13 at 09:50
What you are describing requires a stereo vision, which is not what I have listed in the requirements (a single webcam) – YePhIcK Jun 25 '13 at 14:08
I thought that all of them were using a single camera, but some multi camera papers went through by mistake. I removed one that used multiple cameras and marked the thesis by Campos that includes possibly helpful literature survey. The rest is really a single view reconstruction of the hand pose and orientation. But the implementation would be hard and performance can be unsatisfactory for your application. – Matěj Šmíd Jun 25 '13 at 15:08
Due to current constraints I'm looking for an implemented solution that is ready-to-use – YePhIcK Jul 06 '13 at 22:48

samkhan13 · Answer 2 · 2013-07-01T03:36:36.463

there is a simple way for detecting hand using skin tone. perhaps this could help... you can see the results on this youtube video. caveat: the background shouldn't contain skin colored things like wood.

here is the code:

''' Detect human skin tone and draw a boundary around it.
Useful for gesture recognition and motion tracking.

Inspired by: http://stackoverflow.com/a/14756351/1463143

Date: 08 June 2013
'''

# Required moduls
import cv2
import numpy

# Constants for finding range of skin color in YCrCb
min_YCrCb = numpy.array([0,133,77],numpy.uint8)
max_YCrCb = numpy.array([255,173,127],numpy.uint8)

# Create a window to display the camera feed
cv2.namedWindow('Camera Output')

# Get pointer to video frames from primary device
videoFrame = cv2.VideoCapture(0)

# Process the video frames
keyPressed = -1 # -1 indicates no key pressed

while(keyPressed < 0): # any key pressed has a value >= 0

    # Grab video frame, decode it and return next video frame
    readSucsess, sourceImage = videoFrame.read()

    # Convert image to YCrCb
    imageYCrCb = cv2.cvtColor(sourceImage,cv2.COLOR_BGR2YCR_CB)

    # Find region with skin tone in YCrCb image
    skinRegion = cv2.inRange(imageYCrCb,min_YCrCb,max_YCrCb)

    # Do contour detection on skin region
    contours, hierarchy = cv2.findContours(skinRegion, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # Draw the contour on the source image
    for i, c in enumerate(contours):
        area = cv2.contourArea(c)
        if area > 1000:
            cv2.drawContours(sourceImage, contours, i, (0, 255, 0), 3)

    # Display the source image
    cv2.imshow('Camera Output',sourceImage)

    # Check for user input to close program
    keyPressed = cv2.waitKey(1) # wait 1 milisecond in each iteration of while loop

# Close window and camera after exiting the while loop
cv2.destroyWindow('Camera Output')
videoFrame.release()

the cv2.findContour is quite useful, you can find the centroid of a "blob" by using cv2.moments after u find the contours. have a look at the opencv documentation on shape descriptors.

i havent yet figured out how to make the skeletons that lie in the middle of the contour but i was thinking of "eroding" the contours till it is a single line. in image processing the process is called "skeletonization" or "morphological skeleton". here is some basic info on skeletonization.

here is a link that implements skeletonization in opencv and c++

here is a link for skeletonization in opencv and python

hope that helps :)

--- EDIT ----

i would highly recommend that you go through these papers by Deva Ramanan (scroll down after visiting the linked page): http://www.ics.uci.edu/~dramanan/

C. Desai, D. Ramanan. "Detecting Actions, Poses, and Objects with Relational Phraselets" European Conference on Computer Vision (ECCV), Florence, Italy, Oct. 2012.
D. Park, D. Ramanan. "N-Best Maximal Decoders for Part Models" International Conference on Computer Vision (ICCV) Barcelona, Spain, November 2011.
D. Ramanan. "Learning to Parse Images of Articulated Objects" Neural Info. Proc. Systems (NIPS), Vancouver, Canada, Dec 2006.

Thank you, that was helpful. Unfortunately doesn't work for my needs - I am using a near-IR wavelength and it's much-much harder to predict the "color" of the background. As for the skeletonization - I have looked at it (see my initial post) and so far I don't have a good feeling about it in terms of translating a human outline into a skeleton. That probably only works if I stand with my legs and hands spread ;) — YePhIcK, Jun 30 '13 at 00:49
nearIR is interesting, but is there a special reason to use that range of spectrum? a normal camera, i would suspect, should do the job. the alternative is to put "markers" on the joints that you are interested in and use a typical camera to detect them; using opencv you can draw a line between the detected points. there are ways of obtaining [3d information from single camera](http://stackoverflow.com/a/17088281/1463143). — samkhan13, Jun 30 '13 at 13:12
@YePhIcK some more info on articulated body parts added to answer :) — samkhan13, Jul 01 '13 at 03:40
The type is camera and color information is very important. That your using a near-IR wavelength camera should be added to the original question. — Menelaos, Jul 02 '13 at 18:45
@samkhan13 yes, there's a specific reason I'm using the hardware that I'm using. Can't get into that though as I'm under NDA. In my case "markers" are best to be avoided - the solution should be generic enough to be independent of skin color recognition, of markers being put on joints, be fast, and not require Haar (or any similar) training — YePhIcK, Jul 06 '13 at 22:52
Where did you get the values for `min_YCrCb` and `max_YCrCb`? Was it trial and error or did you read somewhere that those values work best? — Matt, Dec 25 '16 at 07:49
@MattD the values for the thresholds were originally inspired from: http://stackoverflow.com/a/14756351/1463143 — samkhan13, Dec 25 '16 at 14:00
I had the following error: ValueError: too many values to unpack I fixed the error with this post https://stackoverflow.com/questions/25504964/opencv-python-valueerror-too-many-values-to-unpack — Juan Zamora, May 16 '18 at 05:19

score 2 · Answer 3 · answered Jun 20 '13 at 15:36

The most common approach can be seen in the following youtube video. http://www.youtube.com/watch?v=xML2S6bvMwI

This method is not quite robust, as it tends to fail if the hand is rotated to much (eg; if the camera is looking at the side of the hand or at a partially bent hand).

If you do not mind using two camera's you can look into the work Robert Wang. His current company (3GearSystems) uses this technology, augmented with a kinect, to provide tracking. His original paper uses two webcams but has much worse tracking.

Wang, Robert, Sylvain Paris, and Jovan Popović. "6d hands: markerless hand-tracking for computer aided design." Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 2011.

Another option (again if using "more" than a single webcam is possible), is to use a IR emitter. Your hand reflects IR light quite well whereas the background does not. By adding a filter to the webcam that filters normal light (and removing the standard filter that does the opposite) you can create a quite effective hand tracking. The advantage of this method is that the segmentation of the hand from the background is much simpler. Depending on the distance and the quality of the camera, you would need more IR leds, in order to reflect sufficient light back into the webcam. The leap motion uses this technology to track the fingers & palms (it uses 2 IR cameras and 3 IR leds to also get depth information).

All that being said; I think the Kinect is your best option in this. Yes, you don't need the depth, but the depth information does make it a lot easier to detect the hand (using the depth information for the segmentation).

Thank you for your suggestions, but I'm specifically looking for a non-Kinect solution. Very specifically :) — YePhIcK, Jun 24 '13 at 05:08
Unfortunately, these don't exist within the parameters you've given. — Nallath, Jun 24 '13 at 06:51
@Nallath adobe uses face tracking and I thhkn partial limb tracking using only 1 webcam for adobe animate I'm pretty sure — B''H Bi'ezras -- Boruch Hashem, Jan 23 '20 at 05:13

score 2 · Answer 4 · answered Jun 24 '13 at 15:28

My suggestion, given your constraints, would be to use something like this: http://docs.opencv.org/doc/tutorials/objdetect/cascade_classifier/cascade_classifier.html

Here is a tutorial for using it for face detection: http://opencv.willowgarage.com/wiki/FaceDetection?highlight=%28facial%29|%28recognition%29

The problem you have described is quite difficult, and I'm not sure that trying to do it using only a webcam is a reasonable plan, but this is probably your best bet. As explained here (http://docs.opencv.org/modules/objdetect/doc/cascade_classification.html?highlight=load#cascadeclassifier-load), you will need to train the classifier with something like this:

http://docs.opencv.org/doc/user_guide/ug_traincascade.html

Remember: Even though you don't require the depth information for your use, having this information makes it easier for the library to identify a hand.

score 2 · Accepted Answer · answered Apr 08 '17 at 22:54

2

At last I've found a solution. Turns out a dlib open-source project has a "shape predictor" that, once properly trained, does exactly what I need: it guesstimates (with a pretty satisfactory accuracy) the "pose". A "pose" is loosely defined as "whatever you train it to recognize as a pose" by training it with a set of images, annotated with the shapes to extract from them.

The shape predictor is described in here on dlib's website

answered Apr 08 '17 at 22:54

YePhIcK

5,816
2
27
52

there is also pre trained models available, for example I used a frontal facial pose detector some time back. – Divij Sehgal Jul 14 '18 at 22:29
1

Definitely google once to find out if a model is already available that does what you want it to do. Essentially, its just trained feature weights. – Divij Sehgal Jul 14 '18 at 22:30

score 0 · Answer 6 · answered Jun 24 '13 at 13:34

0

I don't know about possible existing solutions. If supervised (or semi-supervised) learning is an option, training decision trees or neural networks might already be enough (kinect uses random forests from what i have heard). Before you go such a path, do everything you can to find an existing solution. Getting Machine Learning stuff right takes a lot of time and experimentation.

OpenCV has machine learning components, what you would need is training data.

answered Jun 24 '13 at 13:34

kutschkem

7,826
3
21
56

I've been playing with OpenCV's recognition components for a while now and have to say they tend to be quite bulky and not as accurate as I'd like them to be. Though so far that seems to be one of the very few viable options... Doesn't meet all the requirements I need, but at least comes somewhat close – YePhIcK Jun 24 '13 at 22:56

score 0 · Answer 7 · answered Jun 24 '13 at 14:30

0

With the motion tracking features of the open source Blender project it is possible to create a 3D model based on 2D footage. No kinect needed. Since blender is open source you might be able to use their pyton scripts outside the blender framework for your own purposes.

answered Jun 24 '13 at 14:30

Ruut

1,091
1
16
29

That link to YouTube you put in here is jaw-dropping, truly amazing. But completely irrelevant to what I need :( – YePhIcK Jun 24 '13 at 22:52
It uses structure from motion. It uses the fact that the object you want to "scan" is at a location/orientation compared to the camera at each frame to estimate depths. – Nallath Jun 28 '13 at 07:58
Once again - I don't need the depth (I do the depth myself using a different method), I just need to know "where" on the 2D image the object I'm looking for is :) – YePhIcK Jul 06 '13 at 22:46

score 0 · Answer 8 · answered Jul 01 '13 at 00:26

Have you ever heard about Eyesweb

I have been using it for one of my project and I though it might be usefull for what you want to achieve. Here are some interesting publication LNAI 3881 - Finger Tracking Methods Using EyesWeb and Powerpointing-HCI using gestures

Basically the workflow is:

You create your patch in EyesWeb
Prepare the datas you want to send with a network client
Use theses processed datas on your own server (your app)

However, I don't know if there is a way to embed the real time image processing part of Eyes Web into a soft as a library.

Determine skeleton joints with a webcam (not Kinect)

8 Answers8