
I'm trying to find the focal length, position and orientation of a camera in world space.

Because I need this to be resolution-independent, I normalized my image coordinates to the range [-1, 1] for x, and a somewhat smaller range for y (depending on the aspect ratio), so that (0, 0) is the center of the image. I've already corrected for lens distortion (using the k1 and k2 coefficients), so distortion doesn't enter the picture here, except that it sometimes pushes x or y slightly outside the [-1, 1] range.
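To make that concrete, a minimal sketch of such a normalization (the function name and the sample resolution below are just illustrative, not from my actual code):

```cpp
#include <utility>

// Map pixel coordinates to resolution-independent coordinates:
// dividing by half the width sends x to [-1, 1]; y then lands in
// [-h/w, h/w], a smaller range that preserves the aspect ratio.
std::pair<double, double> normalizePixel(double px, double py,
                                         double width, double height)
{
    const double halfW = width / 2.0;
    return { (px - width / 2.0) / halfW,
             (py - height / 2.0) / halfW };
}
```

For a 1920x1080 image, the top-left pixel maps to (-1, -0.5625) and the center to (0, 0).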

As a given, I have a planar, fixed rectangle in world space of known dimensions (in millimeters). The four corners of the rectangle are guaranteed to be visible, and are manually marked in the image. For example:

std::vector<cv::Point3f> worldPoints = {
    cv::Point3f(0, 0, 0),
    cv::Point3f(2000, 0, 0),
    cv::Point3f(0, 3000, 0),
    cv::Point3f(2000, 3000, 0),
};
std::vector<cv::Point2f> imagePoints = {
    cv::Point2f(-0.958707, -0.219624),
    cv::Point2f(-1.22234, 0.577061),
    cv::Point2f(0.0837469, -0.1783),
    cv::Point2f(0.205473, 0.428184),
};

Effectively, the equation I think I'm trying to solve is (see the equivalent in the OpenCV documentation):

  / xi \   / fx  0  0 \ /        tx \ / Xi \
s | yi | = |  0 fy  0 | |  Rxyz  ty | | Yi |
  \ 1  /   \  0  0  1 / \        tz / | Zi |
                                      \ 1  /

where:

  • i is 1, 2, 3, 4
  • xi, yi is the location of point i in the image (between -1 and 1)
  • fx, fy are the focal lengths of the camera in x and y direction
  • Rxyz is the 3x3 rotation matrix of the camera (has only 3 degrees of freedom)
  • tx, ty, tz is the translation of the camera
  • Xi, Yi, Zi is the location of point i in world space (millimeters)

So I have 8 equations (4 points of 2 coordinates each), and I have 8 unknowns (fx, fy, Rxyz, tx, ty, tz). Therefore, I conclude (barring pathological cases) that a unique solution must exist.

However, I can't seem to figure out how to compute this solution using OpenCV.

I have looked at the imgproc module:

  • getPerspectiveTransform works, but gives me a 3x3 matrix only (from 2D points to 2D points). If I could somehow extract the needed parameters from this matrix, that would be great.

I have also looked at the calib3d module, which contains a few promising functions that do almost, but not quite, what I need:

  • initCameraMatrix2D sounds almost perfect, but when I pass it my four points like this:

    cv::Mat cameraMatrix = cv::initCameraMatrix2D(
                std::vector<std::vector<cv::Point3f>>({worldPoints}),
                std::vector<std::vector<cv::Point2f>>({imagePoints}),
                cv::Size2f(2, 2), -1);
    

    it returns a camera matrix whose fx and fy are set to -inf and inf.

  • calibrateCamera seems to use a complicated solver to deal with overdetermined systems and outliers. I tried it anyway, but all I can get from it are assertion failures like this:

    OpenCV(3.4.1) Error: Assertion failed (0 <= i && i < (int)vv.size()) in getMat_, file /build/opencv/src/opencv-3.4.1/modules/core/src/matrix_wrap.cpp, line 79
    

Is there a way to entice OpenCV to do what I need? And if not, how could I do it by hand?

Thomas
  • Have you looked at the [calibration examples](https://github.com/opencv/opencv/blob/master/samples/cpp/calibration.cpp)? There are a couple in that directory. I think they use chess boards but perhaps you can shave down to a single square? – James Poag Jun 01 '18 at 11:47
  • Can be worth to look at a tutorial on camera calibration, for [instance](https://www.csie.ntu.edu.tw/~cyy/courses/vfx/18spring/lectures/handouts/lec09_calibration.pdf) for more theoretical information. – Catree Jun 01 '18 at 13:01
  • A course I meant. Usually, camera calibration is done using a flat chessboard pattern. See the reference paper mentioned in the doc: [A Flexible New Technique for Camera Calibration](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr98-71.pdf). It is also possible to use a 3D calibration rig but using a flat chessboard pattern is more convenient. – Catree Jun 01 '18 at 13:09
  • Thank you both. The cameras are mounted at remote locations several meters above the floor, and shipping large enough chessboards to all of them, and telling folks how to use those, is impractical. But, to restate: 8 equations, 8 unknowns, no overdetermined system, so I shouldn't need a chessboard at all. – Thomas Jun 02 '18 at 15:37
  • Have you considered posting that matrix equation on math.stackexchange.com? Btw I think you should add that `s` on the left, just as it is in the OpenCV docu. Otherwise fools like me will plug in your example points and wonder why there is no solution :) – sebrockm Jun 09 '18 at 00:56

1 Answer


3x3 rotation matrices have 9 elements but, as you said, only 3 degrees of freedom. One subtlety is that exploiting this property makes the equation non-linear in the angles you want to estimate, and non-linear equations are harder to solve than linear ones.

This kind of equation is usually solved by:

  1. treating the 12 entries of the P = K.[R | t] matrix as independent unknowns (11 degrees of freedom, since P is defined up to scale) and solving the resulting linear equation using an SVD (see Section 7.1 of 'Multiple View Geometry' by Hartley & Zisserman for more details)

  2. decomposing this intermediate result into an initial approximate solution to your non-linear equation (see for example cv::decomposeProjectionMatrix)

  3. refining the approximate solution using an iterative solver that can deal with non-linear equations and with the reduced degrees of freedom of the rotation matrix (e.g. the Levenberg-Marquardt algorithm). I am not sure if there is a generic implementation of this in OpenCV, however it is not too complicated to implement one yourself using the Ceres Solver library.

However, your case is a bit particular because you do not have enough point matches to solve the linear formulation (i.e. step 1) reliably. This means that, as you stated it, you have no way to initialize an iterative refining algorithm to get an accurate solution to your problem.

Here are a few work-arounds that you can try:

  • somehow get 2 additional point matches, leading to a total of 6 matches hence 12 constraints on your linear equation, allowing you to solve the problem using the steps 1, 2, 3 above.

  • somehow guess manually an initial estimate for your 8 parameters (2 focal lengths, 3 angles & 3 translations), and directly refine them using an iterative solver. Be aware that the iterative process might converge to a wrong solution if your initial estimate is too far off.

  • reduce the number of unknowns in your model. For instance, if you manage to fix two of the three angles (e.g. roll & pitch) the equations might simplify a lot. Also, the two focal lengths are probably related via the aspect ratio, so if you know it and if your pixels are square, then you actually have a single unknown there.

  • if all else fails, there might be a way to extract approximated values from the rectifying homography estimated by cv::getPerspectiveTransform.


Regarding the last bullet point, the opposite of what you want is clearly possible. Indeed, the rectifying homography can be expressed analytically knowing the parameters you want to estimate. See for instance this post and this post. There is also a full chapter on this in the Hartley & Zisserman book (chapter 13).

In your case, you want to go the other way around, i.e. to extract the intrinsic & extrinsic parameters from the homography. There is a somewhat related function in OpenCV (cv::decomposeHomographyMat), but it assumes the K matrix is known and it outputs 4 candidate solutions.

In the general case, this would be tricky. But maybe in your case you can guess a reasonable estimate for the focal length, hence for K, and use the point correspondences to select the good solution to your problem. You might also implement a custom optimization algorithm, testing many focal length values and keeping the solution leading to the lowest reprojection error.

BConic
  • Naive question: does the scale factor need to be estimated? I wonder if the perspective projection can also contribute to the non-linearity of the problem. Intuitively, since the perspective projection "removes" the depth of a scene, I would expect it. – Catree Jun 04 '18 at 21:00
  • @Catree there is no scale factor to estimate here since the world points are given. Regarding the perspective projection, you are right that it also makes the problem non-linear. However, using homogeneous coordinates allows us to keep a linear formulation. – BConic Jun 04 '18 at 21:38
  • Thanks! Coming up with an initial estimate manually is fine. OpenCV does come with a generic gradient descent solver. Although it doesn't support constraints like Ceres does, I think it should be good enough because the function seems pretty smooth and without local minima. I tried it yesterday, but somehow it doesn't find a perfect solution, as if it doesn't have enough degrees of freedom. By the way, I realized that due to my normalization of image coordinates, `fx = fy` so we have only 7 unknowns; maybe that's it. – Thomas Jun 05 '18 at 08:15
  • Your last bullet point sounds interesting. Isn't this pretty much what `cv::getPerspectiveTransform` does? Because that function works very reliably for me, but it says nothing about the `z` coordinate (it returns a 3x3 matrix that projects 2D points to 2D points). So how would I extract my parameters from that matrix? – Thomas Jun 05 '18 at 08:18
  • I wonder if there even exists a "real-world" camera setup for every matrix that `cv::getPerspectiveTransform` can produce. My intuition says this should be the case. This 3x3 matrix is determined up to a scaling factor, so it also has 8 degrees of freedom, just like a camera with translation, rotation, fx and fy. – Thomas Jun 05 '18 at 08:22
  • @AldurDisciple Thanks for your answer. I am still a little puzzled concerning what I call the "scale factor", the equation should not be `s(x,y,1)^T = K [R|t](X,Y,Z,1)^T`? Also, is it sufficient to use one view with a planar target (with 6 or more points) to fully estimate (in theory) all the parameters? I ask this because I am just wondering if with the classical chessboard calibration procedure, it is necessary to move the target only to better estimate the parameters or because otherwise with one view we cannot estimate all the parameters? Sorry for all my (annoying) questions. – Catree Jun 05 '18 at 09:10
  • @Catree No worries :) In the general case, computing K.[R|t].(X,Y,Z,1)^T will give a vector (a,b,c)^T with c!=1. The scale factor is then s=c so that x=a/c and y=b/c represent valid pixel coordinates. The equation between pixel points & world points is solved correctly using step 1 above (DLT). For the classical calibration procedure, it aims to estimate 5+5+N*(3+3)=6N+10 intrinsic+extrinsics parameters from N views of a 9x6 chessboard, i.e. 108N scalar measurements. Clearly there are more than enough measurements to solve the problem, but this enables reaching better estimates. – BConic Jun 05 '18 at 19:52
  • @Thomas I expanded a bit more the last bullet point. In the general case I think it is not easy to extract K, R & t from the homography matrix. But in your case you might be able to do it, depending on the guesses and approximations that are possible with your application. – BConic Jun 05 '18 at 20:43
  • I still haven't fully figured it out, but with such a comprehensive answer, you deserve the bounty anyway. I've been trying with OpenCV's gradient descent solver, and with Ceres as well. I just plugged in my own manual initial guess though. But it seems to be very sensitive to slight changes in the input: when I drag the input points `xi, yi` around, the solution jitters all over the place. Usually it's reasonable, sometimes completely skewed. Any thoughts on the matter? – Thomas Jun 10 '18 at 07:56
  • Glad I could help :) To answer your question, one thing that I think would definitely help is if you add more point correspondences. Currently, you have enough to solve the theoretical problem, but in practice we usually need more constraints to solve this kind of problem robustly. Another thing is that planar scenes can lead to degenerate cases for some problems, so if you can use 3D points that do not form a plane it might work better. Can you show some sample images? – BConic Jun 10 '18 at 18:49