
I'm trying to calculate a new camera position based on the motion between corresponding images. The images conform to the pinhole camera model.

So far I don't get useful results, so I'll describe my procedure and hope that somebody can help me.

I detect and describe features in the corresponding images with SIFT, match them with OpenCV's FlannBasedMatcher, and calculate the fundamental matrix with OpenCV's findFundamentalMat (method RANSAC).
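That step looks roughly like this (a simplified sketch of my code, assuming a recent OpenCV where SIFT::create() is available in the main module; img1 and img2 are the two input images):

#include <opencv2/opencv.hpp>
using namespace cv;

// detect and describe features with SIFT in both images
Ptr<SIFT> sift = SIFT::create();
std::vector<KeyPoint> kp1, kp2;
Mat desc1, desc2;
sift->detectAndCompute(img1, noArray(), kp1, desc1);
sift->detectAndCompute(img2, noArray(), kp2, desc2);

// match the descriptors with FLANN
FlannBasedMatcher matcher;
std::vector<DMatch> matches;
matcher.match(desc1, desc2, matches);

// collect the corresponding pixel positions
std::vector<Point2f> pts1, pts2;
for (const DMatch &m : matches) {
    pts1.push_back(kp1[m.queryIdx].pt);
    pts2.push_back(kp2[m.trainIdx].pt);
}

// fundamental matrix with RANSAC; inlierMask marks the inliers
Mat inlierMask;
Mat F = findFundamentalMat(pts1, pts2, FM_RANSAC, 3.0, 0.99, inlierMask);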

Then I calculate the essential matrix using the camera intrinsic matrix K:

Mat E = K.t() * F * K;

I decompose the essential matrix into rotation and translation using singular value decomposition:

SVD decomp = SVD(E);
Matx33d W(0,-1,0,
          1,0,0,
          0,0,1);
Matx33d Wt(0,1,0,
          -1,0,0,
           0,0,1);
R1 = decomp.u * Mat(W) * decomp.vt;   // first rotation hypothesis
R2 = decomp.u * Mat(Wt) * decomp.vt;  // second rotation hypothesis
t1 = decomp.u.col(2);  // u3
t2 = -decomp.u.col(2); // -u3

Then I try to find the correct solution by triangulation. (This part is from http://www.morethantechnical.com/2012/01/04/simple-triangulation-with-opencv-from-harley-zisserman-w-code/ so I think it should work correctly.)

The new position is then calculated with:

new_pos = old_pos + -R.t()*t;

where new_pos and old_pos are 3x1 vectors, R is the 3x3 rotation matrix, and t is the 3x1 translation vector.

Unfortunately I get no useful results, so maybe someone has an idea of what could be wrong.

Here are some results (just in case someone can confirm that any of them is definitely wrong):

F = [8.093827077399547e-07, 1.102681999632987e-06, -0.0007939604310854831;
     1.29246107737264e-06, 1.492629957878578e-06, -0.001211264339006535;
     -0.001052930954975217, -0.001278667878010564, 1]

K = [150, 0, 300;
    0, 150, 400;
    0, 0, 1]

E = [0.01821111092414898, 0.02481034499174221, -0.01651092283654529;
     0.02908037424088439, 0.03358417405226801, -0.03397110489649674;
     -0.04396975675562629, -0.05262169424538553, 0.04904210357279387]

t = [0.2970648246214448; 0.7352053067682792; 0.6092828956013705]

R = [0.2048034356172475, 0.4709818957303019, -0.858039396912323;
     -0.8690270040802598, -0.3158728880490416, -0.3808101689488421;
     -0.4503860776474556, 0.8236506374002566, 0.3446041331317597]
pichsenmeister
  • There is one more mistake in your computation. `SVD decomp = SVD(E);` is OK, but you have to compute a new `newE = U*diag(1,1,0)*Vt` and then again compute `SVD decomp2 = SVD(newE);`. – who9vy May 27 '13 at 13:57
  • Interesting, I never read about that. So I have to calculate R and t with decomp2? Btw: thanks for your detailed answer. I have to check all the things and will respond as soon as possible. – pichsenmeister May 27 '13 at 14:04
  • Yes, you have to compute R and t with decomp2. A detailed description is given here (pp 257-260) http://www.robots.ox.ac.uk/~vgg/hzbook/hzbook2/HZepipolar.pdf – who9vy May 27 '13 at 15:41

2 Answers


First of all you should check if

x' * F * x = 0

for your point correspondences x' and x. Of course, this should only hold for the inliers of the fundamental matrix estimation with RANSAC.
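As a sketch (using the point vectors and inlier mask from the matching step in the question, with the points written as homogeneous pixel coordinates):

// check that x'^T * F * x is close to zero for every RANSAC inlier
for (size_t i = 0; i < pts1.size(); ++i) {
    if (!inlierMask.at<uchar>((int)i)) continue;           // skip outliers
    Mat x  = (Mat_<double>(3,1) << pts1[i].x, pts1[i].y, 1.0);
    Mat xp = (Mat_<double>(3,1) << pts2[i].x, pts2[i].y, 1.0);
    double residual = Mat(xp.t() * F * x).at<double>(0,0);
    // residual should be close to zero, but not exactly zero due to noise
}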

Thereafter, you have to transform your point correspondences to normalized camera coordinates (NCC) like this

xn = inv(K) * x
xn' = inv(K') * x'

where K' is the intrinsic camera matrix of the second image and x' are the points of the second image. I think in your case it is K = K'.
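In code this is just a multiplication with the inverse intrinsics (sketch; x and xp are the 3x1 homogeneous pixel coordinates from above, and K is a CV_64F Mat):

// normalized camera coordinates, assuming K == K'
Mat Kinv = K.inv();
Mat xn  = Kinv * x;   // normalized point from the first image
Mat xnp = Kinv * xp;  // corresponding normalized point from the second image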

With these NCCs you can decompose your essential matrix as you described. You triangulate the normalized camera coordinates and check the depth of your triangulated points. But be careful: in the literature it is said that one point is sufficient to get the correct rotation and translation. From my experience you should check a few points, since a single point can be an outlier even after RANSAC.
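A sketch of that depth (cheirality) check, assuming X is a triangulated 3x1 point in the first camera's coordinate frame and R, t are one of the four candidate pairs (both CV_64F Mats):

// the triangulated point must lie in front of both cameras
bool inFrontOfBoth(const Mat &X, const Mat &R, const Mat &t)
{
    double z1 = X.at<double>(2);        // depth in the first camera
    Mat X2 = R * X + t;                 // the same point in the second camera's frame
    double z2 = X2.at<double>(2);       // depth in the second camera
    return z1 > 0 && z2 > 0;
}

Run this for a handful of triangulated inliers per candidate (R1, t1), (R1, t2), (R2, t1), (R2, t2) and keep the combination for which most points are in front of both cameras.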

Before you decompose the essential matrix, make sure that E = U*diag(1,1,0)*Vt. This condition is required to get correct results for the four possible choices of the projection matrix.
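One way to enforce this (sketch):

// project E onto the set of valid essential matrices: E_new = U * diag(1,1,0) * Vt
SVD svd(E);
Mat D = Mat::zeros(3, 3, CV_64F);
D.at<double>(0,0) = 1.0;
D.at<double>(1,1) = 1.0;      // the third singular value is forced to zero
Mat E_new = svd.u * D * svd.vt;

// then decompose E_new (not the original E) into R1/R2 and t1/t2 as before
SVD decomp2(E_new);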

When you've got the correct rotation and translation you can triangulate all your point correspondences (the inliers of the fundamental matrix estimation with RANSAC). Then, you should compute the reprojection error. Firstly, you compute the reprojected position like this

xp = K * P * X
xp' = K' * P' * X

where X is the computed (homogeneous) 3D position. P and P' are the 3x4 projection matrices. The projection matrix P of the first camera is normally [I | 0], i.e. the identity rotation with zero translation. P' = [R | t] has the rotation matrix in the first three columns and the translation in the fourth column, so that P' is also a 3x4 matrix. This only works if you transform your 3D position to homogeneous coordinates, i.e. a 4x1 vector instead of 3x1. Then xp and xp' are also homogeneous coordinates representing the (reprojected) 2D positions of your corresponding points.
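A sketch of that, assuming X is a 4x1 homogeneous point from the triangulation, K == K', and R, t are the selected 3x3 and 3x1 CV_64F Mats:

// P = [I | 0] for the first camera, P' = [R | t] for the second camera
Mat P  = Mat::eye(3, 4, CV_64F);
Mat Pp = Mat::zeros(3, 4, CV_64F);
R.copyTo(Pp(Rect(0, 0, 3, 3)));   // first three columns: rotation
t.copyTo(Pp(Rect(3, 0, 1, 3)));   // fourth column: translation

// reproject the homogeneous 3D point X into both images
Mat xp_reproj  = K * P  * X;      // first image
Mat xpp_reproj = K * Pp * X;      // second image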

I think the

new_pos = old_pos + -R.t()*t;

is incorrect since, firstly, you only translate old_pos without rotating it and, secondly, you translate it with the wrong vector. The correct way is given above.

So, after you computed the reprojected points you can calculate the reprojection error. Since you are working with homogeneous coordinates you have to normalize them (xp = xp / xp(2), divide by last coordinate). This is given by

error = (x(0)-xp(0))^2 + (x(1)-xp(1))^2
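Put together (sketch; x is the measured pixel position as a 3x1 homogeneous vector and xp_reproj is the reprojected point from above):

// normalize the reprojected point (divide by its last coordinate)
xp_reproj = xp_reproj / xp_reproj.at<double>(2);

// squared reprojection error in pixels
double dx = x.at<double>(0) - xp_reproj.at<double>(0);
double dy = x.at<double>(1) - xp_reproj.at<double>(1);
double error = dx * dx + dy * dy;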

If the error is large, such as 10^2, your intrinsic camera calibration or your rotation/translation is incorrect (perhaps both). Depending on your coordinate system you can try to invert your projection matrices. To do that you first need to extend them to homogeneous (4x4) form, since you cannot invert a 3x4 matrix (without the pseudo-inverse). Thus, add the fourth row [0 0 0 1], compute the inverse, and remove the fourth row again.
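That extension and inversion could look like this (sketch, reusing Pp from above):

// extend the 3x4 projection matrix to 4x4 so it can be inverted
Mat P4 = Mat::eye(4, 4, CV_64F);
Pp.copyTo(P4(Rect(0, 0, 4, 3)));                // copy [R | t] into the top three rows

Mat P4_inv = P4.inv();                          // invert the rigid-body transform
Mat Pp_inv = P4_inv(Rect(0, 0, 4, 3)).clone();  // drop the fourth row again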

One more thing about the reprojection error: in general, it is the squared distance between your original point correspondence (in each image) and the reprojected position. You can take the square root to get the Euclidean distance between the two points.

who9vy
  • Is the equation `x' * F * x = 0` exactly 0 in practice, too? Are values like 1.12345e-14 still good? Am I right that I can discard all outliers before the NCC calculation? – pichsenmeister May 28 '13 at 12:20
  • I tried the things you described and ended up with the projection matrix. The reprojection error is < 10^2. But what I don't get is how to move the camera position in world coordinates. I want to calculate in which direction (x,y,z) the camera moved. I thought this was done by `new_pos = old_pos + -R.t()*t;`. Since you said that's wrong, do you know how I can do that with the projection matrix? – pichsenmeister May 28 '13 at 18:52
  • 2
    One camera is in `(0, 0, 0)` and the other is located in `t`. Furthermore, the second camera is rotated by `R`. The matrix `P` (called projection matrix) which is composed from `R` and `t`is rigid body transformation transforming each 3D point `p` from the coordinate system represented by the identity matrix to the coordinate system represented by `P`. The transformation is done by `newp = P * p`, where `P` is a 3x4 or 4x4 matrix and `p` is a homogeneous 3D point, i.e. 4-vector. In general, the last component of `p` is equal to 1. – who9vy May 30 '13 at 12:01
  • Thanks for your answer. Your explanations were very helpful and it seems that I get useful results now. – pichsenmeister May 30 '13 at 21:52
  • When I decompose the essential matrix, U and Vt have only 0, 1 and -1 elements, while S, which is supposed to be diag(1,1,0), is diag(4.28, 1, 0.902). Because the U and Vt matrices are off, I cannot proceed to calculating the translation and rotation. What should I do to decompose the E matrix correctly? – b_m Oct 09 '13 at 21:08
  • As noted in http://en.wikipedia.org/wiki/Eight-point_algorithm you can enforce the constraint by computing the SVD of `E`, i.e., `E = U*D*V'`. Afterwards, your new essential matrix `E'` is given by `E'=U*diag(1,1,0)*V'`. – who9vy Oct 09 '13 at 21:28

To update your camera position, you have to update the translation first, then update the rotation matrix.

t_ref += lambda * (R_ref * t);
R_ref = R * R_ref;

where t_ref and R_ref are your accumulated camera state, R and t are the newly calculated camera rotation and translation, and lambda is the scale factor.
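For example (sketch; everything is assumed to be a CV_64F cv::Mat, and lambda has to come from an external source such as a known baseline or odometry, since the translation from the essential matrix is only defined up to scale):

// global camera pose, starting at the origin with identity orientation
Mat t_ref = Mat::zeros(3, 1, CV_64F);
Mat R_ref = Mat::eye(3, 3, CV_64F);
double lambda = 1.0;   // unknown scale of the relative translation

// after each new relative motion (R, t) from the essential matrix decomposition:
t_ref += lambda * (R_ref * t);
R_ref = R * R_ref;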

吳順成