Use-case
- Generate a synthetic 3D scene using random points
- Generate two synthetic cameras
- Get the 2D projections from the two cameras
- Derive the Fundamental & Essential matrices
- Derive rotation & translation using the Essential matrix
- Triangulate the two 2D projections to recover the initial 3D scene
Implementation
- Random 3D points ( x, y, z ) are generated
- Camera intrinsic matrix is statically defined
- Rotation matrix is statically defined ( 25 deg rotation about the Z-axis )
- Translation matrix is the identity ( i.e. no translation )
- Two projections are synthetically generated ( K*R*T ); the first sketch after this list shows this setup up to the Essential matrix
- Fundamental matrix is resolved using cv::findFundamentalMat ( F )
- Essential matrix E is computed using 'K.t() * F * K'
- Camera extrinsics are extracted using SVD, resulting in 4 possible solutions ( in accordance with Hartley & Zisserman, Multiple View Geometry, chapter 9.2.6 ); see the second sketch after this list
- Triangulation is done using cv::triangulatePoints in the following manner: cv::triangulatePoints(K * matRotIdentity, K * R * T, v1, v2, points);
- 'points' is a 4-row, N-column matrix of homogeneous coordinates ( x, y, z, w )
- 'points' is converted to inhomogeneous ( Euclidean ) coordinates by dividing 'x, y, z' by 'w'
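For reference, here is a minimal, self-contained sketch of the setup up to the Essential matrix ( OpenCV / C++ ). The intrinsics, the point ranges and the translation value are placeholders, not necessarily the exact values from my real code:

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <iostream>
#include <vector>

int main()
{
    // Random 3D points ( x, y, z )
    const int N = 100;
    cv::RNG rng(42);
    std::vector<cv::Point3f> objectPoints;
    for (int i = 0; i < N; ++i)
        objectPoints.emplace_back(rng.uniform(-1.f, 1.f),
                                  rng.uniform(-1.f, 1.f),
                                  rng.uniform( 4.f, 8.f));  // keep points in front of both cameras

    // Static intrinsics ( placeholder values )
    cv::Matx33d K(800,   0, 320,
                    0, 800, 240,
                    0,   0,   1);

    // Camera 1: identity rotation, no translation
    cv::Matx33d R1 = cv::Matx33d::eye();
    cv::Vec3d   t1(0, 0, 0);

    // Camera 2: 25 deg rotation about the Z-axis, plus a translation
    // ( a pure rotation gives a zero baseline, see the EDIT below )
    const double a = 25.0 * CV_PI / 180.0;
    cv::Matx33d R2(std::cos(a), -std::sin(a), 0,
                   std::sin(a),  std::cos(a), 0,
                             0,            0, 1);
    cv::Vec3d t2(0.5, 0, 0);

    // Synthetic projections: x ~ K * ( R * X + t )
    auto project = [&](const cv::Matx33d& R, const cv::Vec3d& t,
                       std::vector<cv::Point2f>& out)
    {
        out.clear();
        for (const auto& P : objectPoints)
        {
            cv::Vec3d x = K * (R * cv::Vec3d(P.x, P.y, P.z) + t);
            out.emplace_back(float(x[0] / x[2]), float(x[1] / x[2]));
        }
    };
    std::vector<cv::Point2f> v1, v2;
    project(R1, t1, v1);
    project(R2, t2, v2);

    // Fundamental matrix from the correspondences
    cv::Mat F = cv::findFundamentalMat(v1, v2, cv::FM_8POINT);

    // Essential matrix: E = K^T * F * K
    cv::Mat Kd(K);
    cv::Mat E = Kd.t() * F * Kd;
    std::cout << "E =\n" << E << std::endl;
    return 0;
}
```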
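And a sketch of the extrinsic extraction plus the triangulation. The helper name and the choice of candidate are just for illustration; picking the physically valid ( R, t ) out of the four requires checking that the triangulated points end up in front of both cameras ( as far as I can tell, cv::recoverPose wraps the decomposition together with that test ):

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// E and K are 3x3 CV_64F matrices, v1/v2 the 2D projections from the sketch above
void decomposeAndTriangulate(const cv::Mat& E, const cv::Mat& K,
                             const std::vector<cv::Point2f>& v1,
                             const std::vector<cv::Point2f>& v2)
{
    // SVD of the essential matrix
    cv::SVD svd(E);
    cv::Mat W = (cv::Mat_<double>(3, 3) << 0, -1, 0,
                                           1,  0, 0,
                                           0,  0, 1);

    // The four candidate extrinsics: R in { U*W*Vt, U*Wt*Vt }, t = +/- u3
    cv::Mat Ra = svd.u * W     * svd.vt;
    cv::Mat Rb = svd.u * W.t() * svd.vt;
    if (cv::determinant(Ra) < 0) Ra = -Ra;   // enforce proper rotations
    if (cv::determinant(Rb) < 0) Rb = -Rb;
    cv::Mat t = svd.u.col(2);                // direction only: up to sign and scale

    // For brevity only one of the four combinations ( Ra, +t ) is used here;
    // the valid one is the combination that puts the points in front of both cameras
    cv::Mat R = Ra;

    // P1 = K [ I | 0 ],  P2 = K [ R | t ]
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P2 = K * Rt;

    // 'points' is 4 x N, homogeneous ( x, y, z, w )
    cv::Mat points;
    cv::triangulatePoints(P1, P2, v1, v2, points);
    points.convertTo(points, CV_64F);        // read everything as double below

    // De-homogenize by dividing x, y, z by w
    for (int i = 0; i < points.cols; ++i)
    {
        const double w = points.at<double>(3, i);
        std::cout << points.at<double>(0, i) / w << "  "
                  << points.at<double>(1, i) / w << "  "
                  << points.at<double>(2, i) / w << "\n";
    }
}
```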
The result
The resulting 3D points match the original points up to scale ( ~144 in my case ).
Questions
- Camera translation is derived only up to scale ( at the extrinsic-extraction step above ); with that in mind, would it be right to assume that the triangulation result is also up to scale?
- Is it possible to derive the scale without any prior knowledge of the camera positions or of the absolute size of the scene points?
Any help would be appreciated.
EDIT:
I was trying to use the exact same projection matrices used for the 3D -> 2D projection to convert back from 2D to 3D ( using cv::triangulatePoints ). Surprisingly, this resulted in a null vector ( all 3D points had x, y, z, w == 0 ). This turned out to be because the two cameras differed only by rotation and not by translation: their centers coincide, so the baseline has zero length and the epipolar geometry degenerates, leaving the triangulation without a unique solution; in my case it collapsed to x, y, z == 0, i.e. the null vector.
Adding a translation between the two cameras made the triangulation recover the original coordinates properly. This, however, was still with the exact same projection matrices being used for the 3D -> 2D projection and then for the 2D -> 3D triangulation ( a small sketch of this experiment is below ).
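A minimal sketch of that experiment with a single synthetic point, using the same placeholder K and rotation as above; the comments describe what I observed, not guaranteed output:

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <iostream>
#include <vector>

int main()
{
    // Same placeholder intrinsics and 25 deg Z-rotation as in the sketches above
    cv::Matx33d K(800, 0, 320,  0, 800, 240,  0, 0, 1);
    const double a = 25.0 * CV_PI / 180.0;
    cv::Matx33d R2(std::cos(a), -std::sin(a), 0,
                   std::sin(a),  std::cos(a), 0,
                             0,            0, 1);

    // One known 3D point, projected into both cameras
    const cv::Vec3d X(0.3, -0.2, 6.0);
    auto project = [&](const cv::Matx33d& R, const cv::Vec3d& t) {
        cv::Vec3d x = K * (R * X + t);
        return cv::Point2f(float(x[0] / x[2]), float(x[1] / x[2]));
    };

    // Triangulate with camera 1 fixed at K [ I | 0 ] and camera 2 at K [ R2 | t2 ]
    auto triangulate = [&](const cv::Vec3d& t2) {
        std::vector<cv::Point2f> v1 { project(cv::Matx33d::eye(), cv::Vec3d(0, 0, 0)) };
        std::vector<cv::Point2f> v2 { project(R2, t2) };

        cv::Mat P1 = cv::Mat(K) * cv::Mat::eye(3, 4, CV_64F);
        cv::Mat Rt;
        cv::hconcat(cv::Mat(R2), cv::Mat(t2), Rt);
        cv::Mat P2 = cv::Mat(K) * Rt;

        cv::Mat X4;
        cv::triangulatePoints(P1, P2, v1, v2, X4);
        X4.convertTo(X4, CV_64F);
        cv::Mat Xh = X4.t();   // 1 x 4 row: ( x, y, z, w )
        std::cout << "homogeneous result: " << Xh << std::endl;
    };

    triangulate(cv::Vec3d(0, 0, 0));    // rotation only, zero baseline: degenerate ( this is where I got the null vector )
    triangulate(cv::Vec3d(0.5, 0, 0));  // with a baseline: ( 0.3, -0.2, 6 ) comes back after dividing by w
    return 0;
}
```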
When doing camera pose estimation ( extracting the projection matrix from point correspondences ), the translation is derived only up to scale, and thus the triangulation result is also up to scale.
Question
Is it possible to derive the translation ( how much the camera has moved ) in absolute units ( metric, pixels, ... ) rather than up to scale? What prior knowledge is needed?