
I’m using a modified version of a Gauss-Newton method to refine a pose estimate using OpenCV. The unmodified code can be found here: http://people.rennes.inria.fr/Eric.Marchand/pose-estimation/tutorial-pose-gauss-newton-opencv.html

The details of this approach are outlined in the corresponding paper:

Marchand, Eric, Hideaki Uchiyama, and Fabien Spindler. "Pose estimation for augmented reality: a hands-on survey." IEEE Transactions on Visualization and Computer Graphics 22.12 (2016): 2633-2651.

A PDF can be found here: https://hal.inria.fr/hal-01246370/document

The relevant part (pages 4 and 5) is screen-capped below:

[Screenshot from the paper: "Gauss-Newton Minimization"]
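
Since I can’t reproduce the screenshot here, this is my reading of the equations the code below refers to (the numbering follows the paper; treat it as a paraphrase, not a verbatim copy):

% eq. (7): error between the projected points x(q) and the measured points x*
\mathbf{e}(\mathbf{q}) = \mathbf{x}(\mathbf{q}) - \mathbf{x}^{*}

% eq. (10): update of the pose parameters, with gain lambda and pseudo-inverse J^+
\mathbf{v} = -\lambda \, \mathbf{J}^{+} \, \mathbf{e}(\mathbf{q})

% eq. (11): 2x6 Jacobian (interaction matrix) of one point,
% with x = X/Z, y = Y/Z in the normalized camera frame and Z the depth
\mathbf{J}_{\mathbf{x}} =
\begin{bmatrix}
-1/Z & 0 & x/Z & xy & -(1 + x^2) & y \\
0 & -1/Z & y/Z & 1 + y^2 & -xy & -x
\end{bmatrix}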

Here is what I have done. First, I’ve (hopefully) “corrected” some errors: (a) dt and dR are now passed by reference to exponential_map() (even though cv::Mat is essentially a pointer); (b) the last entry of each 2x6 Jacobian matrix, J.at<double>(i*2+1,5), was -x[i].y but should be -x[i].x; (c) I’ve also tried a different formula for the projection, one that includes the focal length and principal point:

xq.at<double>(i*2,0)   = cx + fx * cX.at<double>(0,0) / cX.at<double>(2,0);
xq.at<double>(i*2+1,0) = cy + fy * cX.at<double>(1,0) / cX.at<double>(2,0);

Here is the relevant code I am using, in its entirety (control starts at optimizePose3()):

// Compute the pose increment corresponding to the 6-dof velocity twist
// v = (vx, vy, vz, theta*ux, theta*uy, theta*uz): the rotation dR via Rodrigues
// and the translation dt via the SE(3) exponential map.
void exponential_map(const cv::Mat &v, cv::Mat &dt, cv::Mat &dR)
{
  double vx = v.at<double>(0,0);
  double vy = v.at<double>(1,0);
  double vz = v.at<double>(2,0);
  double vtux = v.at<double>(3,0);
  double vtuy = v.at<double>(4,0);
  double vtuz = v.at<double>(5,0);
  cv::Mat tu = (cv::Mat_<double>(3,1) << vtux, vtuy, vtuz); // theta u
  cv::Rodrigues(tu, dR);
  double theta = sqrt(tu.dot(tu));
  double sinc = (fabs(theta) < 1.0e-8) ? 1.0 : sin(theta) / theta;
  double mcosc = (fabs(theta) < 2.5e-4) ? 0.5 : (1.-cos(theta)) / theta / theta;
  double msinc = (fabs(theta) < 2.5e-4) ? (1./6.) : (1.-sin(theta)/theta) / theta / theta;
  dt.at<double>(0,0) = vx*(sinc + vtux*vtux*msinc)
                     + vy*(vtux*vtuy*msinc - vtuz*mcosc)
                     + vz*(vtux*vtuz*msinc + vtuy*mcosc);
  dt.at<double>(1,0) = vx*(vtux*vtuy*msinc + vtuz*mcosc)
                     + vy*(sinc + vtuy*vtuy*msinc)
                     + vz*(vtuy*vtuz*msinc - vtux*mcosc);
  dt.at<double>(2,0) = vx*(vtux*vtuz*msinc - vtuy*mcosc)
                     + vy*(vtuy*vtuz*msinc + vtux*mcosc)
                     + vz*(sinc + vtuz*vtuz*msinc);
}
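
(Aside: a minimal, stand-alone sanity check of exponential_map(), not part of the pipeline, is to verify that a zero twist gives the identity rotation and a zero translation. The test function below is mine and assumes exponential_map() above is in scope.)

#include <iostream>
#include <opencv2/opencv.hpp>

// Sanity check: a zero twist should map to dR = identity and dt = 0.
void check_exponential_map_zero_twist()
{
  cv::Mat v = cv::Mat::zeros(6, 1, CV_64F);
  cv::Mat dt(3, 1, CV_64F);
  cv::Mat dR(3, 3, CV_64F);
  exponential_map(v, dt, dR);
  std::cout << "dR (expect identity):\n" << dR << std::endl;
  std::cout << "dt (expect zeros):\n"    << dt << std::endl;
}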


void optimizePose3(const PoseEstimation &pose, 
                   std::vector<FeatureMatch> &feature_matches,
                   PoseEstimation &optimized_pose) {

  //Set camera parameters
  double fx = camera_matrix.at<double>(0, 0); //Focal length
  double fy = camera_matrix.at<double>(1, 1);
  double cx = camera_matrix.at<double>(0, 2); //Principal point
  double cy = camera_matrix.at<double>(1, 2);


  auto inlier_matches = getInliers(pose, feature_matches);

  std::vector<cv::Point3d> wX;
  std::vector<cv::Point2d> x;

  const unsigned int npoints = inlier_matches.size();
  cv::Mat J(2*npoints, 6, CV_64F);
  double lambda = 0.25;
  cv::Mat xq(npoints*2, 1, CV_64F);
  cv::Mat xn(npoints*2, 1, CV_64F);
  double residual=0, residual_prev;


  for(auto i = 0u; i < npoints; i++) {

    //Model points
    const cv::Point2d &M = inlier_matches[i].model_point();

    wX.emplace_back(M.x, M.y, 0.0);


    //Imaged points
    const cv::Point2d &I = inlier_matches[i].image_point();

    xn.at<double>(i*2,0)   = I.x; // x
    xn.at<double>(i*2+1,0) = I.y; // y

    x.push_back(I);
  }


  //Initial estimation
  cv::Mat cRw = pose.rotation_matrix;
  cv::Mat ctw = pose.translation_vector;

  int nIters = 0;

  // Iterative Gauss-Newton minimization loop
  do {
    for (auto i = 0u; i < npoints; i++) {

      cv::Mat cX = cRw * cv::Mat(wX[i]) + ctw;            // Update cX, cY, cZ


      // Update x(q)
      //xq.at<double>(i*2,0)   = cX.at<double>(0,0) / cX.at<double>(2,0);    // x(q) = cX/cZ
      //xq.at<double>(i*2+1,0) = cX.at<double>(1,0) / cX.at<double>(2,0);    // y(q) = cY/cZ
      xq.at<double>(i*2,0)   = cx + fx * cX.at<double>(0,0) / cX.at<double>(2,0);
      xq.at<double>(i*2+1,0) = cy + fy * cX.at<double>(1,0) / cX.at<double>(2,0);

      // Update J using equation (11)
      J.at<double>(i*2,0) = -1 / cX.at<double>(2,0);        // -1/cZ
      J.at<double>(i*2,1) = 0; 
      J.at<double>(i*2,2) = x[i].x / cX.at<double>(2,0);    // x/cZ
      J.at<double>(i*2,3) = x[i].x * x[i].y;                // xy
      J.at<double>(i*2,4) = -(1 + x[i].x * x[i].x);         // -(1+x^2)
      J.at<double>(i*2,5) = x[i].y;                         // y

      J.at<double>(i*2+1,0) = 0;
      J.at<double>(i*2+1,1) = -1 / cX.at<double>(2,0);      // -1/cZ
      J.at<double>(i*2+1,2) = x[i].y / cX.at<double>(2,0);  // y/cZ
      J.at<double>(i*2+1,3) = 1 + x[i].y * x[i].y;          // 1+y^2
      J.at<double>(i*2+1,4) = -x[i].x * x[i].y;             // -xy
      J.at<double>(i*2+1,5) = -x[i].x;                      // -x
    }



    cv::Mat e_q = xq - xn;                     // Equation (7)


    cv::Mat Jp = J.inv(cv::DECOMP_SVD);        // Compute pseudo inverse of the Jacobian

    cv::Mat dq = -lambda * Jp * e_q;           // Equation (10)

    cv::Mat dctw(3, 1, CV_64F), dcRw(3, 3, CV_64F);
    exponential_map(dq, dctw, dcRw);

    cRw = dcRw.t() * cRw;                      // Update the pose
    ctw = dcRw.t() * (ctw - dctw);

    residual_prev = residual;                  // Memorize previous residual
    residual = e_q.dot(e_q);                   // Compute the actual residual

    std::cout << "residual_prev: " << residual_prev << std::endl;
    std::cout << "residual: " << residual << std::endl << std::endl;

    nIters++;

  } while (fabs(residual - residual_prev) > 0);
  //} while (nIters < 30);

  optimized_pose.rotation_matrix = cRw;
  optimized_pose.translation_vector = ctw;
  cv::Rodrigues(optimized_pose.rotation_matrix, optimized_pose.rotation_vector); 
}

Even when I use the functions as given, it does not produce the correct results. My initial pose estimate is very close to optimal, but when I run the program, the method takes a very long time to converge, and when it does, the results are very wrong. I’m not sure what could be wrong and I’m out of ideas. I’m confident my inliers really are inliers (they were chosen using an M-estimator). I’ve compared the results of exponential_map() with those from other implementations, and they seem to agree.

So, where is the error in this Gauss-Newton implementation for pose optimization? I’ve tried to make things as easy as possible for anyone willing to lend a hand. Let me know if there is any more information I can provide. Any help would be greatly appreciated. Thanks.

socratic

  • Does this method work on CPU? Does this method work with simple functions like a parabola with a known optimum? – user3811082 May 08 '17 at 20:51
  • In all of this, I see no mention of debugger usage. Problems with code are solved by the programmer by using their debugger and seeing where in the code the flow, variables, etc. go against what is expected. – PaulMcKenzie May 08 '17 at 21:22
  • I think `xq` should be coordinates in the normalized camera frame (at `z=1`). There are no intrinsic parameters in the original code link. The issue is probably either in the translation from the theory to the code or a mistake in the theory; I cannot help if that is the case. The only other issue I see could be a problem in the matrix inverse computation. In that case, try to implement the code in Matlab, for example, or use another matrix inverse routine. Have you tried different input data to reproduce the issue? Have you tried to reproduce the example from the original code link? – Catree May 09 '17 at 09:27
  • Thank you all for your ideas. @Catree Although the initial code did not use the intrinsic camera parameters, I still think they were wrong not to do so. It worked in their toy example because, when deriving the ground truth, they used normalized coordinates to compute the image points. In a real example, shouldn't the points be projected onto the image plane (z=fx instead of z=1)? The residuals are smaller when I project the (model) points onto the actual image plane and compare with the imaged interest points. Please explain why I am wrong! – socratic May 09 '17 at 18:04
  • In a real example, you have to transform the 2D image coordinates to the normalized camera frame using the intrinsic parameters and the distortion coefficients: `x=(u-cx)/fx` and `y=(v-cy)/fy` (here assuming zero distortion). See in OpenCV: [`cv::undistortPoints()`](http://docs.opencv.org/3.2.0/da/d54/group__imgproc__transform.html#ga55c716492470bfe86b0ee9bf3a1f0f7e) to undistort with distortion coefficients. – Catree May 09 '17 at 18:21
  • @Catree Thanks for your comment. Does it really make a difference whether one transforms the image points to normalized coordinates or, instead, projects the model points using the reverse transformation? Either way, they have to agree. I tried undistorting the points but it didn't make much difference. – socratic May 09 '17 at 19:59
  • @Catree Never mind, `undistortPoints()` was a great idea! The results are very jerky, but they are actually coherent. At first, I didn't realize that `undistortPoints()` doesn't transform the points back to `u` and `v` unless a `P` matrix is provided. So I'm using the `x(q)` as originally provided in the author's implementation. Combined with `undistortPoints()`, that did the trick. If you post an answer, I will accept it. – socratic May 09 '17 at 20:55

1 Answer


Edit: 2019/05/13

There is now a cv::solvePnPRefineVVS() function in OpenCV.
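
A rough sketch of a call (object_points, image_points, dist_coeffs, rvec and tvec are placeholder names; camera_matrix is the intrinsic matrix from the question; rvec/tvec must hold the initial pose and are refined in place):

// Refine an initial pose estimate with the VVS-based refinement in recent OpenCV releases.
std::vector<cv::Point3d> object_points;   // 3D model points
std::vector<cv::Point2d> image_points;    // corresponding measured 2D points (pixels)
// ... fill object_points and image_points ...
cv::Mat rvec, tvec;                       // initial pose (Rodrigues rotation + translation), updated in place
cv::Mat dist_coeffs;                      // distortion coefficients from calibration (can be empty)
cv::solvePnPRefineVVS(object_points, image_points, camera_matrix, dist_coeffs, rvec, tvec);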

Also, when building the Jacobian you should use the x and y computed from the current estimated pose (i.e. the projected points x(q)) instead of the measured image points; see the sketch below.
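
A minimal sketch of that change inside the question's per-point loop (xp, yp and Z are names I introduce here; J, i and cX are the question's variables):

// Normalized coordinates of the *current* projection (hypothetical names)
double xp = cX.at<double>(0,0) / cX.at<double>(2,0);   // x(q) = cX/cZ
double yp = cX.at<double>(1,0) / cX.at<double>(2,0);   // y(q) = cY/cZ
double Z  = cX.at<double>(2,0);                        // depth cZ

J.at<double>(i*2,0)   = -1 / Z;
J.at<double>(i*2,1)   = 0;
J.at<double>(i*2,2)   = xp / Z;
J.at<double>(i*2,3)   = xp * yp;
J.at<double>(i*2,4)   = -(1 + xp * xp);
J.at<double>(i*2,5)   = yp;

J.at<double>(i*2+1,0) = 0;
J.at<double>(i*2+1,1) = -1 / Z;
J.at<double>(i*2+1,2) = yp / Z;
J.at<double>(i*2+1,3) = 1 + yp * yp;
J.at<double>(i*2+1,4) = -xp * yp;
J.at<double>(i*2+1,5) = -xp;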


In the cited paper, they expressed the measurements x in the normalized camera frame (at z=1).

When working with real data, you have:

  • (u,v): 2D image coordinates (e.g. keypoints, corner locations, etc.)
  • K: the intrinsic parameters (obtained after calibrating the camera)
  • D: the distortion coefficients (obtained after calibrating the camera)

To compute the 2D image coordinates in the normalized camera frame, you can use the OpenCV function cv::undistortPoints() (see my answer about cv::projectPoints() and cv::undistortPoints()); a short sketch is given after the formulas below.

When there is no distortion, the computation (also called "reverse perspective transformation") is:

  • x = (u - cx) / fx
  • y = (v - cy) / fy
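
For illustration, a minimal sketch (pixel_pts, normalized_pts and dist_coeffs are placeholder names; camera_matrix is the intrinsic matrix from the question):

// Convert measured pixel coordinates (u, v) to the normalized camera frame.
// Without a P matrix, cv::undistortPoints() returns coordinates at z = 1,
// which is what the minimization in the paper expects.
std::vector<cv::Point2d> pixel_pts;        // measured (u, v), e.g. the inlier image points
// ... fill pixel_pts ...
std::vector<cv::Point2d> normalized_pts;   // output: (x, y) in the normalized camera frame
cv::undistortPoints(pixel_pts, normalized_pts, camera_matrix, dist_coeffs);

// With zero distortion this is equivalent to:
//   x = (u - cx) / fx
//   y = (v - cy) / fy
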
Catree