3d point reconstruction from 2d Images

Question

I am trying understand basics of 3d point reconstruction from 2d stereo images. What I have understood so far can be summarized as below:

For 3d point (depth map) reconstruction, we need 2 images of the same object from 2 different view, given such image pair we also need Camera matrix (say P1, P2)

We find the corresponding points in the two images using methods like SIFT or SURF etc.
After getting corresponding key point, we find find the essential matrix (say K) using minimum 8 key points (used in 8-point algorithm)
Given we are at camera 1, calculate the parameters for camera 2 Using the essential matrix returns 4 possible camera parameters
Eventually we use corresponding points and both camera parameters for 3d point estimation using triangulation method.

After going through theory section, as my first experiment I tried to run the code available here, Which worked as expected. With a few modification in the example.py code I tried to run this example on all the consecutive image pairs and merge the 3-d point clouds for 3d reconstruction of object (dino) as below:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import cv2

from camera import Camera
import structure
import processor
import features

def dino():
    # Dino
    img1 = cv2.imread('imgs/dinos/viff.003.ppm')
    img2 = cv2.imread('imgs/dinos/viff.001.ppm')
    pts1, pts2 = features.find_correspondence_points(img1, img2)
    points1 = processor.cart2hom(pts1)
    points2 = processor.cart2hom(pts2)

    fig, ax = plt.subplots(1, 2)
    ax[0].autoscale_view('tight')
    ax[0].imshow(cv2.cvtColor(img1, cv2.COLOR_BGR2RGB))
    ax[0].plot(points1[0], points1[1], 'r.')
    ax[1].autoscale_view('tight')
    ax[1].imshow(cv2.cvtColor(img2, cv2.COLOR_BGR2RGB))
    ax[1].plot(points2[0], points2[1], 'r.')
    fig.show()

    height, width, ch = img1.shape
    intrinsic = np.array([  # for dino
        [2360, 0, width / 2],
        [0, 2360, height / 2],
        [0, 0, 1]])

    return points1, points2, intrinsic


points3d = np.empty((0,0))
files = glob.glob("imgs/dinos/*.ppm")
len = len(files)

for item in range(len-1):
    print(files[item], files[(item+1)%len])
    #dino() function takes 2 images as input
    #and outputs the keypoint point matches(corresponding points in two different views) along the camera intrinsic parameters.
    points1, points2, intrinsic = dino(files[item], files[(item+1)%len])
    #print(('Length', len(points1))
    # Calculate essential matrix with 2d points.
    # Result will be up to a scale
    # First, normalize points
    points1n = np.dot(np.linalg.inv(intrinsic), points1)
    points2n = np.dot(np.linalg.inv(intrinsic), points2)
    E = structure.compute_essential_normalized(points1n, points2n)
    print('Computed essential matrix:', (-E / E[0][1]))

    # Given we are at camera 1, calculate the parameters for camera 2
    # Using the essential matrix returns 4 possible camera paramters
    P1 = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])
    P2s = structure.compute_P_from_essential(E)

    ind = -1
    for i, P2 in enumerate(P2s):
        # Find the correct camera parameters
        d1 = structure.reconstruct_one_point(
            points1n[:, 0], points2n[:, 0], P1, P2)

        # Convert P2 from camera view to world view
        P2_homogenous = np.linalg.inv(np.vstack([P2, [0, 0, 0, 1]]))
        d2 = np.dot(P2_homogenous[:3, :4], d1)

        if d1[2] > 0 and d2[2] > 0:
            ind = i

    P2 = np.linalg.inv(np.vstack([P2s[ind], [0, 0, 0, 1]]))[:3, :4]
    #tripoints3d = structure.reconstruct_points(points1n, points2n, P1, P2)
    tripoints3d = structure.linear_triangulation(points1n, points2n, P1, P2)

    if not points3d.size:
        points3d = tripoints3d
    else:
        points3d = np.concatenate((points3d, tripoints3d), 1)


fig = plt.figure()
fig.suptitle('3D reconstructed', fontsize=16)
ax = fig.gca(projection='3d')
ax.plot(points3d[0], points3d[1], points3d[2], 'b.')
ax.set_xlabel('x axis')
ax.set_ylabel('y axis')
ax.set_zlabel('z axis')
ax.view_init(elev=135, azim=90)
plt.show()

But I am getting very unexpected result. Please suggest me if above method is correct or how can i merge multiple 3d point clouds to construct a single 3-d structure.

If you proceed like this, the 3D points reconstructed from each pair will be in different coordinate frames, so simply concatenating them will not give anything meaningful. Let's say you want to build a panorama from a series of pictures taken by rotating the camera progressively. If you just stack the pictures on top of each other, you won't get a panorama. For that you would need to shift the images as they rotate. For the point cloud it is the same, you need to align the separate point clouds consistently with one another. — BConic, Nov 15 '18 at 20:05
Thanks @aldurdisciple, yes I learned your point a couple of days ago. That is why I updated my question to How to merge multiple point clouds of different views? — flamelite, Nov 16 '18 at 04:05
Your code doesn't include the the definition of the `dino` function, and neither does the code you link to. Can you please add it in? — tel, Nov 16 '18 at 08:16
dino() function provides the key point matches and intrinsic parameters, i thought posting its implementation here will be irrelevant though i have given its github link https://github.com/alyssaq/3Dreconstruction — flamelite, Nov 16 '18 at 08:37
My point is that you've made at least some changes to it (since the one in the repo doesn't take any arguments). Given that you've modified it, you should include it. Also, you need to include your import statements (like `import structure`). Otherwise, people can't run your example code. — tel, Nov 16 '18 at 08:54
Aligning all of them consistently to each other simultaneously is usually accomplished through bundle adjustment. — alkasm, Nov 21 '18 at 19:44
@AlexanderReynolds Could you please add a link to a good resource describing an actual bundle adjustment algorithm/implementation? — tel, Nov 21 '18 at 20:12
Not sure the best resource for 3D scenes, but for 2d/panoramas, Richard Szeliski's [Image Alignment and Stitching: A Tutorial](https://www.microsoft.com/en-us/research/publication/image-alignment-and-stitching-a-tutorial/) is a great resource which gives a good high level overview with really great references to dig in more. Hope it's helpful. — alkasm, Nov 21 '18 at 20:15

score 6 · Answer 1 · answered Nov 21 '18 at 19:35

Another possible path of understanding for you would be to look at an open source implementation of structure from motion or SLAM. Note that these systems can become quite complicated. However, OpenSfM is written in Python and I think it is easy to navigate and understand. I often use it as a reference for my own work.

Just to give you a little more information to get started (if you choose to go down this path). Structure from motion is an algorithm for taking a collection of 2D images and creating a 3D model (point cloud) from them where it also solves for the position of each camera relative to that point cloud (i.e. all the returned camera poses are in the world frame and so is the point cloud).

The steps of OpenSfM at a high level:

Read image exif for any prior information you can use (e.g. focal length)
Extract feature points (e.g. SIFT)
Match feature points
Turn these feature point matches into tracks (e.g. if you saw a feature point in image 1,2, and 3, then you can connect that into a track instead of match(1,2), match(2,3), etc...)
Incremental Reconstruction (note that there is also a global approach). This process will use the tracks to incrementally add images to the reconstruction, triangulate new points, and refine the poses/point positions using a process called Bundle Adjustment.

Hopefully that helps.

BConic · Answer 2 · 2018-11-21T06:19:44.590

The general idea is as follows.

In each iteration of your code, you compute the relative pose of the right camera with respect to the left. Then you triangulate the 2D points and concatenate the resulting 3D points in a big array. But the concatenated points are not in the same coordinate frame.

What you need to do instead is to accumulate the estimated relative poses in order to maintain an absolute pose estimate. Then you can triangulate the 2D points as before, but before concatenating the resulting points, you need to map them to the coordinate frame of the first camera.

Here is how to do this.

First, before the loop, initialize an accumulation matrix absolute_P1:

points3d = np.empty((0,0))
files = glob.glob("imgs/dinos/*.ppm")
len = len(files)
absolute_P1 = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])

for item in range(len-1):
    # ...

Then, after the feature triangulation, map the 3D points to the coordinate frame of the first camera and update the accumulated pose:

# ...
P2 = np.linalg.inv(np.vstack([P2s[ind], [0, 0, 0, 1]]))
tripoints3d = structure.linear_triangulation(points1n, points2n, P1, P2[:3, :4])

abs_tripoints3d = np.matmul(absolute_P1, np.vstack([tripoints3d, np.ones(np.shape(tripoints3d)[1])]))
absolute_P1 = np.matmul(absolute_P1, np.linalg.inv(P2)) # P2 needs to be 4x4 here!

if not points3d.size:
    points3d = abs_tripoints3d
else:
    points3d = np.concatenate((points3d, abs_tripoints3d), 1)

# ...

This answer is not right. [Here's a figure](https://imgur.com/a/gdMF9xH) with what it produces for the first two sets of dino images. It just gets worse from there. — tel, Nov 21 '18 at 03:45
Thanks for the feedback @tel, I must have misunderstood the camera pose convention OP is using. I updated the second code snippet, it should work better now. — BConic, Nov 21 '18 at 06:22

score -2 · Answer 3 · answered Nov 21 '18 at 17:14

TL;DR

You may not be able to get the full 3D reconstruction you want by just combining all of the 2 image reconstructions. I tried to do this in many different ways, and none of them worked. Basically, the failures all seem to boil down to noise in the 2 image pose estimation algorithm, which frequently produces unreasonable results. Any attempt to track the absolute pose by simply combining all of the 2 image poses just propagates the noise throughout the reconstruction.

The code in the repo that the OP is working with is based on a textbook, Multiple View Geometry in Computer Vision. Chapter 19 cites a paper that discusses a successful 3D reconstruction of the dinosaur sequence, and their approach is somewhat more involved. In addition to 2 image reconstructions, they also use 3 image reconstructions, and (maybe most importantly) a fitting step at the end that helps to ensure that no single spurious result ruins the reconstruction.

code

...in progress

3d point reconstruction from 2d Images

3 Answers3

TL;DR

code