I am currently trying to import and parse outputs from the SMPL/VIBE model into Blender. The goal is to overlay the predicted mesh/armature on the original image and render it in Blender, similar to the visualization that pyrender produces as part of the VIBE pipeline.
I have been able to parse all of the bone poses etc. into Blender with no issues, but the camera itself is giving me problems.
There are several outputs from SMPL/VIBE that are relevant here, specifically "pred_cam" (the predicted weak perspective camera in cropped image (bbox) coordinates, (s, tx, ty)) and "orig_cam" (the weak perspective camera in uncropped image coordinates, (sx, sy, tx, ty)).
The issue at hand is that I do not know how to map any of these parameters onto Blender's camera projection models.
The pred_cam values are generated by the SMPL/VIBE model, and orig_cam is calculated from them with the following function:
import numpy as np

def convert_crop_cam_to_orig_img(cam, bbox, img_width, img_height):
    '''
    Convert predicted camera from cropped image coordinates
    to original image coordinates
    :param cam (ndarray, shape=(3,)): weak perspective camera in cropped img coordinates
    :param bbox (ndarray, shape=(4,)): bbox coordinates (c_x, c_y, h)
    :param img_width (int): original image width
    :param img_height (int): original image height
    :return:
    '''
    cx, cy, h = bbox[:,0], bbox[:,1], bbox[:,2]
    hw, hh = img_width / 2., img_height / 2.
    sx = cam[:,0] * (1. / (img_width / h))
    sy = cam[:,0] * (1. / (img_height / h))
    tx = ((cx - hw) / hw / sx) + cam[:,1]
    ty = ((cy - hh) / hh / sy) + cam[:,2]
    orig_cam = np.stack([sx, sy, tx, ty]).T
    return orig_cam
From what I understand, an orthographic projection model would seem the most suitable here, since Blender's camera has an orthographic scale factor and corresponding Shift X/Y values that scale with the orthographic scale of the camera.
However, simply plugging the orig_cam values into the camera (assuming orthographic_scale = sx) does not yield the expected results.
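For reference, this is roughly what that attempt looks like in bpy; the direct mapping of sx/tx/ty onto ortho_scale and shift_x/shift_y is exactly the part I am guessing at:

import bpy

sx, sy, tx, ty = orig_cam[0]                 # one frame's weak perspective camera

cam_data = bpy.data.cameras.new("vibe_cam")
cam_data.type = 'ORTHO'
cam_data.ortho_scale = sx                    # my (probably wrong) assumption: ortho scale == sx
cam_data.shift_x = tx                        # likewise for the shifts
cam_data.shift_y = ty

cam_obj = bpy.data.objects.new("vibe_cam", cam_data)
bpy.context.scene.collection.objects.link(cam_obj)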
Camera translation values (tx,ty):
My guess is that all of the output parameters are somehow normalized to the image space of the original input image (see the function above) and that I would have to remap the tx, ty values to NDC/screen-space coordinates. Is this the case?
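By "remap" I mean something like the usual NDC-to-pixel conversion, assuming tx/ty really are in [-1, 1] over the full frame (I am not even sure about the y direction):

# assuming tx/ty are NDC of the full image, this is the kind of remap I have in mind
px = (tx + 1.0) * 0.5 * img_width     # NDC x in [-1, 1] -> pixel column
py = (1.0 - ty) * 0.5 * img_height    # NDC y in [-1, 1] -> pixel row (top-left origin, y flipped)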
I had also found an approach that states that, assuming a constant focal length, one can convert a weak perspective camera to a perspective camera:
perspective_camera = torch.stack([pred_camera_translation_x, pred_camera_translation_y,
                                  2 * focal_length / (crop_size * pred_camera_scale + 1e-9)], dim=-1)
My implementation as suggested here:
w = 1080
h = 1920
cam = orig_cam     # original camera in image space
cam_s = cam[0:1]   # camera scale (sx)
cam_pos = cam[2:]  # camera translation (tx, ty)
flength = w / 2.
tz = flength / (0.5 * w * cam_s)
trans = -np.hstack([cam_pos, tz])
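In Blender I then applied this roughly as follows (whether flength, which I believe is in pixels, can be plugged straight into cam_data.lens, which Blender interprets as millimetres, is one of the things I am unsure about; "smpl_mesh" is just a placeholder name for my imported mesh):

import bpy

cam_data = bpy.data.cameras.new("vibe_persp_cam")
cam_data.type = 'PERSP'
cam_data.lens = flength   # ~540 -- Blender treats this as millimetres, which already feels wrong

cam_obj = bpy.data.objects.new("vibe_persp_cam", cam_data)
bpy.context.scene.collection.objects.link(cam_obj)

# camera stays at the origin looking down -Z (Blender's default camera orientation);
# the mesh gets the weak-perspective-to-perspective translation (ignoring any y/z sign flips)
mesh_obj = bpy.data.objects["smpl_mesh"]
mesh_obj.location = (float(trans[0]), float(trans[1]), float(trans[2]))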
However, none of these approaches yielded promising results: a perspective camera with a focal length of approx. 500 seems very improbable and would require the mesh to be scaled down very small.
Camera Scale values (sx,sy)
What is also very confusing is Blender's orthographic scale. It seems to be an arbitrary zoom value and does not appear to correlate with any other unit (i.e. a scale of 2 does not correspond to some focal length, nor does an ortho scale of 2 mean 2 units in Z depth, etc.). To make matters worse, the default value is 6.0.
What do the output sx, sy values correspond to? What does an sx/sy value of 1 represent? Would it mean that the bounding box of the tracked body is the same size as the image?
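To make the question concrete, here is what the conversion above produces for made-up but plausible numbers from my footage (the bbox values are purely illustrative):

# made-up but plausible values for a 1080x1920 portrait frame
pred_cam = np.array([[0.9, 0.05, -0.10]])      # (s, tx, ty) from VIBE
bbox = np.array([[540., 960., 800., 800.]])    # (c_x, c_y, h, _) -- only the first three columns are read
orig_cam = convert_crop_cam_to_orig_img(pred_cam, bbox, img_width=1080, img_height=1920)
print(orig_cam)   # -> [[0.667, 0.375, 0.05, -0.1]] -- and I do not know what these sx/sy numbers mean in Blender terms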
Pyrender approach
In the VIBE pipeline, the scale values are simply passed on and pyrender sets the camera projection matrix directly. I am considering creating a camera in an .obj and importing it into Blender, since AFAIK the setter for the camera's projection matrix is not exposed in bpy.
import numpy as np
import pyrender

class WeakPerspectiveCamera(pyrender.Camera):
    def __init__(self,
                 scale,
                 translation,
                 znear=pyrender.camera.DEFAULT_Z_NEAR,
                 zfar=None,
                 name=None):
        super(WeakPerspectiveCamera, self).__init__(
            znear=znear,
            zfar=zfar,
            name=name,
        )
        self.scale = scale
        self.translation = translation

    def get_projection_matrix(self, width=None, height=None):
        P = np.eye(4)
        P[0, 0] = self.scale[0]
        P[1, 1] = self.scale[1]
        P[0, 3] = self.translation[0] * self.scale[0]
        P[1, 3] = -self.translation[1] * self.scale[1]
        P[2, 2] = -1
        return P

# in the main loop
sx, sy, tx, ty = cam
camera = WeakPerspectiveCamera(
    scale=[sx, sy],
    translation=[tx, ty],
    zfar=1000.
)
camera_pose = np.eye(4)
cam_node = self.scene.add(camera, pose=camera_pose)
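My (possibly wrong) reading of that matrix: compared against a textbook symmetric orthographic projection, where P[0, 0] = 2 / view_width, a scale of sx would correspond to a visible width of 2 / sx scene units, which would presumably be what goes into Blender's ortho_scale. This is the comparison I have been staring at, so please correct me if the premise is off:

import numpy as np

def weak_perspective_projection(sx, sy, tx, ty):
    # same matrix the VIBE renderer builds (see the class above)
    P = np.eye(4)
    P[0, 0] = sx
    P[1, 1] = sy
    P[0, 3] = tx * sx
    P[1, 3] = -ty * sy
    P[2, 2] = -1
    return P

def symmetric_ortho_projection(width, height, znear=0.1, zfar=1000.0):
    # textbook OpenGL-style orthographic projection for a view box centred on the camera axis
    P = np.eye(4)
    P[0, 0] = 2.0 / width
    P[1, 1] = 2.0 / height
    P[2, 2] = -2.0 / (zfar - znear)
    P[2, 3] = -(zfar + znear) / (zfar - znear)
    return P

# matching P[0, 0]: an sx of ~0.67 (the example value above) would mean a visible width of
# 2 / 0.67 ~ 3 scene units, but I am not certain this is the right way to relate the two matrices.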
I feel like I have all of the information needed to solve this, but for the life of me I can't seem to piece it together, probably due to my insufficient knowledge of the weak perspective camera model. If any of you have expertise on this topic and could point me in the right direction, I would be very grateful.