For your first query regarding the x and y dimensions, there are two explanations.
Reason 1.
For image re-projection the pinhole camera model is used, which works in perspective
(homogeneous) coordinates. Perspective projection uses the camera centre as the centre
of projection, and points are mapped onto the plane z = 1. A 3D point [x y z] is
represented in homogeneous coordinates as [xw yw zw w], and the point it maps to on
the plane is represented by [xw yw zw]. Normalising by w recovers the original point.
So (x, y) -> [x y 1]^T : Homogeneous Image Coordinates
and (x, y, z) -> [x y z 1]^T : Homogeneous Scene Coordinates
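
To make the normalisation step concrete, here is a tiny NumPy sketch; the point values are arbitrary and chosen only for illustration:

```python
import numpy as np

# 3D scene point in homogeneous coordinates: [X, Y, Z, 1]
X_h = np.array([2.0, 1.0, 4.0, 1.0])

# Mapping onto the plane z = 1 amounts to dividing the projected vector
# [xw, yw, zw] by its last component (here w = Z).
x_img_h = X_h[:3] / X_h[2]        # -> [0.5, 0.25, 1.0], a homogeneous image point [x, y, 1]
x, y = x_img_h[0], x_img_h[1]
print(x_img_h, (x, y))
```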
Reason 2.
With respect to the paper you have attached, consider equations (4) and (5): the projection there has the form y = P * R * x.

It is clear that P is of dimension 3x4 and that R is expanded to a 4x4 matrix. By the matrix multiplication rule, the number of columns of the first matrix must equal the number of rows of the second, so for P of size 3x4 and R of size 4x4, x has to be a 4x1 homogeneous column vector [x y z 1]^T.
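
As a quick dimension check, here is a minimal NumPy sketch; the entries of P and the rectifying rotation are placeholder values, not real calibration data:

```python
import numpy as np

# Placeholder 3x4 projection matrix (intrinsics plus rectified offset column).
P = np.array([[720.0,   0.0, 610.0, 45.0],
              [  0.0, 720.0, 173.0,  0.2],
              [  0.0,   0.0,   1.0,  0.003]])

# 3x3 rectifying rotation expanded to a 4x4 matrix.
R = np.eye(4)
R[:3, :3] = np.eye(3)

# Homogeneous 3D point as a 4x1 column vector [X, Y, Z, 1]^T.
x = np.array([[2.0], [1.0], [10.0], [1.0]])

y = P @ R @ x                      # (3x4) @ (4x4) @ (4x1) -> 3x1
u, v = y[0, 0] / y[2, 0], y[1, 0] / y[2, 0]   # divide by the third component to get pixels
print(y.shape, (u, v))
```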
Now coming to your second question about LiDAR-image fusion: it requires the intrinsic parameters (the camera matrix) and the extrinsic parameters (the relative rotation and translation between the two sensors). The rotation and translation together form a 3x4 matrix called the transformation matrix. So the point fusion equation becomes
[x y 1]^T = Camera Matrix * Transformation Matrix * [X Y Z 1]^T
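
Below is a minimal sketch of that fusion step in NumPy. The function name, the calibration values K, R, t, and the example points are illustrative assumptions, not taken from any particular devkit:

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, R, t):
    """Project Nx3 LiDAR points into pixel coordinates using K [R|t]."""
    n = points_xyz.shape[0]
    points_h = np.hstack([points_xyz, np.ones((n, 1))])   # Nx4 homogeneous points
    Rt = np.hstack([R, t.reshape(3, 1)])                   # 3x4 transformation (extrinsic) matrix
    cam = K @ Rt @ points_h.T                              # 3xN projected points
    in_front = cam[2] > 0                                  # keep only points in front of the camera
    uv = cam[:2, in_front] / cam[2, in_front]              # perspective divide -> pixel coordinates
    return uv.T, in_front

# Example with dummy calibration values:
K = np.array([[700.0,   0.0, 620.0],
              [  0.0, 700.0, 190.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                       # placeholder rotation LiDAR -> camera
t = np.array([0.0, -0.08, -0.27])   # placeholder translation (metres)
pts = np.random.rand(5, 3) * 10.0
uv, mask = project_lidar_to_image(pts, K, R, t)
print(uv)
```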
You can also refer to: Lidar Image Fusion KITTI
Once your LiDAR-image fusion is done, you can feed the fused image to your CNN model. I am not aware of DNN modules specifically for LiDAR-fused images.
Hope this helps.