The global scale of your calibration (i.e. the units of measure of the 3D space coordinates) is determined by the geometry of the calibration object you use. For example, when you calibrate in OpenCV using images of a flat checkerboard, the inputs to the calibration procedure are corresponding pairs (P, p) of 3D points P and their images p. The (X, Y, Z) coordinates of the 3D points are expressed in whatever metrical units you choose (mm, cm, inches, miles, ...), as dictated by the size of the target you use (and the optics that image it), while the 2D coordinates of their images are in pixels. The output of the calibration routine is the set of parameters (the components of the projection matrix P and the non-linear distortion parameters k) that "convert" 3D coordinates expressed in those metrical units into pixels.
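A minimal sketch of what this looks like with OpenCV in Python, assuming a 9x6 inner-corner checkerboard, a 25 mm square size, and image filenames matching "calib_*.png" (all three are assumptions you would replace with your own setup):

```python
import glob
import cv2
import numpy as np

# Checkerboard geometry: inner-corner grid and physical square size.
# The 25 mm square size is an assumption; using the target's true dimensions
# is what makes the resulting extrinsics come out in real metric units.
pattern_size = (9, 6)
square_size_mm = 25.0

# 3D coordinates of the corners on the flat target, in millimeters, with Z = 0.
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
objp *= square_size_mm

obj_points = []  # 3D points P (mm), one copy per calibration image
img_points = []  # corresponding 2D corner locations p (pixels)

for fname in glob.glob("calib_*.png"):  # hypothetical filenames
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Outputs: camera matrix K, distortion coefficients, and per-view rotations
# and translations. The translations are in millimeters because the object
# points were given in millimeters.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error (pixels):", rms)
```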
If you don't know (or don't want to use) the actual dimensions of the calibration target, you can just fudge them but leave their ratios unchanged (so that, for example, a square remains a square even though the true length of its side may be unknown). In this case your calibration will be determined up to an unknown global scale. This is actually the common case: in most virtual reality applications you don't really care what the global scale is, as long as the results look correct in the image.
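Continuing the sketch above, the "fudged" case just means building the object points with an arbitrary square size of 1.0, so the unit of 3D space becomes "one checkerboard square":

```python
# Same calibration, but with the square size fudged to 1.0 arbitrary unit.
# The ratios are preserved: the board is still a grid of equal squares.
objp_unit = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp_unit[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
# No multiplication by square_size_mm here.

rms_u, K_u, dist_u, rvecs_u, tvecs_u = cv2.calibrateCamera(
    [objp_unit] * len(img_points), img_points, gray.shape[::-1], None, None)

# Up to estimation noise, the intrinsics K_u and distortion dist_u match the
# metric calibration; only the translations rescale by the unknown factor,
# i.e. tvecs_u[i] is approximately tvecs[i] / square_size_mm for every view.
print(K)
print(K_u)
```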
For example, if you want to add an even puffier pair of 3D lips to a video of Angelina Jolie, and composite them with the original video so that the brand-new fake lips stay attached and look "natural" on her face, you just need to rescale the 3D model of the fake lips so that it correctly overlaps the image of her lips. Whether the model is one yard or one mile away from the CG camera with which you render the composite is completely irrelevant.