your case is pretty simple, but requires initialization for the object since in general a term "object" is ill-defined. It can be a closest object or moving object or even the object that was touched, has certain color, size or shape.
Let's assume that you define object by motion that is whatever moves in your point cloud is an object. I suggest to do this:
- Object detection is easy if object moves more than its size since
then you just may subtract depth maps and end up with your object:
depth1-depth2 > T
but if the object moves slowly and shifts only by a
fraction of its size you have to use whatever high frequency info you
have, which can be depth or colour or both. It is going to be noisy as the figure below shows

- as soon as you have your object selected you may want to clean it by running some morphological filters (erode +
dilate) to erase noise and get a single blob. After that you just
need to find some features in the blob such as average depth or mean
color and look for them in a small window around the object's previous
location in order to rediscover the object;
- finally don't forget to update these features as object moves
through.
Some other ideas you may want to use are: depth gradient, connected components in depth, pre-recording background depth for cleaner subtraction, running grabCut on depth area selected by mouse click, etc.