(1) Yes, neural networks have fixed input dimensions. These can be adjusted to fit your purpose, but at last you need to commit to a defined input dimension, and thus you need to input your images fitting these dimensions. For YOLO I found the following:
layer filters size input output
0 conv 32 3 x 3 / 1 416 x 416 x 3 -> 416 x 416 x 32
It could be that the framework you are using already does that step for you. Maybe somebody could comment on that.
(3) The images / samples you feed during inference, for prediction should be as similar to the training images / samples as possible. So whatever preprocessing you re doing with your training data, you should definitely do the same on your inference data.
(2) Smaller images make sense if your hardware is not able to hold larger images in memory, or if you train with large batch sizes so that your hardware needs to hold multiple images in memory at ones. In the end, the computational time is rather proportional to the amount of operations of your architecture, not necessarily to the images size.