Since you've mentioned that you are a beginner I'll try to be a bit more verbose than normal so please bear with me.
How neural models recognise images
- The layers in a pre-trained model store multiple aspects of the images they were trained on like patterns(lines, curves), colours within the image which it uses to decide if an image is of a specific class or not
- With each layer the complexity of what it can store increases initially it captures lines or dots or simple curves but with each layer, the representation power increases and it starts capturing features like cat ears, dog face, curves in a number etc.
The image below from Keras blog shows how initial layers learn to represent simple things like dots and lines and as we go deeper they start to learn to represent more complex patterns.

Read more about Conv net Filters at keras's blog here
How does using a pretrained model give better results ?
When we train a model we waste a lot of compute and time initially creating these representations and in order to get to those representations we need quite a lot of data too else we might not be able to capture all relevant features and our model might not be as accurate.
So when we say we want to use a pre-trained model we want to use these representations so if we use a model trained on imagenet which has lots of cat pics we can be sure that the model already has representations to identify important features required to identify a cat and will converge to a better point than if we used random weights.
How to use pre-trained weights
So when we say to use pre-trained weights we mean use the layers which hold the representations to identify cats but discard the last layer (dense and output) and instead add fresh dense and output layers with random weights. So our predictions can make use of the representations already learned.
In real life we freeze our pretrained weights during the initial training as we do not want our random weights at the bottom to ruin the learned representations. we only unfreeze the representations in the end after we have a good classification accuracy to fine-tune them, and that too with a very small learning rate.
Which kind of pre-trained model to use
Always choose those pretrained weights that you know has the most amount of representations which can help you in identifying the class you are interested in.
So will using a mnist digits trained weights give relatively bad results when compared with one trained on image net?
Yes, but given that the initial layers have already learned simple patterns like lines and curves for digits using these weights will still put you at an advantage when compared to starting from scratch in most of the cases.