The last two answers at this SO answer should be helpful.
The last one pointed me to SciPy documentation, which includes this quote:
"[np.dot(A,B) is evaluated using BLAS, which] will normally be a
library carefully tuned to run as fast as possible on your hardware by
taking advantage of cache memory and assembler implementation. But
many architectures now have a BLAS that also takes advantage of a
multicore machine. If your numpy/scipy is compiled using one of these,
then dot() will be computed in parallel (if this is faster) without
you doing anything."
So it sounds like it depends on your specific hardware and SciPy compilation. Sometimes np.dot(A,B)
will utilize your multiple cores/processors, sometimes it might not.
To find out which case is yours, I suggest running your toy example (with larger matrices) while you have your system monitor open, so you can see whether just one CPU spikes in activity, or if multiple ones do.