The inproc-ness is certainly going to be a big part of it. I'm surmising that the inproc transport involves a bare minimum of interaction with the operating system; with OS overheads at a minimum (a message transfer is probably little more than a memcpy or two and possibly a semaphore, or similar), it's about as fast as can be.
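For illustration, here's a minimal sketch in C against libzmq of the kind of setup being discussed: two threads in one process talking over an inproc PAIR link (the inproc://demo endpoint name is just made up):

```c
/* Minimal inproc PAIR sketch: two threads, one process.
 * Compile with something like: gcc demo.c -lzmq -lpthread */
#include <zmq.h>
#include <pthread.h>

static void *echo_peer(void *ctx)
{
    void *sock = zmq_socket(ctx, ZMQ_PAIR);
    zmq_connect(sock, "inproc://demo");   /* connects to the bind below */
    char buf[16];
    int n = zmq_recv(sock, buf, sizeof buf, 0);
    zmq_send(sock, buf, n, 0);            /* echo it straight back */
    zmq_close(sock);
    return NULL;
}

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *sock = zmq_socket(ctx, ZMQ_PAIR);
    zmq_bind(sock, "inproc://demo");      /* inproc: bind before connect */

    pthread_t t;
    pthread_create(&t, NULL, echo_peer, ctx); /* peer shares the context */

    char buf[16];
    zmq_send(sock, "ping", 4, 0);
    zmq_recv(sock, buf, sizeof buf, 0);   /* round trip never leaves the
                                             process */
    pthread_join(t, NULL);
    zmq_close(sock);
    zmq_ctx_destroy(ctx);
    return 0;
}
```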
The other transports (ipc, tcp, etc.) all reach down into parts of the OS that do a lot of work. For example, ipc (pipes) involves copying from a source buffer into an OS buffer and then copying back out of that into the destination buffer, plus all the transitions between user and kernel execution contexts, and there are more of those if the messages are longer than 4 kB (or whatever the system page size is). With the inproc transport those transitions aren't there (save maybe one or two for the semaphores), and there's possibly one less memcpy. Similarly, delving into the tcp stack invites a lot of variability.
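One handy property if you want to compare them yourself: with ZeroMQ only the endpoint string changes between transports, so the exact same socket code exercises each one (endpoint names here are illustrative):

```c
/* Same socket code, different transport: only the endpoint changes.
 * (These endpoint names are illustrative.) */
const char *endpoints[] = {
    "inproc://bench",         /* in-process: no kernel round trips      */
    "ipc:///tmp/bench.sock",  /* pipes: user/kernel copies and switches */
    "tcp://127.0.0.1:5555",   /* full tcp stack, even over loopback     */
};
```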
PAIR, too, adds the minimum of complexity and overhead as a distribution pattern: it's strictly one-to-one, no more, so it's also low on overhead. That's my reading of this section in The Guide, which you've already come across. PUB/SUB and the rest all have more going on than is necessary for one-to-one communication.
The minimal OS interaction and the minimal complexity combine to minimise the latency. On some platforms the minimal OS interaction will also help keep the latency fairly consistent.
I'm not deeply knowledgeable about the innards of ZeroMQ, but there's a good chance that inproc+PAIR on top of a real-time OS gives very consistent latency. Often the consistency of the latency matters as much as the shortness of the delays.
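If you want to check the consistency rather than just the average, a sketch like this would do it, assuming an already-connected PAIR socket sock whose peer echoes every message back (as in the earlier sketch, but looping):

```c
/* Sketch: round-trip latency and its spread over a PAIR socket whose
 * peer echoes each message back. max - min is a crude jitter figure. */
#include <zmq.h>
#include <stdio.h>
#include <time.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); /* immune to wall-clock steps */
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

void bench(void *sock)
{
    enum { N = 100000 };
    double min = 1e18, max = 0.0, total = 0.0;
    char buf[16];
    for (int i = 0; i < N; i++) {
        double t0 = now_us();
        zmq_send(sock, "ping", 4, 0);
        zmq_recv(sock, buf, sizeof buf, 0);
        double dt = now_us() - t0;
        if (dt < min) min = dt;
        if (dt > max) max = dt;
        total += dt;
    }
    printf("rtt us: min %.2f  mean %.2f  max %.2f\n", min, total / N, max);
}
```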