
What is the difference between a host processor and a coprocessor? Specifically, between a Xeon Phi coprocessor and a Xeon Phi host processor?

I have performance results from both machines (running a parallelized OpenMP code for the diffusion equation) which show that the host processor is much faster when the same number of threads is used. I would like to understand the differences between the two and relate them to my results.
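Roughly speaking, the kernel is an explicit stencil update; a minimal sketch of the kind of loop I mean (illustrative names only, not the actual code) is:

```c
/* Illustrative sketch, not the actual benchmark code: one explicit time
 * step of a 1D diffusion equation, parallelized over the interior points.
 * n, u, unew and the coefficient r (alpha*dt/dx^2) are placeholder names. */
void diffusion_step(const double *u, double *unew, int n, double r) {
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        unew[i] = u[i] + r * (u[i-1] - 2.0 * u[i] + u[i+1]);
}
```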

Amir
  • What is the exact model of the Phi in your machines? Are you asking about execution modes (models) - https://software.intel.com/en-us/articles/intel-xeon-phi-programming-environment - named "Offload" / "Coprocessor native" / "Symmetric"? Cores of the host CPU (not the Phi, but a standard Xeon E3/E5) are usually faster than Phi cores on scalar code, but the Phi has a lot of cores and they are capable of executing vectorized code. – osgx Oct 28 '15 at 03:39
  • There are no Xeon Phi host processors yet. You have a Xeon host and a Xeon Phi coprocessor. The performance asymmetry for the same number of threads is easily understood if you read the published material on Xeon Phi. There are a few books on this you might want to find online. – Jeff Hammond Oct 28 '15 at 03:40
  • @osgx The model is: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz - It seems the runs were related to execution modes. I know the coprocessor run used the coprocessor-native execution mode, but I'm not sure about the host-processor case. Do you think it should be offload mode? – Amir Oct 28 '15 at 05:25
  • 1
  • @Jeff I found this document: [link](https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf) - It looks like, as you mentioned, Xeon Phi coprocessor cores are slower but can be used in larger numbers, right? So what's the reason? Older technology? – Amir Oct 28 '15 at 05:28
  • Xeon Phi is based upon the Pentium core from 1995 (P54C). It lacks the monster reorder buffer and prefetch capability of modern Xeon cores. In addition, it is single-issue per thread, dual-issue per core (Xeon is something like six-issue now) and runs at a low frequency relative to modern Xeon cores. However, since they are smaller cores running at a lower frequency, one can pack many more into a single die, hence the aggregate performance will be higher for highly concurrent workloads. Plus, Xeon Phi has 512-bit SIMD, which Xeon won't have until Skylake. – Jeff Hammond Nov 01 '15 at 19:25
  • https://software.intel.com/en-us/articles/intel-xeon-phi-core-micro-architecture has terrific details on Xeon Phi uarch. – Jeff Hammond Nov 01 '15 at 19:29
  • Based upon docs online, recent Xeon processors are quad-issue but have up to 8 instruction ports. http://www.agner.org/optimize/instruction_tables.pdf has details. – Jeff Hammond Nov 01 '15 at 19:43

2 Answers


Just to reiterate what Jeff said in the comments, you have a Xeon host with an attached Xeon Phi coprocessor. The current generation of Xeon Phi (Knights Corner) is only available as a coprocessor, not as a standalone Xeon Phi host processor (that should arrive with the next generation, Knights Landing).

When you run your program on your host Xeon without offloading, the E5-2660's specifications say you can run with up to 16 threads (8 cores with Hyper-Threading). Note that each of those cores runs at about 2.2 GHz.

When you run your program in native execution mode on your Xeon Phi coprocessor, you should be able to run with a lot more threads. The optimal number of threads to use depends on the model of Xeon Phi you have (some work best with 56, others with 60). But note that each Xeon Phi core (roughly 1.2 GHz) is noticeably weaker than a single Xeon core (roughly 2.2 GHz). The benefit of the many-core Xeon Phi technology is exactly that: you can run across many cores.
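As a quick, hedged check (plain OpenMP calls, nothing Phi-specific), you can print how many hardware threads the runtime sees and how many a parallel region will use by default; in a native run on a 60-core Phi the first number is typically around 240, on your host around 16:

```c
#include <omp.h>
#include <stdio.h>

/* Sketch: report the thread resources OpenMP sees. On a native run on a
 * 60-core coprocessor omp_get_num_procs() is typically ~240 (4 hardware
 * threads per core); on the host it is ~16. The default team size can be
 * overridden with OMP_NUM_THREADS or omp_set_num_threads(). */
int main(void) {
    printf("logical processors visible to OpenMP: %d\n", omp_get_num_procs());
    printf("default threads per parallel region:  %d\n", omp_get_max_threads());
    return 0;
}
```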

The last very important thing to consider is that the Xeon Phi has a 512-bit-wide SIMD instruction set, so SIMD vectorization pays off much more on the Xeon Phi coprocessor than on the host. In your case, I believe your Xeon host only has a 256-bit SIMD vector processing unit. Therefore, if you haven't already, you can improve your performance on the Xeon Phi (up to 16x if you're working in single precision) by taking advantage of SIMD vectorization; the Xeon host will only give you up to 8x. Just to start you on a Google trek: OpenMP 4.0 lets you write things like #pragma omp simd to tell the compiler which lower-level loops to vectorize throughout your code. If you really want maximum performance from the Xeon Phi, SIMD vectorization is a necessity.
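To sketch what that looks like on the kind of stencil loop from your question (placeholder names only, not your actual code), you thread the outer loop and explicitly request vectorization of the inner one:

```c
/* Sketch only: thread the outer loop of a 2D diffusion sweep and ask the
 * compiler to vectorize the inner loop. nx, ny, u, unew, c0 and c1 are
 * placeholder names. The simd loop maps to the 512-bit vector unit on the
 * Phi and the 256-bit unit on the host. */
void diffusion_step_simd(const double *u, double *unew, int nx, int ny,
                         double c0, double c1) {
    #pragma omp parallel for
    for (int j = 1; j < ny - 1; j++) {
        #pragma omp simd   /* OpenMP 4.0 vectorization directive */
        for (int i = 1; i < nx - 1; i++) {
            unew[j*nx + i] = c0 * u[j*nx + i]
                           + c1 * (u[j*nx + i - 1] + u[j*nx + i + 1]
                                 + u[(j-1)*nx + i] + u[(j+1)*nx + i]);
        }
    }
}
```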

So to directly answer your question: comparing performance results between your Xeon host and Xeon Phi coprocessor using the same number of cores is not a useful comparison. We already know that each Xeon Phi core is slower than each Xeon core. If you want a direct comparison, you should compare results using the maximum each allows (60 cores on the Phi, 16 threads on the host) while taking full advantage of the vector processing unit on each.

NoseKnowsAll
  • Good answer - just a couple of notes: each core on the coprocessor has 4 threads, for a total of 240 threads on a 60-core coprocessor. Each thread issues an instruction at most every other clock, so it takes at least 2 threads per core to keep each core busy. Depending on the cache behavior of your code and how much parallelism there is, you can sometimes get better performance using less than the maximum number of cores. However, if you have the parallelism, using all the cores but one (leave one for the OS, etc.) with 3 or 4 threads per core is optimal. – froth Oct 28 '15 at 17:09
  • 1
  • @froth true. The only reason I didn't add that to my answer is that it hasn't been reflective of my personal experience. With a 60 core coprocessor, I usually see best performance using 60 threads (or sometimes 120 threads). But I've only seen performance degrade whenever I've added more threads past that point. This has been something I've retested every time I've run code on the Xeon Phi though, because what you've mentioned is supposed to be true. – NoseKnowsAll Oct 28 '15 at 17:21
  • @froth Thank you both for the comprehensive replies. I agree with your claims, because my results with 120 threads were a bit faster than with 240 threads when I used a Gauss-Seidel linear solver. With a Jacobi solver, I got almost the same speed for 120 and 240 threads. So the algorithm and solver might also influence this. – Amir Oct 28 '15 at 19:34
  • Wait, are you telling me that KNL can be used as a CPU like a Xeon processor but instead I plug in a KNL processor? What kind of motherboards will use this? Can I plug a Xeon into the same motherboard as a KNL? Or will it be like the x87 and be a co-processor? I had no idea about this. – Z boson Oct 29 '15 at 08:00
  • 1
  • @Zboson Yes. The next-generation KNL will be available as an x86 chip. It will also be available as a coprocessor, like the KNC, that plugs into a PCIe slot. I don't know much more beyond that. – NoseKnowsAll Oct 29 '15 at 15:09

If you are talking about the current generation (KNC) and not the next (KNL), these are the definitions.

Host processor: the ~8-core / ~16-thread Xeon that hosts the coprocessor, i.e. the Xeon to which the coprocessor is connected over the PCIe bus.

Coprocessor: the ~60-core / ~240-thread Xeon Phi card hanging off of your Xeon host on the Xeon's PCIe bus.

The host farms out highly parallel / vectorizable jobs to the coprocessor, either by using offload directives or by running them natively under some distributed programming paradigm such as MPI.
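As a rough sketch of the offload route (the function and variable names below are placeholders, not anything from the question), with the Intel compiler's offload pragma the host ships the data and a parallel region to the card and gets the results back:

```c
/* Rough sketch of the offload model (Intel compiler's "#pragma offload"
 * for KNC). scale_on_mic, a, n and s are placeholder names. The host
 * copies `a` to coprocessor 0, runs the block there across the Phi's
 * threads, and copies the result back when the block finishes. */
void scale_on_mic(float *a, int n, float s) {
    #pragma offload target(mic:0) inout(a : length(n))
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }
}
```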

As to the comment about the next generation host processor, the commenter is referring to the fact that the next generation Xeon Phi (KNL) can be configured either as a coprocessor hanging off the PCIe bus (like the 1st gen Xeon Phi, KNC) or as a normal processor that you plug into a motherboard.

Taylor Kidd