I am curious because I could not find an answer by Googling: is it possible to further optimize code that uses MPI by adding vectorisation, such as SSE or one of its later versions?
What I am interested in is getting more performance out of the CPU, if possible. This is an idea for my thesis, but I still have to discuss it with my mentor and see what we can come up with.
If so, could you please point me to a reference where I can start reading? :D