HPC programming language relying on implicit vectorization

Question

Are there programming languages or language extensions that rely on implicit vectorization?

I need something that makes aggressive assumptions in order to generate good DLP/vectorized code for SSE4.1, AVX, and AVX2 (with or without FMA3/4), in single or double precision, from scalar C code.

For the last 10 years I have had fun writing my HPC kernels with Intel's intrinsics, explicitly vectorized. At the same time I have been regularly disappointed by the quality of the DLP code generated by C/C++ compilers (GCC, clang, LLVM, etc.; I can post specific examples if asked).

From the Intrinsics Guide, it is clear that manually writing HPC kernels with intrinsics for modern platforms is no longer a sustainable option unless I have an army of programmers. There are too many versions and combinations: SSE4.1, AVX, AVX2, AVX512 and its flavors, FMA, SP, DP, half precision. It is just not sustainable if my target platforms are, let's say, the most widespread ones since 2012.

I recently tried the Intel Offline Compiler for OpenCL (CPU). I wrote the kernel "a la CUDA" (i.e. scalar code, implicit vectorization), and to my surprise the generated assembly was very well vectorized (Skylake, AVX2 + FMA in SP). The only limitation I encountered was the lack of built-in functions for data reductions and inter-work-item communication that do not rely on shared memory (these would translate into CPU horizontal adds, or shuffles plus min/max operations).

As pointed out by clemens and sschuberth, the offline compiler is not really a solution unless I fully embrace OpenCL, or hack my caller code to comply with the calling convention of the generated assembly, which includes parameters that I would not need, such as ndrange. Fully embracing OpenCL is not an option for me either, since for TLP I rely on OpenMP and Pthreads (and for ILP I rely on the hardware).

Update

First off, it is worth recalling that implicit vectorization and autovectorization are not the same thing. In fact, I have lost hope in autovectorization (as mentioned above), but not in implicit vectorization.

One of the answers below asks for some code examples. Here I provide a code example of a kernel implementing a third-order upwind scheme for the convection term of the NSE on a 3D structured block. It is worth mentioning that this is a trivial example, since no SIMD inter-lane cooperation/communication is required.
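
The original kernel is not reproduced in this text. As a purely illustrative stand-in (not the original poster's code), a minimal scalar C sketch of a third-order upwind-biased first derivative might look like the following. It assumes a 1D uniform grid with spacing h and a positive advection velocity; the function name `upwind3_ddx` and its signature are hypothetical. Each iteration reads only its own stencil and writes only its own output, which is the "no inter-lane communication" property mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in kernel (NOT the original poster's code):
 * third-order upwind-biased approximation of du/dx on a uniform 1D grid,
 * assuming a positive advection velocity (information travels in +x):
 *   du/dx|_i ~ (u[i-2] - 6 u[i-1] + 3 u[i] + 2 u[i+1]) / (6 h)
 * Only interior points i = 2 .. n-2 are written; boundaries are untouched. */
void upwind3_ddx(size_t n, float h,
                 const float *restrict u, float *restrict dudx)
{
    assert(n >= 4 && h > 0.0f);
    const float inv6h = 1.0f / (6.0f * h);

    /* Every iteration is independent: nothing is carried from one
     * iteration to the next, so each one could occupy its own SIMD lane. */
    for (size_t i = 2; i + 1 < n; ++i) {
        dudx[i] = (u[i - 2] - 6.0f * u[i - 1] + 3.0f * u[i] + 2.0f * u[i + 1])
                  * inv6h;
    }
}
```

The stencil is exact for cubic polynomials, so filling u with i*i*i on a unit grid and comparing against 3*i*i is a quick sanity check.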

Have you tried to help the compiler auto-vectorize, for example by forbidding aliasing? — Hopobcn, Feb 01 '16 at 11:27
Have you ever looked at Intel Cilk++ array notation? This is designed specifically for the DLP work: don't confuse it with the task-based parallelism in Cilk++. This notation allows you to clearly state assumptions, without having to rely on __restrict__ etc. — bcumming, Feb 01 '16 at 13:32
Using OpenMP and pthreads is not a limitation to use OpenCL. OpenCL is completely thread safe, as long as multiple threads do not use the same buffer, you can create different buffers and kernel from different threads and launch them. OpenCL will launch the works and notify the waiting thread when done. In fact, I did use it under that exact case without any problem. I got a big speed up since the GPU part was 20% of my total job, therefore I could run 5 threads in parallel using the same underlying GPU power. — DarkZeros, Feb 01 '16 at 13:56

diegor · Accepted Answer · 2018-10-16T09:56:16.650

Intel SPMD Program Compiler.

At present, the best option is the Intel SPMD Program Compiler (ISPC). ISPC is an open-source compiler whose programming model relies on implicit vectorization (a term borrowed from the Intel OpenCL SDK documentation) to output vectorized assembly code. ISPC maps source code to SSE4.1, AVX, AVX2, KNC, and KNL's AVX512 instructions, for both SP and DP. ISPC's backend is LLVM.

For CFD kernels, in my experience, it delivers unmatched performance. For the portions of code that have to be scalar, one simply adds the "uniform" keyword to the associated variables. There are built-in functions for inter-lane communication, such as shuffle, broadcast, and reduce_add.

Why is ISPC so fast compared to C/C++ compilers? My guess is that C/C++ compilers assume that nothing can be vectorized unless there is clear evidence to the contrary, whereas ISPC assumes that every line of code is executed (independently) by all SIMD lanes unless otherwise specified.

I wonder why ISPC is not widely embraced yet. Maybe it is because it is still at an early stage, but it has already shown great capabilities in the CG/scientific visualization community (Embree, OSPRay). ISPC is a good option for writing HPC kernels, as it seems to nicely bridge the performance-productivity gap.

Benchmark

For the trivial kernel example referenced in the question, the following results were obtained using GCC 4.9.X and ISPC 1.8.2. Performance is reported in terms of FLOPs per cycle.

ICC results are not reported here (in terms of accessibility, is it 100% fair to compare ICC against free and open-source compilers?). Nonetheless, the maximum gain of ICC over GCC observed in this case was about 4x, which does not compromise the superiority of ISPC.

I don't think you understood my answers very well. OpenMP 4.0 `pragma omp simd` has nothing to do with threads, nor does CilkPlus `pragma simd` or Fortran-like array notation. These are implemented using SIMD instructions, not threads. Your assertion about the conservative nature of the Intel C/C++ compiler for vectorization is wrong, except insofar as ISO C/C++ lack explicit vectorization. I would like to see your code and data showing that ISPC is faster than ICC, because I work with the people who develop both. Did you really write C/C++ code that is semantically equivalent to ISPC? — Jeff Hammond, Feb 09 '16 at 13:37

score 7 · Answer 2 · edited May 23 '17 at 12:00

Note that, without a mathematical or code example, it's hard to know what the best answer is here. If you provide a code example, I'll try to implement it in some of the dialects noted below.

Fortran 90

Fortran 90+ colon notation is a great way to realize implicit vectorization, although I suspect Fortran is not something you are willing to use if you're a C intrinsics programmer.

One reasonable source of information on this topic is fortran90.org.

OpenMP 4.0

OpenMP 4.0 introduced the simd directive (pragma omp simd), which instructs the compiler to vectorize the annotated loop. You should look into that as an alternative to intrinsics.

There are plenty of examples of OpenMP 4.0 pragma omp simd online, but a very simple one is Enabling SIMD in program using OpenMP4.0.

Obviously, the final authority on OpenMP is the latest specification: OpenMP Application Programming Interface Version 4.5.
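
As a hedged illustration of the directive described in this section, the sketch below annotates two plain scalar loops. The function names `saxpy_simd` and `dot_simd` are hypothetical. The `reduction(+:sum)` clause is the standard OpenMP way to request a vectorized sum; whether the compiler actually emits good SIMD code still depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i]. The pragma asserts that the iterations are safe to
 * execute concurrently in SIMD lanes (the caller must ensure that x and y
 * do not overlap in a way that creates a dependence). */
void saxpy_simd(size_t n, float a, const float *x, float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

/* Dot product. The reduction clause lets each lane keep a private partial
 * sum; the partial sums are combined when the loop finishes. */
float dot_simd(size_t n, const float *x, const float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With GCC or Clang, `-fopenmp-simd` enables these directives without pulling in the OpenMP threading runtime. Without any OpenMP flag the pragmas are ignored and the loops remain ordinary scalar C.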

CilkPlus

Since you have already indicated that you are willing to write less-than-ISO-standard code, you may be willing to use the CilkPlus extensions to C/C++ supported by the Intel compiler and GCC (and possibly Clang/LLVM, but I haven't verified).

See Best practices for using Intel® Cilk™ Plus and the CilkPlus home page for details.

OpenCL

OpenCL is another good option in theory, but in practice it seems less compelling. I am not an OpenCL user myself, but I work with an author of OpenCL Programming Guide, whom I consider to be a reliable source.

Autovectorization

If all else fails, the Intel 16 compiler does a pretty good job of autovectorizing, but in many cases you have to read the opt reports and use restrict and __assume_aligned.

The best place to start when trying to achieve autovectorization with Intel C/C++ is the -qopt-report compiler option. This will usually tell you what is vectorized and not, as well as why. You may need to use a different allocator (Why use _mm_malloc? (as opposed to _aligned_malloc, alligned_alloc, or posix_memalign) lists the relevant options), and then use __assume_aligned in your kernel. Vector dependencies can be harder to mitigate, although AVX-512CDI instructions may help, provided you use the second-generation Intel Xeon Phi processor (aka Knights Landing) or another product that supports them.
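
The ingredients above (an aligned allocator, restrict, and an alignment promise) can be sketched as follows. This is a minimal example under stated assumptions, not a recipe from the Intel documentation: the 64-byte alignment, the `ASSUME_ALIGNED` macro, and the function names are hypothetical choices for illustration. It uses C11 `aligned_alloc` as the allocator and falls back to the GCC/Clang builtin `__builtin_assume_aligned` when the Intel-specific `__assume_aligned` is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define VEC_ALIGN 64  /* bytes; an assumption, wide enough for AVX-512 loads */

/* Map the alignment promise onto whatever the compiler offers. */
#if defined(__INTEL_COMPILER)
#  define ASSUME_ALIGNED(p) __assume_aligned((p), VEC_ALIGN)
#elif defined(__GNUC__)
#  define ASSUME_ALIGNED(p) ((p) = __builtin_assume_aligned((p), VEC_ALIGN))
#else
#  define ASSUME_ALIGNED(p) ((void)0)
#endif

/* Allocate room for n floats on a VEC_ALIGN boundary; release with free().
 * The size is rounded up because C11 aligned_alloc expects a multiple of
 * the alignment. */
float *alloc_aligned_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + VEC_ALIGN - 1) / VEC_ALIGN * VEC_ALIGN;
    return aligned_alloc(VEC_ALIGN, rounded ? rounded : VEC_ALIGN);
}

/* y[i] += a * x[i], with no-aliasing (restrict) and alignment promises. */
void axpy_aligned(size_t n, float a,
                  const float *restrict x, float *restrict y)
{
    /* Debug-build check: a false alignment promise is undefined behavior. */
    assert((uintptr_t)x % VEC_ALIGN == 0 && (uintptr_t)y % VEC_ALIGN == 0);
    ASSUME_ALIGNED(x);
    ASSUME_ALIGNED(y);
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Buffers passed to `axpy_aligned` should come from `alloc_aligned_floats` (or another allocator with the same guarantee). The vectorization report remains the way to confirm what the compiler actually did with the loop.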

The Cray compiler also autovectorizes quite well, but is limited to users who have access to a Cray supercomputer.

For those who are curious, my optimism about these compilers is based upon results obtained with NWChem kernels. The best results are obtained with Fortran 77, OpenMP 3/4, and other compiler directives, but at least there is no processor-specific code in there. And the C99 kernels vectorize well enough.

Disclaimer

I work in a research/pathfinding capacity at Intel. I do not work on any of our software products, but I get to learn from the experts in the compiler team from time to time.