Are there programming languages or language extensions that rely on implicit vectorization?
I need something that makes aggressive assumptions to generate good DLP/vectorized code for SSE4.1, AVX, and AVX2 (with or without FMA3/4), in single/double precision, from scalar C code.
For the last 10 years I have had fun relying on Intel's intrinsics to write my explicitly vectorized HPC kernels. At the same time, I have been regularly disappointed by the quality of the DLP code generated by C/C++ compilers (GCC, clang, LLVM, etc.; in case you ask, I can post specific examples).
From the Intrinsics Guide, it is clear that writing HPC kernels "manually" with intrinsics for modern platforms is no longer a sustainable option, unless I have an army of programmers. There are too many versions and combinations: SSE4.1, AVX, AVX2, AVX512 + flavors, FMA, SP, DP, half precision? It's just not sustainable if my target platforms are, let's say, the most widespread ones since 2012.
I recently tried the Intel Offline Compiler for OpenCL (CPU). I wrote the kernel "a la CUDA" (i.e. scalar code, implicit vectorization), and to my surprise the generated assembly was very well vectorized (Skylake, AVX2 + FMA in SP)! The only limitation I encountered was the lack of built-in functions for data reductions/inter-work-item communication that do not rely on shared memory (these would translate into CPU horizontal adds, or shuffles + min/max operations).
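To illustrate the kind of reduction I mean, here is a minimal sketch in plain, portable C. It only emulates the data movement: each halving step corresponds to one shuffle followed by one min (or add) on a SIMD register. The names `LANES`, `lanes_min` and `lanes_sum` are hypothetical, and the width of 8 is an assumption matching one AVX register of single-precision floats; none of this is an OpenCL built-in.

```c
#include <stddef.h>

/* Assumed SIMD width: 8 single-precision lanes (one AVX register). */
enum { LANES = 8 };

static inline float min_f(float a, float b) { return a < b ? a : b; }

/* Tree reduction over the lanes: log2(LANES) steps, each step pairing
 * lane i with lane i + stride. On real hardware every step would be
 * one shuffle plus one min instruction. */
float lanes_min(const float v[LANES])
{
    float t[LANES];
    for (size_t i = 0; i < LANES; ++i)
        t[i] = v[i];
    for (size_t stride = LANES / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)
            t[i] = min_f(t[i], t[i + stride]);
    return t[0];
}

/* Same pattern with addition, i.e. the "horizontal add" case. */
float lanes_sum(const float v[LANES])
{
    float t[LANES];
    for (size_t i = 0; i < LANES; ++i)
        t[i] = v[i];
    for (size_t stride = LANES / 2; stride > 0; stride /= 2)
        for (size_t i = 0; i < stride; ++i)
            t[i] += t[i + stride];
    return t[0];
}
```

What I was missing in the OpenCL kernel is a way to express these two functions over the work-items that share a SIMD register, without a round trip through local memory.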
As pointed out by clemens and sschuberth, the offline compiler is not really a solution unless I fully embrace OpenCL, or hack my caller code to comply with the calling convention of the generated assembly, which includes parameters that I would not need, such as ndrange. Fully embracing OpenCL is not an option for me either, since for TLP I rely on OpenMP and Pthreads (and for ILP I rely on the hardware).
Update
First off, it is worth recalling that implicit vectorization and autovectorization are not the same thing. In fact, I have lost hope in autovectorization (as mentioned above), but not in implicit vectorization.
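To make the distinction concrete, here is a toy sketch in plain C (not my actual kernel). With implicit vectorization I write only the body for one work-item, as in `saxpy_item`, and the programming model guarantees that consecutive indices are mapped to SIMD lanes. Plain C has no such guarantee, so the driver `saxpy_range` below merely emulates that mapping with a fixed-width inner loop; whether a C compiler actually emits SIMD code for it is exactly the autovectorization gamble. The names `saxpy_item`, `saxpy_range` and `LANES` are hypothetical.

```c
#include <stddef.h>

/* Assumed SIMD width: 8 single-precision lanes (one AVX register). */
enum { LANES = 8 };

/* Work-item body: scalar code written for ONE index, "a la CUDA".
 * In an implicitly vectorized model this is all I would write. */
static inline void saxpy_item(size_t gid, float a,
                              const float *restrict x, float *restrict y)
{
    y[gid] = a * x[gid] + y[gid];
}

/* Emulation of what the implicit-vectorization compiler does for me:
 * work-items are processed in groups of LANES consecutive indices
 * (one group per SIMD register), followed by a scalar remainder. */
void saxpy_range(size_t n, float a,
                 const float *restrict x, float *restrict y)
{
    size_t base = 0;
    for (; base + LANES <= n; base += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            saxpy_item(base + lane, a, x, y);
    for (; base < n; ++base)   /* leftover work-items */
        saxpy_item(base, a, x, y);
}
```

In the autovectorization case the compiler has to prove on its own that the inner loop is safe and profitable to vectorize; in the implicit case that decision is part of the language semantics.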
One of the answers below asks for some code examples. Here I provide a kernel implementing a third-order upwind scheme for the convection term of the Navier-Stokes equations (NSE) on a 3D structured block. It is worth mentioning that this is a trivial example, since no SIMD inter-lane cooperation/communication is required.