4

I am trying to create a view of an array object to better utilise SIMD vectors on the x86_64 platform.

Here's the main idea:

   type Char_Set_Index is range 0 .. 7;
   type Char_Set_Element is mod 2 ** 32;
   
   type Character_Set_Vector is array (Char_Set_Index) of Char_Set_Element
     with Alignment => 32,Component_Size => 32, Object_Size => 256, Size => 256;
   
   type Character_Set is array (Character) of Boolean
     with Alignment => 32, Component_Size => 1, Object_Size => 256, Size => 256;

Essentially, some of the operations in Ada.Character.Maps can better be processed using SIMD arithmetic. For instance the "=" operation, perhaps coded as,

   function "="
     (Left, Right : in Character_Set)
      return Boolean
   is
     (for all k in Character_Set'Range =>
         (Left(k) = Right(k)));

.. gives us the following output

.LFB4:
    .cfi_startproc
    movq    %rdi, %r8
    movq    %rsi, %rdi
    xorl    %esi, %esi
    jmp .L6
    .p2align 4,,10
    .p2align 3
.L10:
    addl    $1, %esi
    cmpl    $256, %esi
    je  .L9
.L6:
    movl    %esi, %edx
    movl    %esi, %ecx
    sarl    $3, %edx
    andl    $7, %ecx
    movslq  %edx, %rdx
    movzbl  (%rdi,%rdx), %eax
    xorb    (%r8,%rdx), %al
    shrb    %cl, %al
    testb   $1, %al
    je  .L10
    xorl    %eax, %eax
    ret
.L9:
    movl    $1, %eax
    ret
    .cfi_endproc

Critically, it is comparing each bit, and GCC won't vectorise it. However, if we write,

   function "="
     (Left, Right : in Character_Set)
      return Boolean
   is
      
      u : aliased constant Character_Set_Vector
        with Import, Address => Left'Address;
      
      v : aliased constant Character_Set_Vector
        with Import, Address => Right'Address;
   
      Temp : array (Char_Set_Index) of Integer;
      Sum  : Integer;
   
   begin
   
      for j in Temp'Range loop
         pragma Loop_Optimize (Vector);
         Temp(j) := (if u(j) = v(j) then 0 else 1);
      end loop;
   
      Sum := 0;
      for j in Temp'Range loop
         Sum := Sum + Temp(j);
      end loop;
   
      return Sum = 0;
   
   end "=";

We get the branch-free SIMD instructions that we kind of expect,

    .cfi_startproc
    vmovdqa (%rdi), %ymm1
    vpcmpeqd    (%rsi), %ymm1, %ymm1
    vpandn  .LC0(%rip), %ymm1, %ymm1
    vextracti128    $0x1, %ymm1, %xmm0
    vpaddd  %xmm1, %xmm0, %xmm0
    vpsrldq $8, %xmm0, %xmm1
    vpaddd  %xmm1, %xmm0, %xmm0
    vpsrldq $4, %xmm0, %xmm1
    vpaddd  %xmm1, %xmm0, %xmm0
    vmovd   %xmm0, %eax
    testl   %eax, %eax
    sete    %al
    vzeroupper
    ret
    .cfi_endproc

Which all works rather well. Now, the problem at hand. If you push this code through SPARK Ada, there are a number of complaints regarding alignment, aliasing, and constants, so you have to end up writing,

   function "="
     (Left, Right : in Character_Set)
      return Boolean
   is
      
      Left_Aligned : constant Character_Set := Left
        with Alignment => 32;
      
      Right_Aligned : constant Character_Set := Right
        with Alignment => 32;
      
      u : aliased constant Character_Set_Vector
        with Import, Alignment => 32, Address => Left_Aligned'Address;
      
      v : aliased constant Character_Set_Vector
        with Import, Alignment => 32, Address => Right_Aligned'Address;
   
      Temp : array (Char_Set_Index) of Integer;
      Sum  : Integer;
   
   begin
   
      for j in Temp'Range loop
         pragma Loop_Optimize (Vector);
         Temp(j) := (if u(j) = v(j) then 0 else 1);
      end loop;
   
      Sum := 0;
      for j in Temp'Range loop
         Sum := Sum + Temp(j);
      end loop;
   
      return Sum = 0;
   
   end "=";

which gives us an awful lot of precopying, presumably to ensure that everything is aligned OK - even though the declarations already have the correct alignment,

    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    andq    $-32, %rsp
    vmovdqa (%rdi), %xmm2
    vmovdqa 16(%rdi), %xmm3
    vmovdqa (%rsi), %xmm4
    vmovdqa 16(%rsi), %xmm5
    vmovdqa %xmm2, -64(%rsp)
    vmovdqa %xmm3, -48(%rsp)
    vmovdqa -64(%rsp), %ymm6
    vmovdqa %xmm4, -32(%rsp)
    vmovdqa %xmm5, -16(%rsp)
    vpcmpeqd    -32(%rsp), %ymm6, %ymm1
    vpandn  .LC0(%rip), %ymm1, %ymm1
    vextracti128    $0x1, %ymm1, %xmm0
    vpaddd  %xmm1, %xmm0, %xmm0
    vpsrldq $8, %xmm0, %xmm1
    vpaddd  %xmm1, %xmm0, %xmm0
    vpsrldq $4, %xmm0, %xmm1
    vpaddd  %xmm1, %xmm0, %xmm0
    vmovd   %xmm0, %eax
    testl   %eax, %eax
    sete    %al
    vzeroupper
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc

Obviously, the only reason one would even bother with this is for greater performance, however, the SPARK Ada rules seem too restrictive in this case, hurting performance. So, my question is, is there a better way of doing this that doesn't result in the excessive moving data around, where, as far as I can tell, it's not required.

Incidentally, Ada.Unchecked_Conversion similarly does a lot of moving data around at the beginning, too.

Also, I realise that I can justify the SPARK Ada checks (false-positive) so I can use the Ada version, but I am hoping that I am missing something, here, and that there is an easier way to do this.

Perhaps there is a way of vectorising arrays of Booleans?

EDIT: I am compiling it using

gnatmake -O3 -mavx2 -gnatn -gnatp -S name-of-package.adb
egilhh
  • 6,464
  • 1
  • 18
  • 19
Mark
  • 153
  • 6

3 Answers3

3

The question of why the alignment of Left and Right is unknown within the body of the function is interesting. You indeed can neither assert on the alignment attribute nor add a precondition to the function stating a requirement on parameter alignment (at least for GNATprove FSF 11.2.0). There is some comment on the issue in the SPARK source code though (see line 3276 in spark_definition.adb).

On the other hand, it seems that you can work around the additional copying of the unchecked conversion by applying the conversion in the loop. Below is what I was able to achieve with GNAT FSF 11.3.1:

character_sets.ads

package Character_Sets with SPARK_Mode is
   
   type Character_Set is array (Character) of Boolean
     with 
       Alignment      => 32,
       Component_Size => 1, 
       Object_Size    => 256, 
       Size           => 256;
   
   function "=" (Left, Right : in Character_Set) return Boolean;

end Character_Sets;

character_sets.adb

with Ada.Unchecked_Conversion;

package body Character_Sets with SPARK_Mode is

   type Char_Set_Index is range 0 .. 7;
   type Char_Set_Element is mod 2 ** 32;
   
   type Character_Set_Vector is array (Char_Set_Index) of aliased Char_Set_Element
     with 
       Alignment      => 32,
       Component_Size => 32, 
       Object_Size    => 256, 
       Size           => 256;
   
   function To_Vector is new Ada.Unchecked_Conversion
     (Source => Character_Set,
      Target => Character_Set_Vector);
   
   ---------
   -- "=" --
   ---------

   function "=" (Left, Right : in Character_Set) return Boolean is
      
      Temp : array (Char_Set_Index) of Integer;
      Sum  : Integer;
   
   begin
     
      for J in Temp'Range loop
         pragma Loop_Optimize (Vector);
         Temp (J) := (if To_Vector (Left) (J) = To_Vector (Right) (J) then 0 else 1);    --  !!!
      end loop;
   
      Sum := 0;
      for J in Temp'Range loop
         Sum := Sum + Temp (J);
      end loop;
   
      return Sum = 0;
   
   end "=";
   
end Character_Sets;

default.gpr

project Default is

   for Source_Dirs use ("src");
   for Object_Dir use "obj";
   for Main use ();

   package Compiler is
      for Switches ("ada") use ("-O3", "-mavx2", "-gnatn", "-gnatp");
   end Compiler;

end Default;

output (objdump)

$ objdump -d -M intel ./obj/character_sets.o 

./obj/character_sets.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <character_sets__Tcharacter_setBIP>:
   0:   c3                      ret    
   1:   90                      nop
   2:   66 66 2e 0f 1f 84 00    data16 cs nop WORD PTR [rax+rax*1+0x0]
   9:   00 00 00 00 
   d:   0f 1f 00                nop    DWORD PTR [rax]

0000000000000010 <character_sets__Tcharacter_set_vectorBIP>:
  10:   c3                      ret    
  11:   90                      nop
  12:   66 66 2e 0f 1f 84 00    data16 cs nop WORD PTR [rax+rax*1+0x0]
  19:   00 00 00 00 
  1d:   0f 1f 00                nop    DWORD PTR [rax]

0000000000000020 <character_sets__Oeq>:
  20:   c5 fd 6f 0f             vmovdqa ymm1,YMMWORD PTR [rdi]
  24:   c5 f5 76 0e             vpcmpeqd ymm1,ymm1,YMMWORD PTR [rsi]
  28:   c5 f5 df 0d 00 00 00    vpandn ymm1,ymm1,YMMWORD PTR [rip+0x0]        # 30 <character_sets__Oeq+0x10>
  2f:   00 
  30:   c4 e3 7d 39 c8 01       vextracti128 xmm0,ymm1,0x1
  36:   c5 f9 fe c1             vpaddd xmm0,xmm0,xmm1
  3a:   c5 f1 73 d8 08          vpsrldq xmm1,xmm0,0x8
  3f:   c5 f9 fe c1             vpaddd xmm0,xmm0,xmm1
  43:   c5 f1 73 d8 04          vpsrldq xmm1,xmm0,0x4
  48:   c5 f9 fe c1             vpaddd xmm0,xmm0,xmm1
  4c:   c5 f9 7e c0             vmovd  eax,xmm0
  50:   85 c0                   test   eax,eax
  52:   0f 94 c0                sete   al
  55:   c5 f8 77                vzeroupper 
  58:   c3                      ret

output (gnatprove)

$ gnatprove -P ./default.gpr -f
Phase 1 of 2: generation of Global contracts ...
Phase 2 of 2: flow analysis and proof ...

character_sets.adb:31:10: warning: pragma "Loop_Optimize" ignored (not yet supported)
   31 |         pragma Loop_Optimize (Vector);
      |         ^ here
Summary logged in /home/deedee/72423385-spark-ada-overlays-without-copying/obj/gnatprove/gnatprove.out
DeeDee
  • 5,654
  • 7
  • 14
2

Here's the resulting (over-optimised) function after DeeDee's solution,

 function "="
 (Left, Right : in Character_Set)
  return Boolean
is
   Temp : array (Char_Set_Index) of Integer;
   Sum  : Integer;
begin

  for j in Temp'Range loop
     Temp(j) := (if To_Vector(Left)(j) = To_Vector(Right)(j) then -1 else 0);
  end loop;

  Sum := 0;
  for j in Temp'Range loop
     Sum := Sum + Temp(j);
  end loop;
  
  return Sum = -Temp'Length;

 end "=";

Note the change of Temp's values, to match up with Intel's documentation to match properly the result of vpcmpeqd For all that effort (and complication) you get to drop one vpand

Also, it seems possible after moving the vector array into the body instead being private in the specification, allows you to drop the pragma Loop_Optimize

Indeed, if you don't have SIMD available you get,

    .cfi_startproc
    movl    (%rsi), %eax
    cmpl    %eax, (%rdi)
    sete    %dl
    movl    4(%rsi), %ecx
    xorl    %r9d, %r9d
    movl    8(%rsi), %r10d
    movzbl  %dl, %r8d
    movl    12(%rsi), %eax
    negl    %r8d
    cmpl    %ecx, 4(%rdi)
    movl    16(%rsi), %ecx
    sete    %r9b
    xorl    %r11d, %r11d
    subl    %r9d, %r8d
    cmpl    %r10d, 8(%rdi)
    movl    20(%rsi), %r10d
    sete    %r11b
    xorl    %edx, %edx
    subl    %r11d, %r8d
    cmpl    %eax, 12(%rdi)
    movl    24(%rsi), %eax
    sete    %dl
    xorl    %r9d, %r9d
    movl    28(%rsi), %esi
    subl    %edx, %r8d
    cmpl    %ecx, 16(%rdi)
    sete    %r9b
    xorl    %r11d, %r11d
    subl    %r9d, %r8d
    cmpl    %r10d, 20(%rdi)
    sete    %r11b
    xorl    %edx, %edx
    subl    %r11d, %r8d
    cmpl    %eax, 24(%rdi)
    sete    %dl
    xorl    %ecx, %ecx
    subl    %edx, %r8d
    cmpl    %esi, 28(%rdi)
    sete    %cl
    subl    %ecx, %r8d
    cmpl    $-8, %r8d
    sete    %al
    ret
    .cfi_endproc

with,

gnatmake -O2 -funroll-loops -gnatn -gnatp -S name-of-package.adb

which, if you want to avoid branching, seems better than the naieve version

Mark
  • 153
  • 6
2

My first thought on seeing this was, 'Why are you defining "=" for Character_Set?' It comes with "=" predefined.

Let's see what it does:

package Packed_Vectorization is
   type CS is array (Character) of Boolean with
      Component_Size => 1, Size => 256;
  
   type Character_Set is new CS with
      Component_Size => 1, Size => 256;
  
   function "=" (Left : in Character_Set; Right : in Character_Set) return Boolean is
      (CS (Left) = CS (Right) );
end Packed_Vectorization;

The type derivation is there so we can see what code is produced for the predefined "=".

Compiling with

gnatmake -gnatnp -O3 -S packed_vectorization.ads

gives the important part as

packed_vectorization__Oeq:
.LFB2:
    .cfi_startproc
    movq    %rsi, %rdx
    movl    $256, %ecx
    movl    $256, %esi
    jmp system__bit_ops__bit_eq@PLT
    .cfi_endproc

The compiler has a special function just for comparing bit-packed arrays, presumably to optimize this common action. You can look at the implementation of System.Bit_Ops.Bit_Eq; the important part seems to be

if LeftB (1 .. BLen) /= RightB (1 .. BLen) then

where Leftb and Rightb are views of the two arrays as packed arrays of bytes. This is the predefined "/=" for the array-of-bytes type. I was unable to find an object file for System.Bit_Ops, but I'd guess that that "/=" is optimized, too.

Is this acceptable for your use? (I presume you need to optimize your "=" in order to meet your quantified timing requirements, as otherwise there's no reason to worry about this.) If so, then a lot of effort has been expended for nothing.

"Ada Outperforms Assembly: A Case Study", Proceedings of TRI-Ada '92, reports on an Ada (83) compiler producing faster and smaller code than assembler hand optimized by a team of experts. That was 30 years ago. Optimizer technology has no doubt improved since then. Typically, the compiler knows more about optimization than any of us ever will.

"Premature optimization is the root of all evil ..." -- Donald Knuth

Jeffrey R. Carter
  • 3,033
  • 9
  • 10
  • Extremely similar for aarch64. – Simon Wright May 30 '22 at 14:11
  • 1
    For completion: the full source code for `System.Bit_Ops.Bit_Eq` can be found [here](https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/ada/libgnat/s-bitops.adb#l101). This routine seems to anticipate for the generic case in which the array might be unaligned (if I read correctly), so advanced optimizations like vectorization (SSE, etc.) are also somewhat unlikely to be applied. – DeeDee May 30 '22 at 19:33
  • 1
    ... although it is true that simply returning `To_Vector (Left) = To_Vector (Right)` (with `To_Vector` as defined in my [answer](https://stackoverflow.com/a/72424391/10975520)) does indeed give an interesting result: no SIMD, no call to `System.Bit_Ops.Bit_Eq`, but otherwise simple and only 2 branch instructions. The combination of alignment and size seem to be the key factors here. – DeeDee May 30 '22 at 20:31
  • 1
    ""We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." In other words we should not pass up the opportunity to optimise (on average) 1 in 33 lines of code - not leave it until the end, when you *hope* that you'll be able to change your design to factor out inefficiencies identified by profiling. – Mark May 31 '22 at 10:16
  • GCC has long been problematic for identifying and optimising to SIMD. That's what the '__builtins' are for - a workaround. In this case, human beings with a modicum of experience *are* better than the compiler since we can spot emminently parallel loops far better than GCC. A better argument is to hide the use of SIMD inside it's own package and inline the lot, but then you get into having to ship the body with the spec which comes with it's own problems. – Mark May 31 '22 at 10:25
  • Ada.Character.Maps redefines "=", too. And there's an ugly Character_Map_Internal to make it work and to avoid infinite recursion - it uses this ugly type to access the predefined, presumably Bit_Ops, operations, where, despite the computers capabilities we are stuck with using 256/8 comparisons, no matter what compile flags we use, instead of 256/32 comparisons. We could have 256/64 comparisons but GCC just will not see that type as a vector even though, of course, it can be. – Mark May 31 '22 at 10:27
  • "Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail." You have not demonstrated that the predefined `"="` for `Character_Set` is critical for your program. – Jeffrey R. Carter Jun 01 '22 at 11:03
  • "The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their “optimized” programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering ... I don’t want to restrict myself to tools that deny me such efficiencies." Knuth – Mark Jun 01 '22 at 12:36
  • Using SIMD (ie improving the data structure that the compiler uses) yields a 25% improvement with virtually no effort apart from the knowledge that SIMD exists. The problem at hand is to get the compiler to stop copying that data structure into subprograms. If you are planning to extensively use SIMD this is with absolutely no doubt a design issue the with intended x86_64 platform – Mark Jun 01 '22 at 12:38
  • "look carefully at the critical code; but only after that code has been identified"; "a 12% improvement, easily obtained, is never considered marginal": taken individually and out of context, these seem to be contradictory. But they are part of the same piece of advice, which is to follow the sequence: 1. Identify the critical code; 2. look carefully at the critical code; 3. if needed, or a significant improvement can be easily obtained, make efficiency improvements to the critical code. Again, you seem to have skipped 1. – Jeffrey R. Carter Jun 02 '22 at 11:15
  • I didn't skip (1) This is for a targeted x86_64 runtime, for which I have no idea, whatsoever, what an end user might require. Sure, I can drivel out some code that works, but given the capabilities of the platform at hand and for all the software engineering philosophies we might argue about, I'd rather it was correct (SPARK Ada) and it was fast (SIMD) I'm sorry if I was not clearer. – Mark Jun 07 '22 at 17:09
  • "I have no idea what an end user requires" is the opposite of "I have identified the critical code", so if you can believe they are the same, there is clearly no point in continuing this discussion. If I presume that you are implementing the standard library (/= the runtime) given in ARM Annex A, and specifically Ada.Strings.Maps, then I note that the GNAT implementation of this, which uses the predefined `"="`, has been in use for at least 15 years, and so is clearly fast enough. – Jeffrey R. Carter Jun 09 '22 at 13:29