7

I'm writing a compiler that's generating LLVM IR instructions. I'm working extensively with vectors.

I would like to be able to sum all the elements in a vector. Right now I'm just extracting each element individually and adding them up manually, but it strikes me that this is precisely the sort of thing that the hardware should be able to help with (as it sounds like a pretty common operation). But there doesn't seem to be an intrinsic to do it.

What's the best way to do this? I'm using LLVM 3.2.

David Given
  • 13,277
  • 9
  • 76
  • 123

1 Answers1

5

First of all, even without using intrinsics, you can generate log(n) vector additions (with n being vector length) instead of n scalar additions, here's an example with vector size 8:

define i32 @sum(<8 x i32> %a) {
  %v1 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %v2 = shufflevector <8 x i32> %a, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sum1 = add <4 x i32> %v1, %v2
  %v3 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
  %v4 = shufflevector <4 x i32> %sum1, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
  %sum2 = add <2 x i32> %v3, %v4
  %v5 = extractelement <2 x i32> %sum2, i32 0
  %v6 = extractelement <2 x i32> %sum2, i32 1
  %sum3 = add i32 %v5, %v6
  ret i32 %sum3
}

If your target has support for these vector additions then it seems highly likely the above will be lowered to use those instructions, giving you performance.

Regarding intrinsics, there are no target-independent intrinsics to handle this. If you're compiling to x86, though, you do have access to the hadd instrinsics (e.g. llvm.x86.int_x86_ssse3_phadd_sw_128 to add two <4 x i32> vectors together). You'll still have to do something similar to the above, only the add instructions could be replaced.

For more information about this you can search for "horizontal sum" or "horizontal vector sum"; for instance, here are some relevant stackoverflow questions for a horizontal sum on x86:

Community
  • 1
  • 1
Oak
  • 26,231
  • 8
  • 93
  • 152
  • I totally hadn't thought of that. It's elegant and works well --- I'm using floats and doubles, and the above code turns into six and seven instructions repeatedly. I don't think I'm going to do much better than that (particularly if the code is optimisation-friendly). Also thanks for the magic keywords which make Google happy; so much easier to find stuff when you know what it's called... – David Given Feb 07 '13 at 10:36