I am testing OpenJDK Panama Vector API jdk.incubator.vector and I performed tests on amazon c5.4xlarge instance. But in every case simple unrolled vector dot product is out performing the Vector API code.
My Question is : Why am I not able to get the performance boost as shown in Richard Startin's blog. The same performance improvement is also discussed in this conference meetup by Intel people. What is missing?
JMH benchmark test results :
Benchmark (size) Mode Cnt Score Error Units
FloatVector256DotProduct.unrolled 1048576 thrpt 25 2440.726 ? 21.372 ops/s
FloatVector256DotProduct.vanilla 1048576 thrpt 25 807.721 ? 0.084 ops/s
FloatVector256DotProduct.vector 1048576 thrpt 25 909.977 ? 6.542 ops/s
FloatVector256DotProduct.vectorUnrolled 1048576 thrpt 25 887.422 ? 5.557 ops/s
FloatVector256DotProduct.vectorfma 1048576 thrpt 25 916.955 ? 4.652 ops/s
FloatVector256DotProduct.vectorfmaUnrolled 1048576 thrpt 25 877.569 ? 1.451 ops/s
JavaDocExample.simpleMultiply 1048576 thrpt 25 2096.782 ? 6.778 ops/s
JavaDocExample.simpleMultiplyUnrolled 1048576 thrpt 25 1627.320 ? 6.824 ops/s
JavaDocExample.vectorMultiply 1048576 thrpt 25 2102.654 ? 11.637 ops/s
AWS instance type : c5.4xlarge
CPU details :
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
Stepping: 4
CPU MHz: 3404.362
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Code snippets. Please see complete code in this github repo
JavaDocExample : This is shared in the java doc of vectorIntrinsic branch of OpenJDK.
@Benchmark
public void simpleMultiplyUnrolled() {
for (int i = 0; i < size; i += 8) {
c[i] = a[i] * b[i];
c[i + 1] = a[i + 1] * b[i + 1];
c[i + 2] = a[i + 2] * b[i + 2];
c[i + 3] = a[i + 3] * b[i + 3];
c[i + 4] = a[i + 4] * b[i + 4];
c[i + 5] = a[i + 5] * b[i + 5];
c[i + 6] = a[i + 6] * b[i + 6];
c[i + 7] = a[i + 7] * b[i + 7];
}
}
@Benchmark
public void simpleMultiply() {
for (int i = 0; i < size; i++) {
c[i] = a[i] * b[i];
}
}
@Benchmark
public void vectorMultiply() {
int i = 0;
// It is assumed array arguments are of the same size
for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
FloatVector va = FloatVector.fromArray(SPECIES, a, i);
FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
FloatVector vc = va.mul(vb);
vc.intoArray(c, i);
}
for (; i < a.length; i++) {
c[i] = a[i] * b[i];
}
}
FloatVector256DotProduct : this code is shamelessly copied from Richard Startin's blog. Thanks Richard for insightful blogs.
@Benchmark
public float vectorfma() {
var sum = FloatVector.zero(F256);
for (int i = 0; i < size; i += F256.length()) {
var l = FloatVector.fromArray(F256, left, i);
var r = FloatVector.fromArray(F256, right, i);
sum = l.fma(r, sum);
}
return sum.reduceLanes(ADD);
}
@Benchmark
public float vectorfmaUnrolled() {
var sum1 = FloatVector.zero(F256);
var sum2 = FloatVector.zero(F256);
var sum3 = FloatVector.zero(F256);
var sum4 = FloatVector.zero(F256);
int width = F256.length();
for (int i = 0; i < size; i += width * 4) {
sum1 = FloatVector.fromArray(F256, left, i).fma(FloatVector.fromArray(F256, right, i), sum1);
sum2 = FloatVector.fromArray(F256, left, i + width).fma(FloatVector.fromArray(F256, right, i + width), sum2);
sum3 = FloatVector.fromArray(F256, left, i + width * 2).fma(FloatVector.fromArray(F256, right, i + width * 2), sum3);
sum4 = FloatVector.fromArray(F256, left, i + width * 3).fma(FloatVector.fromArray(F256, right, i + width * 3), sum4);
}
return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}
@Benchmark
public float vector() {
var sum = FloatVector.zero(F256);
for (int i = 0; i < size; i += F256.length()) {
var l = FloatVector.fromArray(F256, left, i);
var r = FloatVector.fromArray(F256, right, i);
sum = l.mul(r).add(sum);
}
return sum.reduceLanes(ADD);
}
@Benchmark
public float vectorUnrolled() {
var sum1 = FloatVector.zero(F256);
var sum2 = FloatVector.zero(F256);
var sum3 = FloatVector.zero(F256);
var sum4 = FloatVector.zero(F256);
int width = F256.length();
for (int i = 0; i < size; i += width * 4) {
sum1 = FloatVector.fromArray(F256, left, i).mul(FloatVector.fromArray(F256, right, i)).add(sum1);
sum2 = FloatVector.fromArray(F256, left, i + width).mul(FloatVector.fromArray(F256, right, i + width)).add(sum2);
sum3 = FloatVector.fromArray(F256, left, i + width * 2).mul(FloatVector.fromArray(F256, right, i + width * 2)).add(sum3);
sum4 = FloatVector.fromArray(F256, left, i + width * 3).mul(FloatVector.fromArray(F256, right, i + width * 3)).add(sum4);
}
return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(ADD);
}
@Benchmark
public float unrolled() {
float s0 = 0f;
float s1 = 0f;
float s2 = 0f;
float s3 = 0f;
float s4 = 0f;
float s5 = 0f;
float s6 = 0f;
float s7 = 0f;
for (int i = 0; i < size; i += 8) {
s0 = Math.fma(left[i + 0], right[i + 0], s0);
s1 = Math.fma(left[i + 1], right[i + 1], s1);
s2 = Math.fma(left[i + 2], right[i + 2], s2);
s3 = Math.fma(left[i + 3], right[i + 3], s3);
s4 = Math.fma(left[i + 4], right[i + 4], s4);
s5 = Math.fma(left[i + 5], right[i + 5], s5);
s6 = Math.fma(left[i + 6], right[i + 6], s6);
s7 = Math.fma(left[i + 7], right[i + 7], s7);
}
return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;
}
@Benchmark
public float vanilla() {
float sum = 0f;
for (int i = 0; i < size; ++i) {
sum = Math.fma(left[i], right[i], sum);
}
return sum;
}
Process followed to compile and use the OpenJDK Panama dev vectorIntrinsic branch as showed in this SO question
hg clone http://hg.openjdk.java.net/panama/dev/
cd dev/
hg checkout vectorIntrinsics
hg branch vectorIntrinsics
bash configure
make images
Things I checked why it should have worked.
- lscpu shows all sorts of avx flags.
- I chose HVM AMI which should support AVX instructions sets. https://aws.amazon.com/ec2/instance-types/ says : † AVX, AVX2, and Enhanced Networking are only available on instances launched with HVM AMIs.
- I can compile vector code which means I am using appropriate branch of OpenJDK. I ran my code with --add-modules=jdk.incubator.vector VM parameter. I also added some other VM params like state in [this intel blog](https://software.intel.com/en-us/articles/vector-api-developer-program-for-java) : -XX:TypeProfileLevel=121
- I checked the compiled assembly code it does contain vmulps instructions. Although it was difficult to find them as I am calling methods in vector api code and multiplication is happening at some other inside the called mul/fma method.
- I have done lot more testing with different SIZE like 64, 128, 256, 512 and also using the ``FloatVector.SPECIES_PREFERRED``. In all cases vector api code is significantly slower then the simple multiplication code with unrolling.