
I'm learning how to use SIMD instructions in C and I want to compare code using SIMD to code without. Does anyone have a testing template that accurately measures the expected speedup of SIMD code versus standard code?

Specifically I've noticed the following approximate performance times with specific configurations:

SIMD first, single run:
SIMD: 0.15 s
standard: 0.35 s

SIMD first, standard second, repeated 10x:
SIMD: 0.15 s - first run, 0.05 s on subsequent runs
standard: 0.35 s - first run, 0.34 s on subsequent runs

standard first, SIMD second, repeated 10x:
standard: 0.45 s - first run, 0.35 s on subsequent runs
SIMD: 0.05 s - first run, 0.05 s on subsequent runs

The code runs through a dataset of 1e8 values of type uint16_t. Data allocation and initialization happen outside the repetition loop. If I instead allocate the data inside the repetition loop, all the loops show the same timing. If I allocate before both the SIMD and standard sections, rather than just before whichever comes first, I get the larger times for both:

standard: 0.45 s
SIMD: 0.15 s

So why is data allocation causing such time differences? What is the real speedup?

Link to code: https://gist.github.com/JimHokanson/55ce2e5cac75d7df6dc24dadf383e68f

I'm testing on an Early 2016 MacBook with an m3 processor ...

Jimbo
  • So, it turns out most of my issue was with the use of calloc! I'm not sure of the source but I'm pretty sure I've seen somewhere that the operating system can do some pretty fancy things with calloc. If I run through the loop and assign all values to 0 the difference in timing goes back to 0.35s for the standard approach and 0.05 seconds for the SIMD approach. I think this is representative of the real world use case where my array has been initialized (completely, i.e. every value explicitly set). – Jimbo Nov 11 '17 at 03:28
  • **Update:** This also occurs for malloc. So is this something with the operating system or some strange caching effect with the processor? – Jimbo Nov 11 '17 at 03:28
    Without seeing how you're testing all I can do is guess, but it sounds like you were testing your allocation code as well as whatever operation you were trying to speed up. It's also possible that even if you weren't, the memory you allocated had not been touched so accessing it caused a page fault as it was swapped in or mapped. It would be best to include your actual testing code for a real answer. – Retired Ninja Nov 11 '17 at 03:32
  • I posted a gist online. I increased the memory allocation even higher (1e9 samples) and the same trend holds. – Jimbo Nov 11 '17 at 04:03
If you malloc or calloc memory and only read it, all the pages can be copy-on-write mapped to the same physical zero page, so you get L1D cache hits. Your question is not very clear about your access patterns, or about what hardware you're testing on. – Peter Cordes Nov 11 '17 at 04:03

1 Answer


So it appears the issue was simply a failure to actually initialize the memory as expected. I had thought it might be something specific to SIMD testing, but it applies to benchmarking C code in general.

So the proper approach for memory initialization is something like the following:

uint16_t *data = malloc(1e8 * sizeof *data);
//- Loop over the data to initialize every value explicitly
//  (previously memset to 0, but it was suggested that this may be optimized away)
//- Do the SIMD comparison vs. standard approach - loop and average the results
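As a concrete sketch of that initialization step (assuming an allocate-and-fill helper I've named `alloc_and_init`, which is not in the original gist): fill with a non-zero pattern, so the compiler cannot legally replace the work with calloc-style zeroed pages.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n uint16_t values and explicitly write every element.
   The multiplier is an arbitrary constant (Knuth's 2654435761) just to
   produce a non-zero, non-trivial fill pattern. */
static uint16_t *alloc_and_init(size_t n) {
    uint16_t *data = malloc(n * sizeof *data);
    if (!data) return NULL;
    for (size_t i = 0; i < n; i++)
        data[i] = (uint16_t)(i * 2654435761u + 1u);  /* never all-zero */
    return data;
}
```

After this, every page of the array is really mapped and written, so the first timed pass no longer pays page-fault or copy-on-write costs.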

Optimization Settings: In addition, remember to enable compiler optimizations (e.g. -O2) when trying to compete against standard-library assembly code! See: why is strchr twice as fast as my simd code The basic gist is that I was comparing my SIMD code to standard-library functions backed by very well-optimized assembly. Without optimization my SIMD code was too slow; after enabling optimization the results were much more reasonable.

Over-Optimization: Sometimes the compiler optimizes code away in one case but not in the other. For example, I had the following code:

for (size_t n2 = 0; n2 < n_loops_inner; n2++) {
    str2 = memchr(str, 'b', N);
    char_index2 = str2 - str;
}

However, this code executed far too quickly; the search was being optimized away. I added the following line before the search inside the loop:

  str[(size_t)char_position] = 'b';

Additionally, I marked char_index2 as volatile. Together these changes produced a more realistic execution time (roughly 1000x slower than without them).
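Putting both fixes together, a benchmark loop along these lines keeps the call alive (a sketch; `bench_memchr` and `sink` are my names, not from the original code):

```c
#include <string.h>
#include <stddef.h>

/* volatile sink: every result must actually be stored, so the compiler
   cannot delete or hoist the memchr call out of the loop. */
static volatile ptrdiff_t sink;

/* str must be n bytes of 'a'. Each iteration writes the target byte at
   a different position, searches for it, records the result, and then
   restores the buffer for the next iteration. */
static void bench_memchr(char *str, size_t n, size_t n_loops) {
    for (size_t i = 0; i < n_loops; i++) {
        size_t pos = i % n;          /* move the match so each search differs */
        str[pos] = 'b';
        const char *hit = memchr(str, 'b', n);
        sink = hit - str;            /* observable side effect */
        str[pos] = 'a';              /* restore the buffer */
    }
}
```

Writing into the buffer each iteration also prevents the compiler from proving the search result is constant across iterations.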

Jimbo
  • gcc will optimize `malloc` + `memset(0)` into `calloc`. If you're doing integer stuff, use `memset` with something other than zero. – Peter Cordes Nov 11 '17 at 17:04