
My question is about performance of using AVX instructions versus a naive approach.

I am getting the same — and correct — answer from my AVX approach as I am getting from my naive approach, but the answer takes slightly longer with AVX instructions, so I'd like to know what I'm doing wrong/inefficiently with vectorized code.

This question is a bit too complex to reduce to a self-contained, compilable unit of code, and I'm sorry about that. However, the functional code snippets below are fairly straightforward and decently styled, and hopefully easy enough to follow for the question at hand.

Some environment details:

  • Both approaches are compiled with Clang (Apple LLVM version 8.1.0 (clang-802.0.42)).
  • I am compiling with the -mavx flag.
  • My hardware (MacBook Pro with an Intel Core i7 processor) claims to support AVX instructions.

I have a program where a user provides a multiline text file, each line containing a comma-separated string of numbers, i.e., a list of n-dimensional vectors, where n is arbitrary for the file, but (barring bad input) is the same value n for every line.

For example:

0,4,6,1,2,22,0,2,30,...,39,14,0,3,3,3,1,3,0,3,2,1
0,0,1,1,0,0,0,8,0,1,...,6,0,0,4,0,0,0,0,7,0,8,2,0
...
1,0,1,0,1,0,0,2,0,1,...,2,0,0,0,0,0,2,1,1,0,2,0,0

I generate some statistical scores from comparisons of these vectors, e.g., Pearson correlation, but the score function could be anything, say, something simple, like an arithmetic mean.

Naive approach

Each of these vectors is put into a pointer to a struct called signal_t:

typedef struct signal {
    uint32_t n;
    score_t* data;
    score_t mean;
} signal_t;

The score_t type is just a typedef to float:

typedef float score_t;

To start, I parse the string into float (score_t) values and calculate the arithmetic mean:

signal_t* s = NULL;
s = malloc(sizeof(signal_t));
if (!s) {
    fprintf(stderr, "Error: Could not allocate space for signal pointer!\n");
    exit(EXIT_FAILURE);
}
s->n = 1;
s->data = NULL;
s->mean = NAN;

for (uint32_t idx = 0; idx < strlen(vector_string); idx++) {
    if (vector_string[idx] == ',') {
        s->n++;
    }
}

s->data = malloc(sizeof(*s->data) * s->n);
if (!s->data) {
    fprintf(stderr, "Error: Could not allocate space for signal data pointer!\n");
    exit(EXIT_FAILURE);
}
char* start = vector_string;
char* end = vector_string;
char entry_buf[ENTRY_MAX_LEN];
uint32_t entry_idx = 0;
bool finished_parsing = false;
bool data_contains_nan = false;
do {
    end = strchr(start, ',');
    if (!end) {
        end = vector_string + strlen(vector_string);
        finished_parsing = true;
    }
    memcpy(entry_buf, start, end - start);
    entry_buf[end - start] = '\0';
    sscanf(entry_buf, "%f", &s->data[entry_idx++]);
    if (isnan(s->data[entry_idx - 1])) {
        data_contains_nan = true;
    }
    start = end + 1;
} while (!finished_parsing);

if (!data_contains_nan) {
    s->mean = pt_mean_signal(s->data, s->n);
}

The arithmetic mean is pretty straightforward:

score_t pt_mean_signal(score_t* d, uint32_t len)
{
    score_t s = 0.0f;
    for (uint32_t idx = 0; idx < len; idx++) {
        s += d[idx];
    }
    return s / len;
}

Naive performance

On running this kind of approach on a file of 10k vector-strings, I get a runtime of 6.58s.

AVX approach

I have a modified signal_t struct called signal_avx_t:

typedef struct signal_avx {
    uint32_t n_raw;
    uint32_t n;
    __m256* data;
    score_t mean;
} signal_avx_t;

This stores a pointer to __m256 data. Each __m256 holds eight single-precision float values. For convenience, I define a constant called AVX_FLOAT_N to store this multiple:

#define AVX_FLOAT_N 8

Here is how I am parsing the vector-string and storing it in a __m256. It is very similar to the naive approach, except now I read values eight at a time into a buffer, write the buffer to an __m256, and repeat until there are no more values to write. I then calculate the mean:

signal_avx_t* s = NULL;
s = malloc(sizeof(signal_avx_t));
if (!s) {
    fprintf(stderr, "Error: Could not allocate space for signal_avx pointer!\n");
    exit(EXIT_FAILURE);
}
s->n_raw = 1;
s->n = 0;
s->data = NULL;
s->mean = NAN;

for (uint32_t idx = 0; idx < strlen(vector_string); idx++) {
    if (vector_string[idx] == ',') {
        s->n_raw++;
    }
}

score_t signal_buf[AVX_FLOAT_N];

s->n = (uint32_t) ceil((float)(s->n_raw) / AVX_FLOAT_N);
s->data = malloc(sizeof(*s->data) * s->n);
if (!s->data) {
    fprintf(stderr, "Error: Could not allocate space for signal_avx data pointer!\n");
    exit(EXIT_FAILURE);
}
char* start = vector_string;
char* end = vector_string;
char entry_buf[ENTRY_MAX_LEN];
uint32_t entry_idx = 0;
uint32_t data_idx = 0;
bool finished_parsing = false;
bool data_contains_nan = false;

do {
    end = strchr(start, ',');
    if (!end) {
        end = vector_string + strlen(vector_string);
        finished_parsing = true;
    }
    memcpy(entry_buf, start, end - start);
    entry_buf[end - start] = '\0';
    sscanf(entry_buf, "%f", &signal_buf[entry_idx++ % AVX_FLOAT_N]);
    if (isnan(signal_buf[(entry_idx - 1) % AVX_FLOAT_N])) {
        data_contains_nan = true;
    }
    start = end + 1;

    /* I write every eight floats to an __m256 chunk of memory */
    if (entry_idx % AVX_FLOAT_N == 0) {
        s->data[data_idx++] = _mm256_setr_ps(signal_buf[0],
                                             signal_buf[1],
                                             signal_buf[2],
                                             signal_buf[3],
                                             signal_buf[4],
                                             signal_buf[5],
                                             signal_buf[6],
                                             signal_buf[7]);
    }
} while (!finished_parsing);

if (!data_contains_nan) {
    /* write any leftover floats to the last `__m256` */
    if (entry_idx % AVX_FLOAT_N != 0) {
        for (uint32_t idx = entry_idx % AVX_FLOAT_N; idx < AVX_FLOAT_N; idx++) {
            signal_buf[idx] = 0;
        }
        s->data[data_idx++] = _mm256_setr_ps(signal_buf[0],
                                             signal_buf[1],
                                             signal_buf[2],
                                             signal_buf[3],
                                             signal_buf[4],
                                             signal_buf[5],
                                             signal_buf[6],
                                             signal_buf[7]);
    }
    s->mean = pt_mean_signal_avx(s->data, s->n, s->n_raw);
}

AVX mean function

Here is the function I wrote to generate the arithmetic mean:

score_t pt_mean_signal_avx(__m256* d, uint32_t len, uint32_t len_raw)
{
    score_t s = 0.0f;
    /* initialize a zero-value vector to collect summed value */
    __m256 v_sum = _mm256_setzero_ps();
    /* add data to collector */
    for (uint32_t idx = 0; idx < len; idx++) {
        v_sum = _mm256_add_ps(v_sum, d[idx]);
    }
    /* sum the collector values */
    score_t* res = (score_t*)&v_sum;
    for (uint32_t idx = 0; idx < AVX_FLOAT_N; idx++) {
        s += res[idx];
    }
    return s / len_raw;
}

AVX performance

On running the AVX-based approach on a file of 10k vector-strings, I get a runtime of 6.86s, about 5% slower. This difference is roughly constant, regardless of the size of the input.

Summary

My expectation was that, by using AVX instructions and by vectorizing loops, I'd get a speed bump, not that the performance would be marginally worse.

Is there anything in the code snippets that suggests misuse of the __m256 datatype and related intrinsic functions, for the purposes of calculating a basic summary statistic?

Mainly, I'd like to figure out what I'm doing wrong here, before progressing to more complicated scoring functions between larger datasets. Thanks for any constructive advice!

Alex Reynolds
  • I know this is a hard thing to do, but if you prepare an actually compilable piece of C including a `main()`, I'd be happy to run a bit of profiling and see where time is actually spent. – Marcus Müller Jun 26 '17 at 22:28
  • (tomorrow, though, heading out for a night of sleep) – Marcus Müller Jun 26 '17 at 22:34
  • Hi Marcus, thanks for your offer to review the code. I posted a GitHub project here that is compilable under CentOS 7 (gcc 4.8.5) and Mac OS X 10.12 (clang via Xcode). You could just clone it and then `make all` to run the tests: https://github.com/alexpreynolds/simd-pearson-test – Alex Reynolds Jun 27 '17 at 00:18
  • Clang probably auto-vectorized your scalar version, at least if you used `-ffast-math` and `-O2` or higher. It may have done a better job than your manually-vectorized code. (e.g. clang normally unrolls loops using multiple vector accumulators, which in this case hides the latency of FP add). Look at the asm for the hot loop for both versions. – Peter Cordes Jun 30 '17 at 04:32

2 Answers


First of all, I hope we agree that parsing text into floats is probably a lot more CPU-intensive than computing an arithmetic mean, to say nothing of reading the data from a file on physical storage. If you're going to benchmark, you should definitely omit the reading and parsing.

What seems to be the main problem here is that you're trying to be clever and vectorize while reading. What you're doing in reality is an unnecessary copy of your data from signal_buf to s.

You've got to realize that __m256 isn't really a memory data type. It's just a type that tells the compiler the memory addresses and registers you're using are capable of holding 256-bit values.

So, just take your signal_buf, load it into a SIMD register with _mm256_load_ps, and then do your AVX magic on it; or sequentially fill s directly with sscanf and then do the same _mm256_load_ps magic.
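
For illustration, a minimal sketch of the first option, reusing the names from the question's parsing loop (illustrative, not the posted code):

/* after every eighth parsed float, one load replaces the
   eight-argument _mm256_setr_ps call */
if (entry_idx % AVX_FLOAT_N == 0) {
    /* _mm256_loadu_ps tolerates unaligned addresses; use
       _mm256_load_ps only if signal_buf is 32-byte aligned */
    s->data[data_idx++] = _mm256_loadu_ps(signal_buf);
}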

I really don't see why you're using setr. Why would you need to reverse the order of elements for an arithmetic mean? Or was this your "poor man's load instruction"?

Again, your floating-point math, especially if you write code that your compiler might even be able to auto-vectorize, is not what's costing time here. It's the parsing of strings.

VOLK (Vector Optimized Library of Kernels) has a lot of hand-written SIMD kernels, including one that accumulates arrays of floats:

https://github.com/gnuradio/volk/blob/master/kernels/volk/volk_32f_accumulator_s32f.h

The AVX code looks like this:

static inline void
volk_32f_accumulator_s32f_a_avx(float* result, const float* inputBuffer, unsigned int num_points)
{
  float returnValue = 0;
  unsigned int number = 0;
  const unsigned int eighthPoints = num_points / 8;

  const float* aPtr = inputBuffer;
  __VOLK_ATTR_ALIGNED(32) float tempBuffer[8];

  __m256 accumulator = _mm256_setzero_ps();
  __m256 aVal = _mm256_setzero_ps();

  for(;number < eighthPoints; number++){
    aVal = _mm256_load_ps(aPtr);
    accumulator = _mm256_add_ps(accumulator, aVal);
    aPtr += 8;
  }

  _mm256_store_ps(tempBuffer, accumulator);

  returnValue = tempBuffer[0];
  returnValue += tempBuffer[1];
  returnValue += tempBuffer[2];
  returnValue += tempBuffer[3];
  returnValue += tempBuffer[4];
  returnValue += tempBuffer[5];
  returnValue += tempBuffer[6];
  returnValue += tempBuffer[7];

  number = eighthPoints * 8;
  for(;number < num_points; number++){
    returnValue += (*aPtr++);
  }
  *result = returnValue;
}

What it does is keep an eight-element accumulator, to which it continuously adds sets of eight new elements (element-wise), and then, at the end, returns the sum across those eight accumulator lanes.

Marcus Müller
  • Correlation or other parametric pairwise scoring functions can depend on elements in correct order to get the correct score value, hence the use of setr to order floats correctly. – Alex Reynolds Jun 26 '17 at 21:58
  • @AlexReynolds but again, in that case, do the shuffling while you're doing non-aligned/cache intense stuff (parsing and then storing the values in an array), not in an extraneous shuffle. The trick is always to stay as linear in memory as possible – and do all the shuffling in one place, if inevitable. – Marcus Müller Jun 26 '17 at 22:27
  • Hi Marcus, I added a comment to your comment to my question. I'm entirely new to programming with these intrinsics, so beyond my reading online, I appreciate any advice you might have about the code. – Alex Reynolds Jun 27 '17 at 00:24
  • @AlexReynolds: You appear to be using `_mm256_setr_ps(d[0], d[1], ...)` to create a vector where the elements are in the same order they are in memory. This is exactly what you get from `_mm256_load_ps(d)`. If you're lucky, the compiler would see through the `setr` and still emit a single load instruction. (Marcus: many people use `setr` because it takes args in "little endian" order, with the first arg going to element 0, etc., just like an array initializer. `_mm_set*` takes args from high to low element, which matches how left/right shifts move data within a vector.) – Peter Cordes Jun 30 '17 at 04:40
  • If Alex's data is really comma-separated integers of limited length, parsing it with SIMD instead of `scanf` or `atof()` function calls is a possibility and could give big speedups. See the end of my answer, but it's really complicated and definitely not a good first project for a SIMD novice. – Peter Cordes Jun 30 '17 at 05:36
  • Also, the horizontal sum part of your answer would be more efficient if you did it with shuffles instead of storing to an array. My [SSE horizontal-sum answer](https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86) has an AVX section. – Peter Cordes Jun 30 '17 at 05:38
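
For reference, a rough sketch of that shuffle-based horizontal sum, applied to the question's v_sum (an illustration following the linked answer, not code from either post):

__m128 hi4  = _mm256_extractf128_ps(v_sum, 1);              /* upper 4 floats */
__m128 lo4  = _mm256_castps256_ps128(v_sum);                /* lower 4 floats */
__m128 sum4 = _mm_add_ps(lo4, hi4);                         /* 8 -> 4 */
__m128 shuf = _mm_movehdup_ps(sum4);                        /* duplicate odd elements */
__m128 sum2 = _mm_add_ps(sum4, shuf);                       /* 4 -> 2 */
__m128 sum1 = _mm_add_ss(sum2, _mm_movehl_ps(shuf, sum2));  /* 2 -> 1 */
score_t s   = _mm_cvtss_f32(sum1);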

There's a lot of inefficiency outside of the vectorized part (which @Marcus addressed in his answer).

Don't dynamically allocate space for signal_t* s. It's a very small fixed-size struct and you only need one, so just use signal_t s (automatic storage) and remove a level of indirection.
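
For example, a minimal sketch of what that looks like:

signal_t s = { .n = 1, .data = NULL, .mean = NAN };  /* no malloc; only s.data ever needs freeing */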


It would probably also be better not to scan the whole string for , before doing any conversion, since if the string doesn't fit in L1 cache (32k) then you're losing out on data reuse when you convert it.

If you can't just sum on the fly (as you can for a mean), then allocate a big buffer to put converted data in. If you get to the end of the string without filling the buffer, then that's good: realloc it down to size. (Memory pages you allocated but never touched are basically free on most OSes.) If you fill up your initial buffer before reaching the end of the string, grow it by a factor of two with realloc. (Exponential size increases amortize to O(1) average cost for appending an element, which is why C++'s std::vector works this way. So do hash tables.)
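
A minimal sketch of that growth strategy (buf, cap, count, and value are illustrative names, not from the posted code):

/* append one converted value, doubling the capacity when full */
if (count == cap) {
    cap = cap ? cap * 2 : 1024;
    score_t* tmp = realloc(buf, cap * sizeof *buf);
    if (!tmp) {
        fprintf(stderr, "Error: Could not grow data buffer!\n");
        exit(EXIT_FAILURE);
    }
    buf = tmp;
}
buf[count++] = value;

/* after parsing: trim the buffer down to its final size */
buf = realloc(buf, count * sizeof *buf);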

If you know the full length of the string (e.g. from the file size), you can use that to produce an estimate of how big a buffer you'll need. (e.g. assume that each float is 2 bytes long, including the ',', since they'll be at least that long and it's fine to over-allocate within reason.)
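
A sketch of that estimate (file_size is assumed to be known, e.g. from fstat):

size_t estimate = file_size / 2 + 1;  /* every value is >= 2 bytes counting its ',' */
score_t* buf = malloc(estimate * sizeof *buf);
/* ...and grow with realloc as above if the estimate ever proves too small */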


If you really want to count commas before conversion, you can vectorize it: _mm_cmpeq_epi8 finds commas in a vector of string data, and _mm_sub_epi8 accumulates the resulting vectors of 0/-1 into per-byte counts (subtracting -1 adds 1). Use _mm_sad_epu8 to horizontally sum the 8-bit counts into 64-bit elements at least every 255 vectors to avoid overflow.
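
A sketch of that comma count using 128-bit vectors (SSE2, from <emmintrin.h>; count_commas is an illustrative name):

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

uint64_t count_commas(const char* p, size_t len)
{
    const __m128i commas = _mm_set1_epi8(',');
    __m128i total = _mm_setzero_si128();     /* two 64-bit counters */
    size_t i = 0;
    while (i + 16 <= len) {
        __m128i byte_counts = _mm_setzero_si128();
        /* at most 255 vectors per block so the 8-bit counts can't overflow */
        size_t block_end = i + 16 * 255;
        for (; i + 16 <= len && i < block_end; i += 16) {
            __m128i chunk    = _mm_loadu_si128((const __m128i*)(p + i));
            __m128i is_comma = _mm_cmpeq_epi8(chunk, commas);    /* 0xFF per match */
            byte_counts = _mm_sub_epi8(byte_counts, is_comma);   /* -(-1) == +1 */
        }
        /* widen the 8-bit counts into two 64-bit sums and accumulate */
        total = _mm_add_epi64(total, _mm_sad_epu8(byte_counts, _mm_setzero_si128()));
    }
    uint64_t n = (uint64_t)_mm_cvtsi128_si64(total)
               + (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total));
    for (; i < len; i++)   /* scalar tail for the last partial vector */
        n += (p[i] == ',');
    return n;
}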


If your data is really like your simple example, where every number is a 1-digit integer, you can do much better than converting to float with scanf. For example, you can use integer SIMD to turn ASCII digits into integers from 0 to 9.

// if we don't know the string length ahead of time, we could look
// for a '\0' on the fly with _mm256_cmpeq_epi8 / _mm256_movemask_epi8
uint64_t digitstring_sum(const char*p, size_t len)
{
    const char *endp = p+len - 31;  // up to the last full-vector of string data

    __m256i sum = _mm256_setzero_si256();
    for ( ; p < endp ; p+=32 ) {
        __m256i stringdata = _mm256_loadu_si256((const __m256i*)p);
        __m256i integers = _mm256_sub_epi16(stringdata, _mm256_set1_epi16( '0'+(','<<8) ));  // turn "1,2,3,..." into 0x0100 0200 0300...

        // horizontal sum the 8-bit elements into 64-bit elements
        // There are various ways to optimize this by doing this part less frequently, but still often enough to avoid overflow.  Or doing it a different way.
        __m256i hsum = _mm256_sad_epu8(integers, _mm256_setzero_si256());  // sum(x[0] - 0, x[1] - 0, ...) = sum (x[...])

        sum = _mm256_add_epi64(sum, hsum);
    }
    // sum holds 4x 64-bit accumulators.  Horizontal sum that:
    // (this is probably more efficient than storing to memory and looping, but just barely for a vector of only 4 elements)
    __m128i hi = _mm256_extracti128_si256(sum, 1);
    __m128i lo = _mm256_castsi256_si128(sum);
    __m128i t1 = _mm_add_epi64(lo, hi);
    __m128i t2 = _mm_unpackhi_epi64(t1,t1);  // copy high 64-bit element to low
    __m128i t3 = _mm_add_epi64(t1, t2);
    uint64_t scalar_sum = (uint64_t)_mm_cvtsi128_si64(t3);

    // Then a cleanup loop to handle the last partial-vector of string data.
    // or do it with an unaligned vector and some masking...

    return scalar_sum;
}

Some of your examples have multi-digit numbers, but still only integers. You can parse them into vectors of integers in groups of 4 by using a vector compare to find the positions of the commas, and using that bitmap as an integer index into a lookup-table of shuffle-control vectors.

This gets really complicated, but see @stgatilov's answer about parsing a dotted-quad IPv4 address string into a 32-bit integer using this technique. Since the 256-bit vpshufb (_mm256_shuffle_epi8) operates in two separate 128-bit lanes, you're probably best off using only 128-bit vectors.

You'd want to do that in a loop, moving through the string 4 integers at a time. Given a compare-mask result as an integer, you can find the position of the 5th comma by stripping off the lowest 4 set bits and then using a bit-scan instruction / intrinsic. Stripping off the lowest 4 set bits can be done with BMI1 _blsr_u32 4 times. (It does dst = (a - 1) & a with a single instruction.)
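
A sketch of that bit trick (assumes BMI1 is available; fifth_comma_pos is an illustrative name):

#include <immintrin.h>

/* mask has one bit set per comma byte found by the vector compare;
   returns the byte position of the 5th comma */
static inline unsigned fifth_comma_pos(unsigned mask)
{
    mask = _blsr_u32(mask);   /* clear lowest set bit: drop the 1st comma */
    mask = _blsr_u32(mask);   /* 2nd */
    mask = _blsr_u32(mask);   /* 3rd */
    mask = _blsr_u32(mask);   /* 4th */
    return _tzcnt_u32(mask);  /* bit-scan: index of the 5th comma */
}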

Or since you need a LUT for shuffle-control vectors anyway, you could use some of the don't-care bits in the LUT entries to hold a byte-count.

Peter Cordes