AVX2 1GB long array

Question

I have a 1 GB array of floats in a .bin file. After I read it, how can I sum the elements with AVX2 instructions and print the result?

I edited my code following Jake 'Alquimista' LEE's answer. The problem is that the result is much smaller than it should be. A second question: how can I add a constant to each number that I read from the .bin file?

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

inline float sumf(const float *pSrc, uint32_t len)
{
    __m256 sum, in;
    float sumr;
    uint32_t sumi;
    uint32_t lenr = len & 7;
    while (len--)
    len >>= 3;
    sum = _mm256_set1_ps(0.0f);
    {
        in = _mm256_loadu_ps(pSrc++);
        sum = _mm256_add_ps(in, sum);
    }

    sum = _mm256_hadd_ps(sum, in);
    sum = _mm256_hadd_ps(sum, in);
    sum = _mm256_hadd_ps(sum, in);
    sumi = _mm256_extract_epi32(*(__m256i *)&sum, 0);
    sumr = *(float *)&sumi;

    while (lenr--)
    {
        sumr += *pSrc++;
    }

    return sumr;
}


int main(void)
{

        FILE *file;

        float *buffer2;
        uint32_t fileLen;

        if((file = fopen("example.bin","rb"))==NULL)
        {
                printf("Error! opening file");
                exit(1);
        }


        fseek(file, 0, SEEK_END);
        fileLen=ftell(file);
        fseek(file, 0, SEEK_SET);
    buffer2=(float *)malloc(fileLen+1);
        if (!buffer2)
        {
                fprintf(stderr, "Memory error!");
                                fclose(file);
                return 0;
        }


        fread(buffer2, fileLen, 1, file);
        fclose(file);
        printf("File size : %lu bytes\n", (unsigned long)fileLen);
        for(int i = 0; i<10; i++)
        printf("%f \n", buffer2[i]);

    float sum =sumf(buffer2,fileLen);
        printf("%f\n", sum);
        free(buffer2);
        return 0;
}

You don't need AVX2 for that. IO will be a bottleneck for you. Just pipeline IO and simple summation code and you'll be fine. — Elalfer, Nov 04 '17 at 00:36
Consider reading the file as chunks. Grouping floats as pairs of `__m256` floats (Probably using casts or using vector load intrinsics) then performing addition with `_mm256_add_ps(a, b)` on the two vectors. I could help with a piece of code later. — Karim Manaouil, Nov 04 '17 at 01:00
What's the point in adding a constant to each element in the array? You could just add `const * len` to the final result. — Jake 'Alquimista' LEE, Nov 04 '17 at 16:17
By the way, you should have left your original question as it was. — Jake 'Alquimista' LEE, Nov 04 '17 at 16:18
@RafaNadal95 I have added an answer. Check it out and tell me what was the result. — Karim Manaouil, Nov 04 '17 at 23:08
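
On the second part of the question (adding a constant to every element), here is a minimal sketch. The helper name `add_constant` is hypothetical and does not come from the thread, and the `target("avx")` attribute is a GCC/Clang extension used here so the function builds without extra compiler flags; the CPU still has to support AVX. If only the total is needed, the shortcut from the comment above (add `const * len` to the final sum) avoids touching the array at all.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): adds c to every element in place.
 * The target attribute is a GCC/Clang extension; it lets this function use
 * AVX intrinsics even when -mavx is not given on the command line. */
__attribute__((target("avx")))
void add_constant(float *data, size_t n, float c)
{
    const __m256 vc = _mm256_set1_ps(c);    /* c broadcast to all 8 lanes */
    size_t i = 0;

    /* Main loop: 8 floats per iteration, unaligned loads and stores. */
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);
        _mm256_storeu_ps(data + i, _mm256_add_ps(v, vc));
    }

    /* Scalar tail: the fewer than 8 elements that are left over. */
    for (; i < n; i++)
        data[i] += c;
}
```

Calling `add_constant(buffer2, count, 1.5f)` before the summation would give the same total, up to rounding, as summing first and adding `1.5f * count` afterwards.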

Jake 'Alquimista' LEE · Answer 1 · 2017-11-07T07:55:00.373

0

static inline float sumf(const float *pSrc, uint32_t len)
{
    __m256 sum, in;
    float sumr;
    uint32_t lenr = len & 7;    /* leftover elements after the 8-wide loop */
    len >>= 3;                  /* number of full 8-float vectors */
    sum = _mm256_setzero_ps();
    while (len--)
    {
        in = _mm256_loadu_ps(pSrc);
        pSrc += 8;              /* advance by a whole vector, not by one float */
        sum = _mm256_add_ps(in, sum);
    }
    /* swap the two 128-bit lanes (0x4E == 0b01001110) so hadd can combine across them */
    in = _mm256_castpd_ps(_mm256_permute4x64_pd(_mm256_castps_pd(sum), 0x4E));
    sum = _mm256_hadd_ps(sum, in);
    sum = _mm256_hadd_ps(sum, in);
    sum = _mm256_hadd_ps(sum, in);
    sumr = _mm_cvtss_f32(_mm256_castps256_ps128(sum));  /* element 0 holds the total */

    while (lenr--)
    {
        sumr += *pSrc++;
    }

    return sumr;
}

The function above will do. However, I don't think it will bring much of a performance gain, if any, since it's a very trivial loop and the compiler will auto-vectorize it anyway.

Please note that you have to cast the pointer to float * and divide fileLen by sizeof(float) when you pass them as arguments.
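
To illustrate the note about the length argument, here is a rough sketch of reading the file so that the value passed on is an element count and not a byte count. The helper `read_floats` is a hypothetical name for illustration and is not part of the original code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads a whole file of raw floats.
 * On success it returns a malloc'd buffer and stores the number of
 * ELEMENTS (not bytes) in *count. Returns NULL on any failure. */
float *read_floats(FILE *fp, size_t *count)
{
    if (fseek(fp, 0, SEEK_END) != 0)
        return NULL;
    long bytes = ftell(fp);
    if (bytes < 0 || fseek(fp, 0, SEEK_SET) != 0)
        return NULL;

    size_t n = (size_t)bytes / sizeof(float);    /* bytes -> number of floats */
    float *buf = malloc(n ? n * sizeof(float) : 1);
    if (!buf)
        return NULL;

    /* fread counts items of sizeof(float) bytes, so it must return n. */
    if (fread(buf, sizeof(float), n, fp) != n) {
        free(buf);
        return NULL;
    }

    *count = n;
    return buf;
}
```

With this, the call becomes something like `sumf(buf, (uint32_t)n)`, where `n` is the count filled in by `read_floats`.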

edited Nov 07 '17 at 07:55

answered Nov 04 '17 at 08:50

Jake 'Alquimista' LEE


Actually compilers can't generally auto-vectorize the typical sum loop, because floating point addition is not associative, so vectorization would produce a different result than the scalar loop. You can use `-Ofast` to tell the compiler "I don't care, vectorize it anyway" and then [it will](https://godbolt.org/g/JDtMcj). – BeeOnRope Nov 04 '17 at 09:13
sum = _mm256_hadd_ps(sum, in); Why is this done three times? – RafaNadal95 Nov 04 '17 at 13:02
@RafaNadal95 You are about to add eight floats. You need to add the pairs three times (8 - 4 - 2 - 1) and then extract the lowest element. – Jake 'Alquimista' LEE Nov 04 '17 at 13:41
@Jake 'Alquimista' LEE I edited my question with your help. I hope you can help again. – RafaNadal95 Nov 04 '17 at 15:44
Your horizontal sum is not quite right - look carefully at how [`_mm256_hadd_ps`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=mm256_hadd_ps&expand=2776) works. Like a lot of AVX instructions it operates on two 128 bit lanes rather than being a full 256 bit operation. You probably just want to invoke it twice and then extract elements 0 and 4. – Paul R Nov 06 '17 at 22:06
@PaulR Thank you for pointing it out. Now I remember being riddled by Intel's instruction design while looking at the reference sheet. I'll amend it accordingly. (adding a permutation in prior to the summing sequence) – Jake 'Alquimista' LEE Nov 07 '17 at 07:45
No problem - it’s a major PITA that the vast majority of AVX instructions are really just 2 x 128 bit instructions stitched together. See [this answer](https://stackoverflow.com/a/23189942/253056) for one way to do a full 256 bit hadd_ps. – Paul R Nov 07 '17 at 07:56
@PaulR In addition, it's so annoying to beg the compiler for generating the machine codes I want. That's why I prefer assembly over any compiler on ARM, especially when I write NEON codes. – Jake 'Alquimista' LEE Nov 07 '17 at 08:01
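
As an illustration of the lane issue discussed in these comments, one common way to reduce a 256-bit vector is to add its two 128-bit halves first and then finish with 128-bit operations. This is a sketch and not necessarily the code from the linked answer; the helper name `hsum8` is hypothetical, and the `target("avx")` attribute is a GCC/Clang extension.

```c
#include <immintrin.h>

/* Hypothetical helper: sums the 8 floats at v with a lane-aware reduction.
 * The target attribute (GCC/Clang extension) enables AVX for this function
 * only; the CPU must still support AVX at run time. */
__attribute__((target("avx")))
float hsum8(const float *v)
{
    __m256 x  = _mm256_loadu_ps(v);
    __m128 lo = _mm256_castps256_ps128(x);      /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(x, 1);    /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);             /* 4 partial sums */
    s = _mm_hadd_ps(s, s);                      /* 2 partial sums, duplicated */
    s = _mm_hadd_ps(s, s);                      /* total in every element */
    return _mm_cvtss_f32(s);                    /* read element 0 */
}
```

Because the halves are combined explicitly, no cross-lane permute is needed and only two horizontal adds remain.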

score 0 · Answer 2 · answered Nov 04 '17 at 15:48


Here's (most likely) your bug:

while (len--)
len >>= 3;

That's a while loop. As long as len != 0, you replace len with (len - 1) >> 3. And then you change it to -1. No loop to be seen.
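
For comparison, here is a scalar sketch of the intended control flow, with the shift as its own statement before the loop. The function name `sum_in_chunks` is made up for illustration, and the inner for loop merely stands in for the 8-wide vector add.

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for the vector loop: same control flow,
 * but each group of 8 floats is added one element at a time. */
float sum_in_chunks(const float *src, uint32_t len)
{
    uint32_t tail = len & 7;    /* elements left after the full groups */
    float total = 0.0f;

    len >>= 3;                  /* runs once, BEFORE the loop: number of groups */
    while (len--)               /* one iteration per group of 8 floats */
    {
        for (int i = 0; i < 8; i++)
            total += src[i];
        src += 8;               /* step over the whole group */
    }

    while (tail--)              /* remaining 0..7 elements */
        total += *src++;

    return total;
}
```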

answered Nov 04 '17 at 15:48

gnasher729


Oops, how did this happen? I must have pasted at the wrong position. Thanks for pointing it out. – Jake 'Alquimista' LEE Nov 04 '17 at 16:15

score 0 · Accepted Answer · answered Nov 04 '17 at 23:06

Reading a 1 GB file into memory incurs a large memory and I/O overhead. Although I'm not very familiar with AVX2, I read some articles online and came up with the following solution, which I have tested and found to work.

My solution reads the file in chunks of 512 bytes (blocks of 128 floats) and sums the vectors pairwise (16 vectors per block), so that at the end we have a single __m256 vector of partial sums. Casting it to a float * lets us add up its eight components to get the final result.

The case where the file size is not a multiple of 128 floats is handled in the last for loop, which sums the remaining floats individually.

The code is commented, but if you have suggestions for adding more explanation to the answer, feel free to do so.

#include <immintrin.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int     make_floatf(char *, int);
float   avx_sfadd(char*);

/* Prints the errno message, closes fp only if it was opened, and returns -1. */
#define PERROR()                                    \
    do {                                            \
        printf("Error: %s\n", strerror(errno));     \
        if(fp)                                      \
            fclose(fp);                             \
        return -1;                                  \
    } while(0)

/* This function generates a .bin file containing blocks 
 *   of 128 floating point numbers
 */
int make_floatf(char *filename, int nblocks)
{
    FILE *fp = NULL;

    if(!(fp = fopen(filename, "wb+")))
        PERROR();

    float *block_ptr = malloc(sizeof(float) * 128);  /* 512 Bytes block of 128 floats */
    if(!block_ptr)
        PERROR();

    int j, i;

    for(j = 0; j < nblocks; j++)
    {
        for(i = 0; i < 128; i++)
            block_ptr[i] = 1.0;

        int ret = fwrite(block_ptr, sizeof(float), 128, fp);
        if(ret < 128)
        {
            free(block_ptr);
            PERROR();
        }
    }

    free(block_ptr);
    fclose(fp); 

    return 0;
}

/* This function reads the .bin file in chunks of 512-byte
 * blocks (128 floating point numbers) and calculates their sum.
 * The final sum, held as a vector, is looped through and its 
 * components are summed up to get the final result.
 */
float avx_sfadd(char *filename)
{
    FILE *fp = NULL;

    __m256  v1;
    __m256  v2;
    __m256  sum = _mm256_setzero_ps();

    if(!(fp = fopen(filename, "rb")))
       PERROR();

    struct stat stat_buf;
    stat(filename, &stat_buf);

    size_t fsize     = stat_buf.st_size;
    size_t nblocks   = fsize / (sizeof(float) * 128); 
    size_t rem_size  = fsize - nblocks * sizeof(float) * 128;
    size_t rem_floats = rem_size / (sizeof(float));

    printf("File size: %zu\nnblocks: %zu\nnremfloats: %zu\n",
            fsize, nblocks, rem_floats); 

    /* This memory area will hold the 128 floating point numbers per block */
    float *block_ptr = malloc(sizeof(float) * 128);
    if(!block_ptr)
        PERROR();

    int i;
    for(i = 0; i < nblocks; i++)
    {
        int ret = fread(block_ptr, sizeof(float), 128, fp);
        if(ret < 128)
            PERROR();   

        /* Summing up vectors in a block of 16 vectors (128 floats) */
        int j;
        for(j = 0; j < 16; j += 2)
        {
            v1 = _mm256_loadu_ps(block_ptr + j*8);
            v2 = _mm256_loadu_ps(block_ptr + (j+1)*8);

            /* use the intrinsic: += on __m256 is a compiler extension */
            sum = _mm256_add_ps(sum, _mm256_add_ps(v1, v2));
        } 
    }

    /* Handling the case if the last chunk of the file doesn't make 
     * a complete block.
     */
    float rem_sum = 0;
    if(rem_size > 0)
    {
        /* fread returns the number of items read, so read in units of floats */
        size_t ret = fread(block_ptr, sizeof(float), rem_floats, fp);
        if(ret < rem_floats)
            PERROR();

        int j;
        for(j = 0; j < rem_floats; j++)
            rem_sum += block_ptr[j];
    }

    float final_sum = rem_sum;
    float *sum_ptr = (float*)&sum; /* The final vector hold the sum of all vectors */

    /* Summing up the values of the last vector to get the final result */
    int k;
    for(k = 0; k < 8; k++)
        final_sum += sum_ptr[k];

    free(block_ptr);
    fclose(fp);

    return final_sum;
}


int main(int argc, char **argv)
{
    if(argc < 2){
        puts("./main filename [nblocks]");
        return 0;
    }

    /* ./main filename number_of_block_to_create (eg. ./main floats.bin 1024 )*/
    else if(argc == 3){

        if(!make_floatf(argv[1], atoi(argv[2])))
            puts("File has been created successfully\n");
    }

    /* ./main filename (eg. ./main floats.bin) to calculate sum*/
    else 
        printf("avx_sum = %f\n", avx_sfadd(argv[1]));


    return 0;
}
