124

Assuming that we have a T myarray[100] with T = int, unsigned int, long long int or unsigned long long int, what is the fastest way to reset all its content to zero (not only for initialization but to reset the content several times in my program)? Maybe with memset?

Same question for a dynamic array like T *myarray = new T[100].

Matteo Italia
  • 123,740
  • 17
  • 206
  • 299
Vincent
  • 57,703
  • 61
  • 205
  • 388

7 Answers7

214

memset (from <string.h>) is probably the fastest standard way, since it's usually a routine written directly in assembly and optimized by hand.

memset(myarray, 0, sizeof(myarray)); // for automatically-allocated arrays
memset(myarray, 0, N*sizeof(*myarray)); // for heap-allocated arrays, where N is the number of elements

By the way, in C++ the idiomatic way would be to use std::fill (from <algorithm>):

std::fill(myarray, myarray+N, 0);

which may be optimized automatically into a memset; I'm quite sure that it will work as fast as memset for ints, while it may perform slightly worse for smaller types if the optimizer isn't smart enough. Still, when in doubt, profile.

Matteo Italia
  • 123,740
  • 17
  • 206
  • 299
  • 12
    As of the 1999 ISO C standard, it wasn't actually guaranteed that `memset` would set an integer to 0; there was no specific statement that all-bits-zero is a representation of `0`. A Technical Corrigendum added such a guarantee, which is included in the 2011 ISO C standard. I believe that all-bits-zero *is* a valid representation of `0` for all integer types in all existing C and C++ implementations, which is why the committee was able to add that requirement. (There is no similar guarantee for floating-point or pointer types.) – Keith Thompson Jun 16 '13 at 00:17
  • 3
    Adding to @KeithThompson's comment: this guarantee was added to 6.2.6.2/5 in plain text in TC2 (2004) ; however if there are no padding bits then 6.2.6.2/1 and /2 already guaranteed that all-bits-zero was `0`. (With padding bits the possibility exists that all-bits-zero could be a trap representation). But in any case, the TC is supposed to acknowledge and replace defective text, so as of 2004 we should act as if C99 always contained this text. – M.M May 29 '15 at 11:02
  • In C, if you allocated the dynamic array _correctly_, then there will be no difference between the two memsets. Correct dynamic allocation would be `int (*myarray)[N] = malloc(sizeof(*myarray));`. – Lundin Sep 21 '15 at 09:04
  • @Lundin: of course - if you know at compile time how big `N` is, but in the vast majority of cases if you used `malloc` you only knew at runtime. – Matteo Italia Sep 21 '15 at 09:12
  • @MatteoItalia We have had VLAs since the year 1999. – Lundin Sep 21 '15 at 09:15
  • And if you only want to clear a part of that array: `memset(&(myarray[offset]),0,sizeof(myarray)-(sizeof(myarray[0])*offset));` – mbger Feb 28 '17 at 10:54
26

This question, although rather old, needs some benchmarks, as it asks for not the most idiomatic way, or the way that can be written in the fewest number of lines, but the fastest way. And it is silly to answer that question without some actual testing. So I compared four solutions, memset vs. std::fill vs. ZERO of AnT's answer vs a solution I made using AVX intrinsics.

Note that this solution is not generic, it only works on data of 32 or 64 bits. Please comment if this code is doing something incorrect.

#include<immintrin.h>
#define intrin_ZERO(a,n){\
size_t x = 0;\
const size_t inc = 32 / sizeof(*(a));/*size of 256 bit register over size of variable*/\
for (;x < n-inc;x+=inc)\
    _mm256_storeu_ps((float *)((a)+x),_mm256_setzero_ps());\
if(4 == sizeof(*(a))){\
    switch(n-x){\
    case 3:\
        (a)[x] = 0;x++;\
    case 2:\
        _mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;\
    case 1:\
        (a)[x] = 0;\
        break;\
    case 0:\
        break;\
    };\
}\
else if(8 == sizeof(*(a))){\
switch(n-x){\
    case 7:\
        (a)[x] = 0;x++;\
    case 6:\
        (a)[x] = 0;x++;\
    case 5:\
        (a)[x] = 0;x++;\
    case 4:\
        _mm_storeu_ps((float *)((a)+x),_mm_setzero_ps());break;\
    case 3:\
        (a)[x] = 0;x++;\
    case 2:\
        ((long long *)(a))[x] = 0;break;\
    case 1:\
        (a)[x] = 0;\
        break;\
    case 0:\
        break;\
};\
}\
}

I will not claim that this is the fastest method, since I am not a low level optimization expert. Rather it is an example of a correct architecture dependent implementation that is faster than memset.

Now, onto the results. I calculated performance for size 100 int and long long arrays, both statically and dynamically allocated, but with the exception of msvc, which did a dead code elimination on static arrays, the results were extremely comparable, so I will show only dynamic array performance. Time markings are ms for 1 million iterations, using time.h's low precision clock function.

clang 3.8 (Using the clang-cl frontend, optimization flags= /OX /arch:AVX /Oi /Ot)

int:
memset:      99
fill:        97
ZERO:        98
intrin_ZERO: 90

long long:
memset:      285
fill:        286
ZERO:        285
intrin_ZERO: 188

gcc 5.1.0 (optimization flags: -O3 -march=native -mtune=native -mavx):

int:
memset:      268
fill:        268
ZERO:        268
intrin_ZERO: 91
long long:
memset:      402
fill:        399
ZERO:        400
intrin_ZERO: 185

msvc 2015 (optimization flags: /OX /arch:AVX /Oi /Ot):

int
memset:      196
fill:        613
ZERO:        221
intrin_ZERO: 95
long long:
memset:      273
fill:        559
ZERO:        376
intrin_ZERO: 188

There is a lot interesting going on here: llvm killing gcc, MSVC's typical spotty optimizations (it does an impressive dead code elimination on static arrays and then has awful performance for fill). Although my implementation is significantly faster, this may only be because it recognizes that bit clearing has much less overhead than any other setting operation.

Clang's implementation merits more looking at, as it is significantly faster. Some additional testing shows that its memset is in fact specialized for zero--non zero memsets for 400 byte array are much slower (~220ms) and are comparable to gcc's. However, the nonzero memsetting with an 800 byte array makes no speed difference, which is probably why in that case, their memset has worse performance than my implementation--the specialization is only for small arrays, and the cuttoff is right around 800 bytes. Also note that gcc 'fill' and 'ZERO' are not optimizing to memset (looking at generated code), gcc is simply generating code with identical performance characteristics.

Conclusion: memset is not really optimized for this task as well as people would pretend it is (otherwise gcc and msvc and llvm's memset would have the same performance). If performance matters then memset should not be a final solution, especially for these awkward medium sized arrays, because it is not specialized for bit clearing, and it is not hand optimized any better than the compiler can do on its own.

Benjamin
  • 441
  • 4
  • 7
  • 4
    A benchmark without code and without mention of the compiler version and the options used? Hmm... – Marc Glisse Sep 21 '15 at 20:57
  • I already had the compiler versions (they were just a little hidden), and just added the applicable options used. – Benjamin Sep 22 '15 at 22:31
  • invalid type argument of unary '*' (have 'size_t {aka unsigned int}')| – Piotr Wasilewicz Jun 25 '17 at 08:08
  • Being so generous as to write your own optimized zeroing method - could you please spare few words on HOW it works, and WHY it is faster? the code is all but self-explanatory. – Motti Shneor Dec 31 '19 at 14:15
  • 3
    @MottiShneor It looks more complicated than it is. An AVX register has a size of 32bytes. So he calculates how many values of `a` fit into a register. Afterwards, he loops over all 32byte blocks, that should be fully overwritten using pointer arithmetics (`(float *)((a)+x)`). The two intrinsics (starting with `_mm256`) just create a zero-initialized 32byte register and store it to the current pointer. This are the first 3 lines. The rest just handles all special cases where the last 32byte block shouldn't be fully overwritten. It is faster due to vectorization. - I hope that helps. – wychmaster Mar 05 '20 at 11:41
14

From memset():

memset(myarray, 0, sizeof(myarray));

You can use sizeof(myarray) if the size of myarray is known at compile-time. Otherwise, if you are using a dynamically-sized array, such as obtained via malloc or new, you will need to keep track of the length.

Alex Reynolds
  • 95,983
  • 54
  • 240
  • 345
  • 2
    sizeof will work even if the size of the array is not known at compile-time. (of course, only when it's array) – asaelr Feb 05 '12 at 02:28
  • 2
    @asaelr: In C++, `sizeof` is always evaluated at compile-time (and cannot be used with VLAs). In C99, it can be a runtime expression in the case of VLAs. – Ben Voigt Jun 15 '13 at 22:53
  • @BenVoigt Well, the question is about both `c` and `c++`. I commented on Alex`s answer, that says "You can use sizeof(myarray) if the size of myarray is known at compile-time". – asaelr Jun 15 '13 at 23:44
  • 2
    @asaelr: And in C++, he's completely correct. Your comment didn't say anything about C99 or VLAs, so I wanted to clarify it. – Ben Voigt Jun 16 '13 at 00:27
6

You can use memset, but only because our selection of types is restricted to integral types.

In general case in C it makes sense to implement a macro

#define ZERO_ANY(T, a, n) do{\
   T *a_ = (a);\
   size_t n_ = (n);\
   for (; n_ > 0; --n_, ++a_)\
     *a_ = (T) { 0 };\
} while (0)

This will give you C++-like functionality that will let you to "reset to zeros" an array of objects of any type without having to resort to hacks like memset. Basically, this is a C analog of C++ function template, except that you have to specify the type argument explicitly.

On top of that you can build a "template" for non-decayed arrays

#define ARRAY_SIZE(a) (sizeof (a) / sizeof *(a))
#define ZERO_ANY_A(T, a) ZERO_ANY(T, (a), ARRAY_SIZE(a))

In your example it would be applied as

int a[100];

ZERO_ANY(int, a, 100);
// or
ZERO_ANY_A(int, a);

It is also worth noting that specifically for objects of scalar types one can implement a type-independent macro

#define ZERO(a, n) do{\
   size_t i_ = 0, n_ = (n);\
   for (; i_ < n_; ++i_)\
     (a)[i_] = 0;\
} while (0)

and

#define ZERO_A(a) ZERO((a), ARRAY_SIZE(a))

turning the above example into

 int a[100];

 ZERO(a, 100);
 // or
 ZERO_A(a);
AnT stands with Russia
  • 312,472
  • 42
  • 525
  • 765
3

For static declaration I think you could use:

T myarray[100] = {0};

For dynamic declaration I suggest the same way: memset

Drew Noakes
  • 300,895
  • 165
  • 679
  • 742
Bruno Soares
  • 756
  • 5
  • 6
2

zero(myarray); is all you need in C++.

Just add this to a header:

template<typename T, size_t SIZE> inline void zero(T(&arr)[SIZE]){
    memset(arr, 0, SIZE*sizeof(T));
}
Navin
  • 3,681
  • 3
  • 28
  • 52
  • 1
    This is incorrect, it'll clear SIZE bytes. 'memset(arr, 0, SIZE*sizeof(T));' would be correct. – Kornel Kisielewicz May 27 '15 at 14:05
  • @KornelKisielewicz D'oh! I hope nobody copy-pasted this function in the last 1.5 years :( – Navin May 29 '15 at 09:40
  • 1
    hope not, I commented because google brought me here :) – Kornel Kisielewicz Jun 02 '15 at 22:32
  • 1
    Note that this function `zero` is also correct for e.g. `T=char[10]` as could be the case when the `arr` argument is a multidimensional array e.g. `char arr[5][10]`. – mandrake Oct 23 '15 at 12:24
  • @mandrake Are you sure about that? I would expect that `T=char*` in that case :( – Navin Oct 23 '15 at 21:20
  • 1
    Yes, I tested a number of cases with gcc 4.7.3. I find this would be good to note for this answer, as you would otherwise need to have template specializations for each array dimension count. Other answers do not generalize as well, such as the `ARRAY_SIZE` macro, which gives the wrong size if used on a multidimensional array, a better name would perhaps be `ARRAY_DIM_SIZE`. – mandrake Oct 26 '15 at 07:23
  • This is incorrect for float and double as 0.0f and 0.0 is not all zeros. – t0k3n1z3r Nov 20 '18 at 12:48
1

Here's the function I use:

template<typename T>
static void setValue(T arr[], size_t length, const T& val)
{
    std::fill(arr, arr + length, val);
}

template<typename T, size_t N>
static void setValue(T (&arr)[N], const T& val)
{
    std::fill(arr, arr + N, val);
}

You can call it like this:

//fixed arrays
int a[10];
setValue(a, 0);

//dynamic arrays
int *d = new int[length];
setValue(d, length, 0);

Above is more C++11 way than using memset. Also you get compile time error if you use dynamic array with specifying the size.

Shital Shah
  • 63,284
  • 17
  • 238
  • 185