1

My code needs to work with a large array of structures containing multiple strings.

in practice, the whole array will contain about 25k structures with a size of about 256 byte each, so the whole array needs about 6 MiB of heap space.

// example
struct element {
    char foo[32];
    char bar[16];
    ...
}; // sizeof(struct element) = 256

I was concerned about the performance of calloc due to it zeroing all the memory, also I don't need every byte to be initialized. So I did element_arr = malloc(num_elements * sizeof(struct element)).

I allocate the array at runtime as I don't know num_elements at compile time.

For my code to work, I actually only need the first bytes of each member (foo, bar, etc.) to be zero, the rest can stay uninitialized.

Say I got 8 string members per struct, so I only need 3% of my zeroed bytes, the other 97% cleared bytes are waste as they will get overwritten by real data eventually).

I see a few options:

  1. zero everything at once, e.g. with calloc which does (I hope) make use of vectored instructions to write large blocks of aligned zeroes.

  2. memset each 256-byte sized struct element before filling it with real data.

  3. assign 0 to each member of struct element before using it. (*element->foo = 0; ...) This translates to a chain of mov instructions, with optimizations at -O3. It is cumbersome to write language-wise (but can be taken care of).

        mov     byte ptr [rdi + 152], 0
        mov     byte ptr [rdi + 208], 0
        mov     byte ptr [rdi + 200], 0
        mov     byte ptr [rdi + 128], 0
        ...

looks similar for arm64.

  1. make a very conservative assumption about the size of element_arr (e.g. 64 MiB), place it in a zero-initialized section of memory. (The OS needs to zero my memory then)

char element_arr[64 * 1000 * 1000] = {0}; (checking num_elements < 250000 to be sure)

Does it make a difference what option to choose ? What would you suggest ?

Edit: @John Bayko The individual structures are filled incrementally, but all strings need to start with '\0' otherwise the algorithm can't distinguish between a real string (already got filled) or uninitialized garbage.

After reading the other answers I'm convinced that it probably won't be a problem anytime soon. It's good to know that the simplest solution (calloc) is a good one in the majority of use cases.

I profiled my code on my dev machine and indeed, the time spent on allocation is neglectible.

Thanks for your replies.

Kitsune
  • 117
  • 5
  • 1
    Another option is to `memset` the initial portion of each structure. The `offsetof` macro can be used to find the required length. Doing this as each structure is needed may be better than doing it all it once because it may allow the clearing to overlap other work. – Eric Postpischil Aug 22 '23 at 15:14
  • Does every element in the array have to be usable at the start, or are they set up incrementally, e.g. filled in as part of a loop? In the second case, seems like you could do lazy initialization, and by keeping it local you wouldn't stress the cache so much. – John Bayko Aug 22 '23 at 16:12
  • If your struct elements are of size multiple of 16 (or something else), you can zero out each 16's byte of the structure (assuming it has no padding whatsoever). But I would not be concerned about performance of zeroing out everything unless it is actually affecting the functionality. – Eugene Sh. Aug 22 '23 at 16:17
  • @Kitsune "I was concerned about the performance" --> consider an alternative to _strings_ as many operations are slow. Quicker alternatives exist, yet need to see your code to best advise. – chux - Reinstate Monica Aug 22 '23 at 17:17
  • 1
    Unless you have a *measurable* performance problem, you do not in fact have a performance problem. If you have a measurement, then it needs to be *outside your required parameters*. Everyone wants code that runs in zero time, but in practice, code that runs infrequently and takes a millisecond or two isn't even going to be noticed. – tadman Aug 22 '23 at 17:49
  • 1
    Don't do strange manual optimization before you have confirmed that something is actually a performance bottleneck. – Lundin Aug 23 '23 at 06:59

4 Answers4

4

What would you suggest ?

Set up a test harness to assess performance and then rate the various code options.

Without profiling your code, there remains many implementation dependent issues. Consider that an OS may zero a pool of memory in the free time. E.g. code does a scanf() call and waits for input. During that time, perhaps megabytes of memory are zeroed for later calloc() usage - effectively costing the program zero additional time to zero memory. OS may even use zeroed memory from a pool created before program started.

Even element_arr = calloc_or_malloc(num_elements * sizeof(struct element)) does not certainly use memory at that time, as true usage may get deferred until code accesses the memory.

UpAndAdam
  • 4,515
  • 3
  • 28
  • 46
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
4

You may be thinking about optimization too prematurely. Doing calculations to initialize only part of the memory, or calling memset() manually, can easily end up being a lot slower than just calling calloc().

If you are allocating 6MiB of memory, you are likely exceeding M_MMAP_THRESHOLD, so that memory will come straight out of a dedicated mmap() call and will be already zeroed to begin with, without the need of doing anything. Therefore, it most likely doesn't make sense to use anything other than calloc() here.

Do the selective zeroing only if you are using some exotic libc implementation, and perform the appropriate benchmarks to determine its performance. In case you are using glibc, musl or other major C libraries, you should avoid zeroing manually on such a large allocation and use calloc().

In any case, also consider that modern CPUs can perform zeroing at speeds in the order of gigabytes per second, so if this is something you're only doing once or twice during the entire runtime of your program, it's going to be unnoticeable, and it's again easier to just calloc() instead of manually and selectively zeroing, which could result in mistakes/bugs.

Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
3

You should use the simplest approach: if you can take advantage of the zero initialization performed (mostly for free) by calloc, then use calloc to allocate the full array.

Trying to benchmark versions where you use malloc and calloc is surprisingly tricky: for a large block of memory, malloc and calloc are likely to get a fresh set of memory mapped pages from the operating system, which will be zero initialized in modern operating systems for security reasons anyway, to prevent processes from eavesdropping on other processes freed memory.

This means zero initialization comes essentially for free, and manually zeroeing memory is almost always less efficient.

To make matters even more surprising, calloc may return a block of pages that are actually aliases to the same zero intialized page, mapped with copy-on-write flags, so the pages will only be mapped, and intialized when actually modified by the program.

You should read these questions and pages:

chqrlie
  • 131,814
  • 10
  • 121
  • 189
0

From what you say in your text, after the code, IMHO, the best to initialize your struct is to just put a 0 in the first byte of each string array, this can be done with a simple loop like:

struct element *p = element_arr;
for (int i = 0; i < 64 * 1000 * 1000; i++, p++) {
    p->foo[0] = '\0';
    p->bar[0] = '\0';
    /* ... */
}

which will store 0's only in the interesting places, and won't require to fill the full array. This will be more efficient, if you have an average of 32-64 bytes per array in the struct, at an average of 48bytes per field, you will have around of 5 fields in the structure and you will make 125,000 assignments of one char, instead of 6,000,000, a good improvement. Right?

Luis Colorado
  • 10,974
  • 1
  • 16
  • 31
  • interesting, you are essentially suggesting my option 3. from the other answers I got, I don't see this one favorable any more: It's minimalistic (no superfluous operation), but hard to maintain. It's not necessarily faster than optimized vector instructions. OS has lots of zero-initialized memory (basically) for free. (un)fortunately I don't have a problem size this big to measure a significant difference (I tried). – Kitsune Aug 24 '23 at 18:34
  • Well, updating 125,000 memory cells against 6,000,000 filling the whole space, requires more code than a plain `memcpy(3)` call, but I have to disagree on being minimalistic (minimalistic is good in programming) and hard to maintain (of course) but when you use malloc() I'm afraid you don't always get clean memory from the system, but memory used already in malloc. So you can think otherwise, but initializing just the first byte, against the full memory is incurrinn in a penalty of 48:1 memory accesses. Simulating the measurements should not be a problem. Please go ahead! – Luis Colorado Aug 27 '23 at 17:15