
I have an ARM platform running Linux where the L1 cache line is 64 bytes long.

I decided to replace malloc (via LD_PRELOAD) with another malloc that returns 64-byte-aligned memory no matter the size passed to malloc.
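A minimal sketch of such an interposer, assuming glibc on Linux (the file and library names are placeholders, and `calloc`/`realloc` would need similar treatment for full coverage):

```c
/* alloc64.c — hypothetical sketch of a 64-byte-aligning malloc interposer.
 * Build and use (names are placeholders):
 *   gcc -shared -fPIC -o liballoc64.so alloc64.c
 *   LD_PRELOAD=./liballoc64.so ./your_program
 */
#define _GNU_SOURCE
#include <stdlib.h>

#define CACHE_LINE 64

/* Interposed malloc: always returns 64-byte-aligned memory. */
void *malloc(size_t size)
{
    void *p = NULL;

    if (size == 0)
        size = 1;   /* preserve malloc(0) returning a unique pointer */

    /* glibc's posix_memalign uses the internal allocator rather than the
       public malloc symbol, so this call does not recurse into this
       interposer, and the resulting pointer can be released by the
       stock free(), so free() need not be interposed. */
    if (posix_memalign(&p, CACHE_LINE, size) != 0)
        return NULL;

    return p;
}

/* calloc and realloc are intentionally omitted to keep the sketch short;
   a complete interposer would have to cover them as well. */
```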

I expected memory consumption to increase (which actually happened) while CPU utilization went down. That didn't happen. In other words, both memory and CPU utilization went up.

How can this be explained?

Thanks,

user3523954
  • `How can this be explained?` - this is not enough information. What platform are you running, exactly? ARM is a [family of platforms](https://en.wikipedia.org/wiki/ARM_architecture) supporting a specific architecture set; there are many ARM platforms. Do you have an 8-bit processor? 128-bit? Have you ruled out the possibility that your `malloc` is just slower and eats more memory? How do you measure CPU utilization and memory utilization? How did you conduct the tests? Can you show your malloc implementation? How did you run it? How did you substitute it? Did you substitute the whole C library? And so on. – KamilCuk Apr 02 '19 at 06:44
  • How to reproduce the problem? Well, I have (some) ARM platform on my desk; there is even a phone with a Cortex-A53 in my pocket right now. Can I test your statement? Reproduce the conditions? What conditions? How am I going to do that? Did you share the source code? Did you share your `malloc` implementation? Did you share your original malloc implementation (glibc? musl? newlib?)? Did you share how you arrived at your thesis? Please make your problem reproducible, so others can test it, and verifiable, so others can verify it. Create an [MCVE](https://stackoverflow.com/help/mcve) – KamilCuk Apr 02 '19 at 06:49

1 Answer


It depends on what you malloc(). If you use malloc() for large chunks of data, this should not make a real difference. But if you malloc() elements smaller than 64 bytes, you will probably not use the cache efficiently.

malloc() allocates elements in memory in program order. If several malloc()s happen close together, the elements will sit at successive memory addresses, and it is likely that they will be used together, as they were created at the same time. This is the so-called spatial locality principle. Of course nothing is guaranteed, especially with dynamically allocated data, but spatial locality is observed in most programs. The practical implication of this principle is that it allows a better use of caches. A cache miss is expensive (you have to fetch 64 bytes from memory), but if you use elements that are close in memory, you only pay it once.
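As an illustration (a hypothetical sketch; the 16-byte `struct node` and the counts are assumptions, not taken from the question), consecutive small allocations usually land in the same cache line, so walking them touches lines the hardware has already fetched:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A hypothetical 16-byte element (on a typical 64-bit ABI). */
struct node {
    struct node *next;
    long value;
};

int main(void)
{
    struct node *head = NULL, **tail = &head;

    /* Four consecutive malloc()s: with a standard allocator these nodes
       are likely placed back to back, so several of them share one
       64-byte line; with a 64-byte-aligned allocator each node gets a
       line of its own. */
    for (long i = 0; i < 4; i++) {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return 1;
        n->value = i;
        n->next = NULL;
        *tail = n;
        tail = &n->next;
    }

    /* Print which 64-byte cache line each node falls in. */
    for (struct node *n = head; n != NULL; n = n->next)
        printf("node %ld at %p, cache line %ju\n",
               n->value, (void *)n, (uintmax_t)((uintptr_t)n / 64));

    /* Release the list. */
    while (head != NULL) {
        struct node *n = head;
        head = n->next;
        free(n);
    }
    return 0;
}
```

Run once with the stock allocator and once under the aligned interposer: the printed line numbers should show several nodes packed into one line in the first case, and one line per node in the second.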

So, if separately allocated data are in the same cache line, fetching one of these elements brings in the neighboring elements for free. But if each element occupies a complete cache line, as with your modified allocator, this is no longer true. Every access to a new element will be a cache miss, the number of elements your cache can hold will be reduced, and the effect is as if your cache had shrunk. The net result will be an increase in your computation time.
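To put rough numbers on it (assuming, for illustration, a 32 KiB L1 data cache with 64-byte lines, i.e. 512 lines): packed four to a line, 16-byte elements fit in the cache 2048 at a time; with one element per line, only 512 fit. That is a fourfold reduction in effective cache capacity for the same hardware.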

Alain Merigot
  • As well, you will have TLB misses (MMU lookups) and possibly swapping. Note that Linux does 'demand paging' for file-backed code, so even on a system without swap you may be kicking code out of memory, to be reloaded from its inode later. See also [Skip copy on realloc](https://stackoverflow.com/questions/16765389/is-it-true-that-modern-os-may-skip-copy-when-realloc-is-called) for other allocator nuances; it basically reiterates that the cache is the dominant issue. – artless noise Apr 02 '19 at 17:16