
I have a block of data in memory starting at `memory_ptr`. This program byte-reverses every qword and writes the result to a file. For example, 18171615141312112827262524232221 will be written as 11121314151617182122232425262728, and so on.

I want this to run as fast as possible, so I store the flipped bytes in memory and only write them to the file once 8 KB have accumulated, to reduce the number of calls to fwrite(). However, I feel like there is a simpler and faster way of doing this than further increasing the malloc'd buffer size to cut the fwrite calls even more. Any ideas on how to speed this up?

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>  /* htonl */

#define htonll(x) (((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))

uint64_t data;
int total_chunks = 1000;
int chunk_size = 8192;
char *tmp = malloc(8192);
FILE *fp = fopen("data.bin","wb");

for (int i = 0; i < total_chunks; i++) {
    for (int j = 0; j < chunk_size; j += 8) {
        memcpy(&data, memory_ptr, 8);   /* load one qword from the source block */
        data = htonll(data);            /* byte-reverse it */
        memcpy(tmp + j, &data, 8);      /* store it into the staging buffer */
        memory_ptr += 8;
    }
    fwrite(tmp, 8192, 1, fp);           /* flush one 8 KB chunk */
    memset(tmp, 0, 8192);
}
MuhminA
  • stdio uses output buffering, so you don't really need to minimize the number of calls to fwrite. It will write to the file when its internal buffer is full (see the setvbuf sketch after these comments). – Barmar Jan 24 '22 at 22:55
  • @harold which part of the code isn't matching the description? – MuhminA Jan 24 '22 at 23:20
  • 2
    "I want this to run as fast as possible" -- you are doing _way too many_ unnecessary `memcpy`ies and `memset`s. – Employed Russian Jan 24 '22 at 23:43
  • 1
    @EmployedRussian note that `memcpy` and `memset` function calls with a small fixed size are optimized by compilers. They generally remove the function call, they can remove useless copies and generates fast instructions like SIMD moves. I would rather be worried about the `htonll` macro as not all compilers may optimize it (but good ones should). – Jérôme Richard Jan 24 '22 at 23:49
  • 1
    your macro changes ABCD to DCBA , not BADC as in your description – M.M Jan 24 '22 at 23:52
  • @Barmar stdio calls are guaranteed thread safe though so there might be slowdown due to lock/unlock, e.g. if we decided to `fputc` everything – M.M Jan 24 '22 at 23:55
  • @harold My example description was wrong, I've changed it to reflect the code – MuhminA Jan 24 '22 at 23:56
  • @M.M I'll bet there's an optimization that skips that if the process is not multi-threaded, which is true for the vast majority of applications. – Barmar Jan 25 '22 at 00:04
  • 1
    Can you do the swapping in place, without allocating extra buffer? – Lev M. Jan 25 '22 at 00:38
  • @LevM. The `tmp` buffer should fit in the L1 cache on nearly all modern desktop processors (even cheap 17-year old Cortex-A8 ARM processors have at least 16 KiB of L1 cache). The throughput of the L1 cache is huge (much more than the RAM, which is much more than the one of the fastest SSDs). For example a 4.5GHz 5-year-old Intel Skylake processor can copy a buffer at the speed of up to 134 GiB/s per core. Most SSD do not even reach 5 GiB/s. – Jérôme Richard Jan 25 '22 at 01:11
  • 1
    *Before* you start asking "how to make this run faster", you should profile it. There's no point in making any change, if you can't know if it actually makes any difference, or even makes things slower (can happen unexpectedly for example because of CPU cache behavior). Anyway, the performance of this code is probably absolutely dominated by `fwrite` on modern CPUs and compilers, with optimizations turned on. If you really want the best performance, check out this: [https://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size](https://stackoverflow.com/q/794632/1717300). – hyde Jan 25 '22 at 11:39
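On the stdio-buffering point above: rather than batching writes by hand, you can ask stdio itself for a larger buffer with setvbuf. A minimal sketch; the 1 MiB size is an arbitrary assumption to tune:

#include <stdio.h>

FILE *fp = fopen("data.bin", "wb");
/* Must be called before the first read/write on the stream. With a
   fully-buffered ("_IOFBF") 1 MiB buffer, fwrite batches the actual
   write() system calls for you. */
setvbuf(fp, NULL, _IOFBF, 1 << 20);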
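And on the `htonll` concern: a hedged sketch of a byte swap that compilers reliably turn into a single bswap/rev instruction. `__builtin_bswap64` is a GCC/Clang builtin (MSVC spells it `_byteswap_uint64`); the fallback is plain shift-and-mask C, and the helper name `bswap64` is mine:

#include <stdint.h>

static inline uint64_t bswap64(uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(x);  /* compiles to a single bswap/rev */
#else
    /* Portable fallback: move byte k to byte 7-k. */
    return ((x & 0x00000000000000FFull) << 56) |
           ((x & 0x000000000000FF00ull) << 40) |
           ((x & 0x0000000000FF0000ull) << 24) |
           ((x & 0x00000000FF000000ull) <<  8) |
           ((x & 0x000000FF00000000ull) >>  8) |
           ((x & 0x0000FF0000000000ull) >> 24) |
           ((x & 0x00FF000000000000ull) >> 40) |
           ((x & 0xFF00000000000000ull) >> 56);
#endif
}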

1 Answer


TL;DR: tune your compiler flags (or use SIMD intrinsics/libraries if your compiler is not very clever) and use memory-mapped files. If this is not enough, you can use multiple threads with OpenMP, and lower-level APIs that avoid many overheads at the expense of less portable and more complex code.

Vectorization using SIMD instructions

Your current code is already quite good, assuming the compiler you use is tuned to generate fast code and its heuristics are good enough to do so. Note that memcpy and memset function calls with a small fixed size are optimized by compilers (at least by Clang, GCC and ICC). Clang generates nearly optimal code for modern x86-64 processors with the flags -O3 -mavx2, which assume the target machine has the AVX2 instruction set (most x86 processors have it, but not all, especially old ones). Here is the generated assembly code of the hot loop:

.LBB0_7:
        vmovdqu ymm0, ymmword ptr [r14 + 8*rcx]
        vmovdqu ymm1, ymmword ptr [r14 + 8*rcx + 32]
        vmovdqu ymm2, ymmword ptr [r14 + 8*rcx + 64]
        vmovdqu ymm3, ymmword ptr [r14 + 8*rcx + 96]
        vpshufb ymm0, ymm0, ymm4
        vpshufb ymm1, ymm1, ymm4
        vpshufb ymm2, ymm2, ymm4
        vpshufb ymm3, ymm3, ymm4
        vmovdqu ymmword ptr [rbx + 8*rcx], ymm0
        vmovdqu ymmword ptr [rbx + 8*rcx + 32], ymm1
        vmovdqu ymmword ptr [rbx + 8*rcx + 64], ymm2
        vmovdqu ymmword ptr [rbx + 8*rcx + 96], ymm3
        add     rcx, 16
        cmp     rcx, 1024
        jne     .LBB0_7

This code shuffles 128 bytes at once in only a few cycles (only 4 cycles on my Intel Skylake processor)!

If you are unsure about requiring the AVX2 instruction set, do not worry: compilers like GCC and Clang already generate quite good code with the AVX instruction set (which is available on almost all modern x86-64 processors). You can see that on Godbolt. Using the flag -mavx is enough to generate fast code (for Clang/GCC/ICC). For MSVC you need to use the flags /arch:AVX and /arch:AVX2 respectively. If you want to support nearly all x86-64 processors at the expense of slower code, you can use the -mssse3 flag to use SSE instead of AVX.

Note about the support of SIMD instruction sets: the Steam hardware survey reports that AVX2 is supported by 87% of its users, AVX by 95% and SSSE3 by 99.3%.

Note that the same applies to other hardware architectures: you mostly just need to enable the right flags for your compiler.
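If your compiler fails to autovectorize the loop (the "not very clever" case from the TL;DR), you can write the shuffle explicitly with intrinsics. Here is a minimal sketch of the vpshufb approach shown in the assembly above; it assumes AVX2 support and a size that is a multiple of 32 bytes, and the helper name bswap64_avx2 is mine:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-reverse each 64-bit word of src into dst using AVX2.
   n must be a multiple of 32 (4 qwords per iteration). */
void bswap64_avx2(const uint8_t *src, uint8_t *dst, size_t n) {
    /* vpshufb mask: reverse the 8 bytes of each qword within each
       128-bit lane. */
    const __m256i mask = _mm256_setr_epi8(
        7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8,
        7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
    for (size_t i = 0; i < n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
        v = _mm256_shuffle_epi8(v, mask);            /* vpshufb */
        _mm256_storeu_si256((__m256i *)(dst + i), v);
    }
}

Compile it with -mavx2 (or guard it behind a runtime CPU-feature check if the binary must also run on non-AVX2 machines).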


Additional optimizations

The final memset(tmp,0,8192); call is not needed, since every byte of tmp is overwritten again on the next iteration anyway. You can remove it.

fwrite is generally pretty fast because the libc uses its own (quite large) buffer internally, and that buffer should be reasonably well adapted to your hardware. However, the data handed to the operating system (OS) needs to be copied, and this copy is not done efficiently by most kernel implementations (e.g. Linux, for complex technical reasons related to the use of SIMD instructions). This copy is generally not a problem in terms of performance on hard disk drives (HDDs). However, it can introduce an overhead on fast solid-state drives (SSDs), especially high-throughput NVMe ones capable of writing several GiB/s. One way to speed this up is to use memory-mapped files (see here, here and there for more information).
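As an illustration, here is a minimal POSIX sketch of the memory-mapped approach, reusing htonll, total_chunks, chunk_size and memory_ptr from the question; error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

size_t total_size = (size_t)total_chunks * chunk_size;
int fd = open("data.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
ftruncate(fd, total_size);   /* the mapping needs the final file size */
uint8_t *out = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);

for (size_t i = 0; i < total_size; i += 8) {
    uint64_t v;
    memcpy(&v, memory_ptr + i, 8);
    v = htonll(v);
    memcpy(out + i, &v, 8);  /* store straight into the mapping */
}

munmap(out, total_size);     /* writes back the dirty pages */
close(fd);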

If you have a very fast SSD and using memory-mapped files is not enough, then you can try to parallelize the loop using OpenMP. This should be very simple in your case since the loop is embarrassingly parallel: just add a #pragma omp parallel for. Note that this solution can be slower on some slow SSDs and will certainly be slower on HDDs (which like neither non-sequential accesses nor parallel ones).
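For example, combined with the mapped output buffer out from the previous sketch, the parallel version could look like this (compile with -fopenmp; each thread writes a disjoint range, so no synchronization is needed):

#pragma omp parallel for
for (int i = 0; i < total_chunks; i++) {
    for (int j = 0; j < chunk_size; j += 8) {
        size_t off = (size_t)i * chunk_size + j;
        uint64_t v;
        memcpy(&v, memory_ptr + off, 8);
        v = htonll(v);
        memcpy(out + off, &v, 8);  /* disjoint ranges: no locking needed */
    }
}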

If this is still not enough, you can try experimental solutions like liburing, which uses the new io_uring kernel interface available only on Linux (Windows seems to have a similar interface called IoRing). This interface is designed to avoid copies (i.e. zero-copy) and system calls (no system calls in hot loops), resulting in very fast code with almost no overhead. However, using it directly will make your code less portable and much more complex. The above methods should probably be enough for you.
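For reference, a hedged sketch of the liburing write path; a real implementation would keep several writes in flight rather than waiting on each one, and would probably use registered buffers. Error handling is again omitted:

#include <fcntl.h>
#include <liburing.h>
#include <stdint.h>
#include <unistd.h>

struct io_uring ring;
io_uring_queue_init(64, &ring, 0);  /* 64-entry submission/completion queues */
int fd = open("data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);

for (int i = 0; i < total_chunks; i++) {
    /* ... fill tmp with one byte-swapped 8 KB chunk, as before ... */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, tmp, chunk_size, (uint64_t)i * chunk_size);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);  /* wait before reusing tmp */
    io_uring_cqe_seen(&ring, cqe);
}

io_uring_queue_exit(&ring);
close(fd);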

Jérôme Richard