
I am writing a program that reads signal values from a file. My current problem is performance: the data cannot be read out fast enough.

The signal values (most of the time 32-bit floating point) are stored in a binary file, which I have loaded completely into main memory. The basic format is as follows:

  1. I have a starting address at which the numbers are stored.
  2. There is always more than one block of data; how many depends on who created the file.
  3. There is one offset per block, as mentioned in 2.
  4. I am given the number of elements I have to read out.

If you had to sketch it, it looks like this:

[sketch: equally spaced blocks; the 4-byte values to extract are highlighted in blue]

The blue blocks are the data I have to read out and write into an array, so that they end up contiguous. The spacing between the blue blocks is always the same, but it can be larger than a cache line, which (I suspect) is why performance suffers a lot.
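Concretely, element i is read from `start_of_data + offset_of_entry + i * offset_between_elements` and written to `arr + i * 4` (as in the code below): a strided gather into a contiguous array.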

My current code:

void extract_4byte_blocks(
    size_t offset_of_entry,          /* byte offset of the value inside a block */
    size_t offset_between_elements,  /* constant stride from one block to the next */
    char *arr,                       /* destination: contiguous output array */
    size_t size_of_array,            /* number of 4-byte elements to extract */
    char *start_of_data) {           /* start address of the in-memory file data */

    char *next_block_start = start_of_data;

    for (size_t i = 0; i < size_of_array; ++i) {

        /* copy one 4-byte value out of the current block */
        char *src = next_block_start + offset_of_entry;
        char *dest = arr + i * 4;
        next_block_start += offset_between_elements;

        memcpy(dest, src, 4);
    }
}
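
For reference, the call site looks roughly like this (a sketch: `load_signal`, `entry_offset`, `stride` and `n` are placeholder names, and error handling is omitted; the mmap call is the one I mention in the comments below):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the binary file read-only, gather n 32-bit values into a
   contiguous array, unmap. extract_4byte_blocks is the function above. */
float *load_signal(const char *path, size_t entry_offset, size_t stride, size_t n) {
    int fd = open(path, O_RDONLY);
    struct stat file_info;
    fstat(fd, &file_info);
    char *base = mmap(NULL, file_info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */

    float *out = malloc(n * sizeof *out);
    extract_4byte_blocks(entry_offset, stride, (char *)out, n, base);

    munmap(base, file_info.st_size);
    return out;                      /* caller frees */
}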

Current profiling data:

Execution time: [profiler screenshot]

Cache: [profiler screenshot]

Does anyone have a tip on what I can do to improve performance?


Edit: I fired up the profiler again, this time with -O3, and then I remembered why I stopped profiling with -O3: I can't read anything useful out of the results, and the time gain of 2 seconds is not so large that I can say I can live with it. The only thing that has changed is that the compiler has probably inlined everything, which is how the duration ends up attributed to the calling function.

[profiler screenshots with -O3 enabled]

As you can see, the main problem is reading: that is where I get the most cache misses, which is what I suspected. I have also looked at offset_between_elements, and as suspected it can be very large. For the most performance-critical reads, offset_between_elements is 2632, 2210, 109 and 366 bytes. Ergo, practically every read results in a cache miss.
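
To put numbers on that (assuming 64-byte cache lines, which is typical for x86): all of these strides are larger than 64 bytes, so no two consecutive elements ever share a cache line, and every line that is fetched contributes only 4 of its 64 bytes. Roughly 1/16 of the transferred data is useful. With the 2632-byte stride, consecutive reads are even about 2632 / 64 ≈ 41 cache lines apart.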

EraZer

  • What's the approximate value of `size_of_array`? Thousands? Millions? Billions? – Jabberwocky Aug 29 '23 at 07:56
  • At the moment I am testing with a small data set; here the length is on average 300,000 items, but you can expect it to go up to just under 10,000,000. – EraZer Aug 29 '23 at 08:10

2 Answers

"Although I have not activated any optimisation in the profiling, I checked with godbolt and, with the index calculation, it optimises itself away"

Then your profiling makes no sense at all. Trust your compiler.

When you do not optimize, your function accesses memory 24 times per loop iteration; when you optimize, it accesses memory only 2 times. That is 12x more memory accesses without optimization. Profiling without optimizations makes no sense at all.

extract_4byte_blocks:                       |   extract_4byte_blocks:
        test    rcx, rcx                    |            push    rbp
        je      .L1                         |            mov     rbp, rsp
        add     r8, rdi                     |            mov     QWORD PTR [rbp-40], rdi
        lea     rcx, [rdx+rcx*4]            |            mov     QWORD PTR [rbp-48], rsi
.L3:                                        |            mov     QWORD PTR [rbp-56], rdx
        mov     eax, DWORD PTR [r8]         |            mov     QWORD PTR [rbp-64], rcx
        add     rdx, 4                      |            mov     QWORD PTR [rbp-72], r8
        add     r8, rsi                     |            mov     rax, QWORD PTR [rbp-72]
        mov     DWORD PTR [rdx-4], eax      |            mov     QWORD PTR [rbp-8], rax
        cmp     rdx, rcx                    |            mov     QWORD PTR [rbp-16], 0
        jne     .L3                         |            jmp     .L2
.L1:                                        |    .L3:
        ret                                 |            mov     rdx, QWORD PTR [rbp-8]
                                            |            mov     rax, QWORD PTR [rbp-40]
                                            |            add     rax, rdx
                                            |            mov     QWORD PTR [rbp-24], rax
                                            |            mov     rax, QWORD PTR [rbp-16]
                                            |            lea     rdx, [0+rax*4]
                                            |            mov     rax, QWORD PTR [rbp-56]
                                            |            add     rax, rdx
                                            |            mov     QWORD PTR [rbp-32], rax
                                            |            mov     rax, QWORD PTR [rbp-48]
                                            |            add     QWORD PTR [rbp-8], rax
                                            |            mov     rax, QWORD PTR [rbp-24]
                                            |            mov     edx, DWORD PTR [rax]
                                            |            mov     rax, QWORD PTR [rbp-32]
                                            |            mov     DWORD PTR [rax], edx
                                            |            add     QWORD PTR [rbp-16], 1
                                            |    .L2:
                                            |            mov     rax, QWORD PTR [rbp-16]
                                            |            cmp     rax, QWORD PTR [rbp-64]
                                            |            jb      .L3
                                            |            nop
                                            |            nop
                                            |            pop     rbp
                                            |            ret        

https://godbolt.org/z/q4nfE7qoK

You may also try to force gcc to unroll the loop by using a pragma:

#pragma GCC unroll 16
    for (size_t i = 0; i < size_of_array; ++i) {

But you need to profile, because unrolling loops may have an adverse effect on performance. Everything depends on the program.

For large files I would suggest having another task/thread load new data from the file while you process the current chunk; a minimal sketch follows.
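
A double-buffering sketch (my assumption: chunked fread-based loading rather than the OP's mmap setup; process() stands in for the extraction step, and error handling is omitted):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK_SIZE (64u * 1024 * 1024)   /* 64 MiB per chunk; tune to taste */

struct read_job {
    FILE  *fp;
    char  *buf;
    size_t nread;                        /* bytes actually read; 0 at EOF */
};

static void *read_chunk(void *arg) {
    struct read_job *job = arg;
    job->nread = fread(job->buf, 1, CHUNK_SIZE, job->fp);
    return NULL;
}

/* Reads the file chunk by chunk; a worker thread loads chunk N+1 while
   the main thread processes chunk N. Records that straddle a chunk
   boundary would still need extra handling. */
void process_file(FILE *fp, void (*process)(const char *, size_t)) {
    char *bufs[2] = { malloc(CHUNK_SIZE), malloc(CHUNK_SIZE) };
    struct read_job job = { fp, bufs[0], 0 };

    read_chunk(&job);                    /* fill the first buffer synchronously */
    for (int cur = 0; job.nread > 0; cur ^= 1) {
        size_t n = job.nread;
        pthread_t t;
        job.buf = bufs[cur ^ 1];         /* start loading the next chunk ... */
        pthread_create(&t, NULL, read_chunk, &job);
        process(bufs[cur], n);           /* ... while processing the current one */
        pthread_join(t, NULL);
    }
    free(bufs[0]);
    free(bufs[1]);
}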

0___________
  • I have made an edit to the post regarding your answer. In short, I have already tried switching the optimisation on; the gain was not good enough, and with active optimisation I could no longer find the "problem areas" so well. – EraZer Aug 29 '23 at 09:02
  • @EraZer the lack of optimizations was **creating** problem areas, not exposing them. You were profiling the "noise". Optimized code sometimes requires a bit more advanced profiling, as naive methods might not work (functions can be inlined, and the linker might add some optimizations too). – 0___________ Aug 29 '23 at 09:40
  • Do you have any advice on what I can do in terms of advanced profiling? – EraZer Aug 29 '23 at 10:09

It may depend on the size of the data you read. You can at least eliminate some computations, since the offset is always the same, as is the length of the copy:

char *src = (char *) start_of_data + offset_of_entry;
char *dest = arr;

for (size_t i = 0; i < size_of_array; ++i) {
    memcpy(dest, src, 4);
    dest += 4;
    src += offset_between_elements;
}
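
A variant of the same loop with restrict qualifiers (as paddy suggests in the comments below), telling the compiler that source and destination cannot overlap; a sketch, not a measured improvement:

#include <stddef.h>
#include <string.h>

/* Same gather loop; restrict promises the compiler that src and dest
   never alias, and a constant 4-byte memcpy is lowered to a single
   32-bit load/store at -O2/-O3 by mainstream compilers. */
void extract_4byte_blocks_r(
    size_t offset_of_entry,
    size_t offset_between_elements,
    char *restrict arr,
    size_t size_of_array,
    const char *restrict start_of_data) {

    const char *src = start_of_data + offset_of_entry;
    char *dest = arr;

    for (size_t i = 0; i < size_of_array; ++i) {
        memcpy(dest, src, 4);
        dest += 4;
        src += offset_between_elements;
    }
}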

Aside: "The file was loaded into main memory." If the file is huge, that may not be a good idea.

Jean-Baptiste Yunès
  • Although I have not activated any optimisation in the profiling, I checked with godbolt and, with the index calculation, it optimises itself away. – EraZer Aug 29 '23 at 08:13
  • I have tested both variants and loading into the main memory outperforms the variant with fread. The binary files are usually no larger than 3GB. – EraZer Aug 29 '23 at 08:15
  • Did you try not calling `memcpy` but using your own basic copy code, such as `dest[0] = src[0]; dest[1] = src[1];` etc.? The memcpy call would be eliminated, as would many of its unnecessary internal pre-computations. – Jean-Baptiste Yunès Aug 29 '23 at 08:20
  • Well, the stdio functions come with their own set of challenges. See [What goes on behind the curtains during disk I/O?](https://stackoverflow.com/q/13171052/1553090) You should be profiling with compiler optimizations enabled. Doing otherwise produces results that are potentially meaningless. – paddy Aug 29 '23 at 08:20
  • Any sane compiler knows how to optimize away a small `memcpy`. That said, it's possible that data alignment may affect performance, if the file format does not align the float data to appropriately-sized boundaries, or if the memory into which it's read is not similarly aligned. – paddy Aug 29 '23 at 08:22
  • @paddy sure! But for 3 GB of data and 435 calls to the function, something is wrong there. I suppose that the mapping of the file (is the file memory-mapped? Or?) is certainly not the right solution. – Jean-Baptiste Yunès Aug 29 '23 at 08:23
  • They could be page faulting. I didn't see any stats there about the virtual memory system. – paddy Aug 29 '23 at 08:23
  • If the data is not aligned, then map the file such that `start_of_data + offset_of_entry` is aligned.
  • Yes, the file is memory mapped, and it is read-only, e.g. mmap(NULL, file_info.st_size, PROT_READ, MAP_PRIVATE, fd, 0); – EraZer Aug 29 '23 at 08:25
  • @EraZer what are the values of `start_of_data` and `offset_of_entry`? – Jean-Baptiste Yunès Aug 29 '23 at 08:27
  • I have also thought about the fact that the data is not very well aligned, but as you can see from offset_of_entry and offset_between_elements, there is not much you can do here to improve the alignment. Or am I missing something? – EraZer Aug 29 '23 at 08:27
  • @Jean-BaptisteYunès it depends; I have no influence on it. – EraZer Aug 29 '23 at 08:28
  • You should probably be using the `restrict` keyword here to inform the compiler that the source and destination addresses do not overlap. – paddy Aug 29 '23 at 08:29
  • @EraZer if `offset_between_elements` is a distance that ensures decent alignment, you can play with the offset at the moment of mapping so that you start with a correct alignment. If you have no control, then do some computation to try to align everything if possible. – Jean-Baptiste Yunès Aug 29 '23 at 08:30
  • @Jean-BaptisteYunès The offset_between_elements is not changeable (it depends on how the data is written, and you have no control over that); the only thing you could do is copy the block to a better alignment so that the data can be read more efficiently, but I don't think that's possible. – EraZer Aug 29 '23 at 08:32
  • If the number of copies is small, you could also try copying small aligned blocks of size 8 (that you know contain your data) and then use a post-copy pass to keep only the useful data at the end (a kind of garbage elimination). – Jean-Baptiste Yunès Aug 29 '23 at 08:33
  • @paddy do you mean that I should add restrict to src and dest? – EraZer Aug 29 '23 at 08:39
  • Those micro-optimizations will not help, as the compiler will optimize it even better if the OP wants to give it a chance. – 0___________ Aug 29 '23 at 08:50