
I'm writing a parser for binary files. The data is stored in consecutive 32-bit records. The files only have to be read once, and as this is done the data is fed into the analysis algorithm.

Now I am reading the file in chunks of 1024 records to amortize the overhead of calling fread more often than necessary. In the example below I use oflcorrection, timetag and channel as outputs for the algorithms and use the bool return value to check whether the algorithm should stop. Also note that not all the records contain photons, only those with non-negative channel values.

With this approach I can process at up to 0.5 GB/s, or 1.5 GB/s if I use the threaded version of the algorithms, which breaks the file into pieces. I know my SSD can read at least 40% faster. I was thinking of using SIMD to parse several records in parallel, but I don't know how to do that with the conditional return clauses.

Do you know any other approach that would allow me to combine chunked reading and SIMD? Is there in general a better way of doing it?

Thanks

P.S. The records correspond either to photons arriving at detectors after going through a beam splitter, or to a special record that indicates an overflow condition. The latter is needed because timetags are stored with picosecond resolution in a uint64_t.

static inline bool next_photon(FILE* filehandle, uint64_t *RecNum,
                               uint64_t StopRecord, record_buf_t *buffer,
                               uint64_t *oflcorrection, uint64_t *timetag, int *channel)
{
    pop_record:
    while (__builtin_unpredictable(buffer->head < RECORD_CHUNK)) { // still have records on buffer
        ParseHHT2_HH2(buffer->records[buffer->head], channel, timetag, oflcorrection);
        buffer->head++;
        (*RecNum)++;

        if (*RecNum >= StopRecord) { // run out of records
            return false;
        }

        if (*channel >= 0) { // found a photon
            return true;
        }
    }
    // run out of buffer
    buffer->head = 0;
    fread(buffer->records, RECORD_CHUNK, sizeof(uint32_t), filehandle);
    goto pop_record;
}

Please find below the parsing function. Keep in mind that I can't do anything about the file format. Thanks again, Guillem.

static inline void ParseHHT2_HH2(uint32_t record, int *channel,
                                 uint64_t *timetag, uint64_t *oflcorrection)
{
    const uint64_t T2WRAPAROUND_V2 = 33554432;
    union{
        uint32_t   allbits;
        struct{ unsigned timetag  :25;
            unsigned channel  :6;
            unsigned special  :1;
        } bits;
    } T2Rec;

    T2Rec.allbits = record;

    if(T2Rec.bits.special) {
        if(T2Rec.bits.channel==0x3F) {  //an overflow record
            if(T2Rec.bits.timetag!=0) {
                *oflcorrection += T2WRAPAROUND_V2 * T2Rec.bits.timetag;
            }
            else {  // if it is zero it is an old style single overflow
                *oflcorrection += T2WRAPAROUND_V2;  //should never happen with new Firmware!
            }
            *channel = -1;
        } else if(T2Rec.bits.channel == 0) {  //sync
            *channel = 0;
        } else if(T2Rec.bits.channel<=15) {  //markers
            *channel = -2;
        }
    } else {//regular input channel
        *channel = T2Rec.bits.channel + 1;
    }
    *timetag = *oflcorrection + T2Rec.bits.timetag;
}

I came up with an almost branchless parsing function, but it doesn't produce any speedup:

    if(T2Rec.bits.channel==0x3F) {  //an overflow record
        *oflcorrection += T2WRAPAROUND_V2 * T2Rec.bits.timetag;
    }
    *channel = (!T2Rec.bits.special) * (T2Rec.bits.channel + 1) - T2Rec.bits.special * T2Rec.bits.channel;
    *timetag = *oflcorrection + T2Rec.bits.timetag;
}
GuillemB
    The label and goto statement are undesirable; it is not clear that they're needed. You could perfectly well use `for (;;)` or `while (1)` around the body of the label/goto loop. You should not ignore the return value from `fread()`; it tells you how much, if any, data was read. You cannot write reliable code if you ignore that return value. – Jonathan Leffler Dec 06 '17 at 14:19
    What does `ParseHHT2_HH2(buffer->records[buffer->head], channel, timetag, oflcorrection);` do? BTW: passing and dereferencing the pointers looks costly. – wildplasser Dec 06 '17 at 14:20
  • @JonathanLeffler I just liked how the goto statement looked, nothing else. With respect to the return value of fread: the header of the file states how many records it contains, so StopRecord avoids the problem of trying to read more records than are available. – GuillemB Dec 06 '17 at 14:30
  • @wildplasser ParseHHT2_HH2 takes a 32bit record from the buffer and puts into channel and timetag the channel at which it was detected and the time of arrival. Oflcorrection takes care of the time overflows in the timetags. I will edit the question adding the parsing function. – GuillemB Dec 06 '17 at 14:34
    You are supposing, @GuillemB, that your files are always well formed and that no I/O errors occur. Neither of those is a safe assumption. Check the return values of your function calls. – John Bollinger Dec 06 '17 at 14:34
  • @JonathanLeffler and JohnBollinger. You are right, of course; I was so obsessed with trying to make it go fast that I thought another if would kill me. Of course it doesn't, as it is called very infrequently. On that topic, the cost of the if clause on the channel conditions is huge. By eliminating it (and of course killing the algorithms that come after) I bump my parsing speed by a factor of 2 on a trivial function that only reads the total number of photons in the file. – GuillemB Dec 06 '17 at 14:51
  • @GuillemB, yes, indirect accesses (via pointers) are likely to be far more costly than direct ones, especially to local variables. – John Bollinger Dec 06 '17 at 14:55
    Also: the amount of conditions inside the main loop (plus the amount of *active expressions*) will effectively trash the branch prediction. In any case: profile & check the generated assembler source. – wildplasser Dec 06 '17 at 14:59
  • I was assuming that the branch predictor was completely trashed. For a typical experiment, overflow records, channel 0 photons and channel 1 photons are in a ratio of 1:1.5:1.5, and going into either channel 0 or channel 1 is a truly random process. I tried using the clang __builtin_unpredictable to mark the random ifs but it made little to no difference. – GuillemB Dec 06 '17 at 15:12
    Have you tested your performance without the data processing, i.e. don't call `ParseHHT2_HH2` but just add all the 32 bit values together (or some other simple low cost action). This will tell you how much you can expect from the file system. – Support Ukraine Dec 06 '17 at 17:24
  • `Now I am reading the file in chunks of 1024 records` -->> increment your buffer size to **at least** a few MB. – wildplasser Dec 06 '17 at 19:20

3 Answers


I/O is very likely to dominate the runtime of your function. That said, you should first measure the speed without parsing, i.e. just the fread. It probably won't differ much from the speed including parsing.

If so, you can concentrate on optimizing that bottleneck first. Look into the Linux tool fio, in particular with different --ioengine= options (also libaio). In case you are using an NVMe disk, look into the Intel SPDK.

Apart from that, you can optimize the parsing further. You can avoid both the (*RecNum)++ and, more importantly, the first if clause within the loop: after the fread you know how many records you will read, so you can use that information.

Furthermore, I would not iterate over buffer->head but would use a local variable for that, in a for loop.

I would also use a local variable for *RecNum and only assign to *RecNum at the end. If you are aiming at writing to *RecNum in parallel, your code would be buggy anyway, because neither your increment nor your read uses an atomic operation.

Only then should you start thinking about SSE or AVX. If you have mostly zeroes in *channel, you could use SSE/AVX to check 16 or more bytes at once for being greater than or equal to zero.

Update:
Now that you have provided the code of your parsing function, I can see that the situation is different. Many branches there...

Update:
Here is an implementation of the optimizations for next_photon that I mean. If buffer->head == 0 is guaranteed when entering next_photon, it can be simplified. I assume you don't check the return value of fread on purpose, because you want to handle that with StopRecord only, so I left it like that even though it's not safe.

static inline bool next_photon(FILE* filehandle, uint64_t *RecNum,
                            uint64_t StopRecord, record_buf_t *buffer,
                            uint64_t *oflcorrection, uint64_t *timetag,
                            int *channel)
{
    uint64_t recNum = *RecNum;
    int i = buffer->head;

    while (true) {
        int records;
        bool quit;

        if (StopRecord - recNum <= RECORD_CHUNK - i) {
            records = i + StopRecord - recNum;
            quit = true;
        } else {
            records = RECORD_CHUNK;
            quit = false;
        }

        const int i0 = i;

        for (; i < records; i++) { // still have records on buffer
            ParseHHT2_HH2(buffer->records[i], channel, timetag, oflcorrection);

            if (*channel >= 0) { // found a photon
                *RecNum = recNum + i - i0 + 1;
                buffer->head = i + 1;
                return true;
            }
        }

        recNum += records - i0;

        if (quit) {
            break;
        }

        // run out of buffer
        i = 0;
        fread(buffer->records, RECORD_CHUNK, sizeof(uint32_t), filehandle);
    }

    *RecNum = recNum;
    buffer->head = i;
    return false;
}
Pedro
  • Except it seems clear that the OP's code is *not* I/O bound, for he sees a substantial speedup from parallelizing the analysis, and even then he is not (he thinks) saturating his I/O bandwidth. – John Bollinger Dec 06 '17 at 14:31
  • How come you can be 100% sure about that? Optimizing SSD access is not that trivial. We don't know the choice of `RECORD_CHUNK`. – Pedro Dec 06 '17 at 14:34
  • That parallelizing the analysis increases the throughput by a factor of 3 shows that the analysis cost is of the same order of magnitude as the I/O cost. If the program were strictly I/O bound then speeding the analysis part would not significantly improve overall throughput. – John Bollinger Dec 06 '17 at 14:40
  • For a 1.5 GB file, the whole thing is parsed in 3 seconds by the single-threaded program. If I just return from the parsing function and skip the actual parsing, it takes around 0.9 seconds. The simpler algorithms used to analyze the data do take 3 seconds. RECORD_CHUNK is 1024 and each record is 4 bytes, so I'm reading 4 kB chunks. Any multiple of 1024 gives me similar performance. – GuillemB Dec 06 '17 at 15:04
  • @JohnBollinger: Well, it could be that the threading sped up the read performance. But since `fread` does cached reading (cached by the OS), you are right, I guess. With `O_DIRECT`, the situation would be completely different. What I actually wanted to say: just make sure that with just the fread you can reach the max read bandwidth of your disk. But probably he can. – Pedro Dec 06 '17 at 16:27
  • @PedramAzad I know there are a lot of branches... I made an almost branchless parser; see the edit above. To my surprise, this doesn't produce any speedup. The only thing that seems to make a difference is getting rid of the ifs in the next_photon function. That's why I would like a more abstract answer in terms of how the computation could be rearranged to possibly use SIMD together with the chunking. – GuillemB Dec 06 '17 at 16:40
  • I guess the statistics of the flags/data in your file are the reason that removing those ifs doesn't change much (those are nested, compared to the ones in `next_photon`, which are always evaluated). I still have to say, I wouldn't concentrate on SIMD here that much. Btw, four threads because you only have four cores on your CPU? Decoupling reading from processing as suggested by @JohnBollinger could help, in fact. – Pedro Dec 06 '17 at 17:16
  • I'm coding on a 2-core i7 machine. From 1 thread to 2 there is a big speedup; going from 2 to 4 gets you a bit less than 20% over the 2-thread case. I have some trouble seeing how @JohnBollinger's suggestion would be implemented. As I understand it, I would reserve one thread for reading from the file and use the others for computation. I don't understand why this would be better than having the 4 threads work on 4 separate sections of the file. Could you help me understand? – GuillemB Dec 06 '17 at 17:29
  • Good point. Now that I think about it, I don't see that either. – Pedro Dec 06 '17 at 19:12

You are accessing the disk in a loop, and I don't think SIMD will help much there; you could use mmap.

Check these answers:

When should I use mmap for file access?

Fastest file reading in C

But you could also use SIMD (SSE/AVX/NEON) for other parts, e.g. in the parsing code.

recp

That speeding up the data analysis by parallelizing it has such a dramatic effect on your program's throughput shows that the data-analysis cost is of the same order of magnitude as the I/O cost. Therefore, if you want to bring its throughput closer to the limit imposed by your available I/O bandwidth, the best course of action would probably be to perform analysis and I/O in parallel.

You can do that by maintaining two separate I/O buffers, processing one while you read into the other, and then flipping.

John Bollinger
  • An example of a more complex analysis of the file is computing correlations between the arrival times of photons in the two channels. This amounts to computing the delta between arrival times and putting it into a histogram. The 4-threaded version of this algorithm takes 1.4 seconds (again for the 1.5 GB file), compared to 1 second for the photon-counting function (with 4 threads). I was very surprised by the fact that eliminating the if(*channel>=0) boosted the speed of the count_photon function to 0.6 seconds. – GuillemB Dec 06 '17 at 15:18