I'm writing a parser for binary files. The data is stored as consecutive 32-bit records. Each file only has to be read once, and the records are fed into the analysis algorithm as they are read.
Currently I read the file in chunks of 1024 records, to avoid as much of the overhead of frequent fread calls as possible. In the example below, oflcorrection, timetag and channel are the outputs passed to the algorithms, and the bool return value tells the algorithm whether to stop. Also note that not all records contain photons; only those with a non-negative channel value do.
With this approach I can process up to 0.5 GB/s, or 1.5 GB/s with the threaded versions of the algorithms, which split the file into pieces. I know my SSD can read at least 40% faster than that. I was thinking of using SIMD to parse several records in parallel, but I don't know how to do that with the conditional return clauses.
Do you know any other approach that would allow me to combine chunked reading and SIMD? Is there in general a better way of doing it?
Thanks
P.S. Each record is either a photon that reached one of the detectors after going through a beam splitter, or a special record that signals an overflow condition. The latter is needed because each record holds only a 25-bit timetag field, while the full timetags, with picosecond resolution, are kept in a uint64_t.
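For reference, the wraparound arithmetic that the overflow records undo, assuming one tick of the 25-bit field equals one picosecond as stated above. The field wraps every

$$2^{25} = 33\,554\,432\ \text{ps} \approx 33.55\ \mu\text{s},$$

and an overflow record whose timetag field holds $k$ advances the correction by $k$ wraps before the next timetag $t_{25}$ is added, which is what the parsing function computes:

$$\text{oflcorrection} \leftarrow \text{oflcorrection} + k \cdot 2^{25}, \qquad \text{timetag} = \text{oflcorrection} + t_{25}.$$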
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static inline bool next_photon(FILE *filehandle, uint64_t *RecNum,
                               uint64_t StopRecord, record_buf_t *buffer,
                               uint64_t *oflcorrection, uint64_t *timetag, int *channel)
{
pop_record:
    // __builtin_unpredictable is a Clang builtin
    while (__builtin_unpredictable(buffer->head < RECORD_CHUNK)) { // still have records in the buffer
        ParseHHT2_HH2(buffer->records[buffer->head], channel, timetag, oflcorrection);
        buffer->head++;
        (*RecNum)++;
        if (*RecNum >= StopRecord) { // run out of records
            return false;
        }
        if (*channel >= 0) { // found a photon
            return true;
        }
    }
    // run out of buffer: refill it
    buffer->head = 0;
    // fread(ptr, element size, element count, stream)
    if (fread(buffer->records, sizeof(uint32_t), RECORD_CHUNK, filehandle) == 0) {
        return false; // end of file or read error
    }
    goto pop_record;
}
```
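One thing to watch in next_photon: the loop bound buffer->head < RECORD_CHUNK assumes every refill delivers a full chunk, which is not the case for the last chunk of a file. Below is a minimal sketch of a chunked reader that remembers how many records each fread actually returned. chunk_buf_t, refill and next_record are hypothetical names, not part of the code above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RECORD_CHUNK 1024 /* records per fread call, as in the question */

/* Hypothetical buffer type: `count` holds how many records the last
   fread delivered, so a short final chunk is never read past its end. */
typedef struct {
    uint32_t records[RECORD_CHUNK];
    size_t count; /* valid records in `records` */
    size_t head;  /* next record to consume */
} chunk_buf_t;

/* Refill the buffer; returns false on end of file or read error. */
static bool refill(FILE *f, chunk_buf_t *buf)
{
    /* fread returns the number of whole elements read, here whole
       32-bit records. */
    buf->count = fread(buf->records, sizeof(uint32_t), RECORD_CHUNK, f);
    buf->head = 0;
    return buf->count > 0;
}

/* Fetch the next raw record; returns false once the file is exhausted.
   A zero-initialized buffer (count == head == 0) triggers the first read. */
static bool next_record(FILE *f, chunk_buf_t *buf, uint32_t *out)
{
    if (buf->head == buf->count && !refill(f, buf))
        return false;
    *out = buf->records[buf->head++];
    return true;
}
```

With 2500 records in the file this performs two full reads and one short read of 452 records, and the last call returns false instead of reparsing stale buffer contents.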
Please find below the parsing function. Keep in mind that I can't do anything about the file format. Thanks again, Guillem.
```c
static inline void ParseHHT2_HH2(uint32_t record, int *channel,
                                 uint64_t *timetag, uint64_t *oflcorrection)
{
    const uint64_t T2WRAPAROUND_V2 = 33554432;
    union {
        uint32_t allbits;
        struct {
            unsigned timetag : 25;
            unsigned channel : 6;
            unsigned special : 1;
        } bits;
    } T2Rec;
    T2Rec.allbits = record;
    if (T2Rec.bits.special) {
        if (T2Rec.bits.channel == 0x3F) { // an overflow record
            if (T2Rec.bits.timetag != 0) {
                *oflcorrection += T2WRAPAROUND_V2 * T2Rec.bits.timetag;
            } else { // if it is zero it is an old-style single overflow
                *oflcorrection += T2WRAPAROUND_V2; // should never happen with new firmware!
            }
            *channel = -1;
        } else if (T2Rec.bits.channel == 0) { // sync
            *channel = 0;
        } else if (T2Rec.bits.channel <= 15) { // markers
            *channel = -2;
        }
    } else { // regular input channel
        *channel = T2Rec.bits.channel + 1;
    }
    *timetag = *oflcorrection + T2Rec.bits.timetag;
}
```
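The C standard leaves the order in which bitfields are allocated within a storage unit implementation-defined, so the union above only matches the file format on compilers that allocate from the least significant bit, which is what GCC and Clang do on x86. Mask-and-shift decoding avoids that dependency and is also the form SIMD code needs. A minimal sketch of a check that both decodings agree on a given toolchain; t2_record_t and bitfield_layout_matches are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the union in ParseHHT2_HH2. */
typedef union {
    uint32_t allbits;
    struct {
        unsigned timetag : 25;
        unsigned channel : 6;
        unsigned special : 1;
    } bits;
} t2_record_t;

/* True if the bitfields hold the same values as explicit masks and
   shifts: timetag in bits 0-24, channel in bits 25-30, special in bit 31. */
static bool bitfield_layout_matches(uint32_t r)
{
    t2_record_t u;
    u.allbits = r;
    return u.bits.timetag == (r & 0x1FFFFFFu)
        && u.bits.channel == ((r >> 25) & 0x3Fu)
        && u.bits.special == (r >> 31);
}
```

If this returns true for a few representative records, a mask-based parser reads the same fields as the union on that compiler.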
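I came up with an almost branchless version of the parsing function, but it doesn't give any speedup. It replaces everything after T2Rec.allbits = record; in ParseHHT2_HH2:
```c
    if (T2Rec.bits.special && T2Rec.bits.channel == 0x3F) { // an overflow record
        *oflcorrection += T2WRAPAROUND_V2 * T2Rec.bits.timetag;
    } // old-style single overflows (timetag == 0) are not handled here
    // special records map to -channel (sync -> 0, markers and overflows negative),
    // regular input channels map to channel + 1
    *channel = (!T2Rec.bits.special) * (T2Rec.bits.channel + 1)
               - T2Rec.bits.special * T2Rec.bits.channel;
    *timetag = *oflcorrection + T2Rec.bits.timetag;
}
```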
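Even with a branchless parser, next_photon still takes one data-dependent branch per record (the early return on a photon). One common way to combine chunked reading with SIMD is to split each chunk into two passes: decode every record into plain arrays in a loop with no early exit, then compact the photon entries without branching. A minimal sketch under these assumptions: the bit layout is the one GCC and Clang give the union on x86 (timetag in bits 0-24, channel in bits 25-30, special in bit 31), the channel encoding follows the branchless variant, and decode_chunk and compact_photons are hypothetical names. The running overflow sum is a loop-carried dependency, so a compiler may vectorize only part of the first loop, if any; inspect the generated code and measure.

```c
#include <stddef.h>
#include <stdint.h>

#define T2WRAPAROUND_V2 33554432u /* 2^25, as in ParseHHT2_HH2 */

/* Pass 1: decode n records with no early exit.
   Regular input c -> c + 1, special record c -> -c (sync -> 0).
   Old-style overflows (timetag field 0) are not handled.
   Returns the overflow correction to carry into the next chunk. */
static uint64_t decode_chunk(const uint32_t *rec, size_t n, uint64_t ofl,
                             int32_t *channel, uint64_t *timetag)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t r = rec[i];
        uint32_t tt = r & 0x1FFFFFFu;              /* 25-bit timetag field */
        int32_t ch = (int32_t)((r >> 25) & 0x3Fu); /* 6-bit channel field */
        int32_t special = (int32_t)(r >> 31);      /* 0 or 1 */
        /* all ones for an overflow record, zero otherwise */
        uint64_t ofl_mask = (uint64_t)0 - (uint64_t)(special & (ch == 0x3F));
        ofl += ofl_mask & ((uint64_t)T2WRAPAROUND_V2 * tt);
        channel[i] = (1 - special) * (ch + 1) - special * ch;
        timetag[i] = ofl + tt;
    }
    return ofl;
}

/* Pass 2: branch-free compaction. Each entry is written unconditionally
   and the output index only advances for photons (channel >= 0, the same
   test next_photon uses). out_ch and out_tt must have room for n entries. */
static size_t compact_photons(const int32_t *channel, const uint64_t *timetag,
                              size_t n, int32_t *out_ch, uint64_t *out_tt)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out_ch[k] = channel[i];
        out_tt[k] = timetag[i];
        k += (size_t)(channel[i] >= 0);
    }
    return k;
}
```

The analysis would then walk out_ch and out_tt in a tight loop, decoding the next chunk (starting from the returned overflow correction) when it runs out. Whether this beats the one-record-at-a-time version depends on how often the photon branch is mispredicted, so it is worth benchmarking on real data.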