
I need to calculate the length of the longest sequence of zero bytes in a binary file as fast as possible. I have a basic implementation in C++ below:

#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>

int get_max_zero_streak(const std::string& fname)
{
  std::ifstream myfile(fname, std::ios_base::binary);
  int length = 0;
  int streak = 0;
  int c; // get() returns an int so EOF can be distinguished from a byte
  while((c = myfile.get()) != std::char_traits<char>::eof())
  {
    if(c == 0)
    {
      streak += 1;
    }
    else
    {
      length = std::max(length, streak);
      streak = 0;
    }
  }
  return std::max(length, streak); // a run that reaches EOF counts too
}

int main() 
{
  std::cout << get_max_zero_streak("000_c.aep") << std::endl;
  std::cout << get_max_zero_streak("000_g1.aep") << std::endl;
  std::cout << get_max_zero_streak("000_g2.aep") << std::endl;
  std::cout << get_max_zero_streak("001_c.aep") << std::endl;
  std::cout << get_max_zero_streak("001_g1.aep") << std::endl;
  std::cout << get_max_zero_streak("001_g2.aep") << std::endl;
  std::cout << get_max_zero_streak("002_c.aep") << std::endl;
  std::cout << get_max_zero_streak("002_g1.aep") << std::endl;
  std::cout << get_max_zero_streak("002_g2.aep") << std::endl;
  return 0;
}

This is OK on smaller files, but is extraordinarily slow on larger files (such as 50 GB+). Is there a more efficient way to write this, or is parallelizing it my only hope? I'm reading the files from an NVMe SSD, so I don't think the drive read speed is the limitation.

joejoejoejoe4
  • Unless you have some groovy disk hardware, parallel file access tends to slow things down as the threads compete over access to the disk. – user4581301 Aug 10 '22 at 04:00
  • Searching is a broad topic. I'm sure there are many ways to improve on what you're doing. I'd personally read in more than one byte at a time, and I'd skip anything that is shorter than my current longest length. – Taekahn Aug 10 '22 at 04:01
  • But if you read the whole dang file into one big buffer (or chunks of the file into the big buffer if the file's too big to eat all at once) and divide up the buffer among multiple threads, maybe you'll see some improvement. Most likely the cost of reading in the file will dwarf the processing, though. – user4581301 Aug 10 '22 at 04:02
  • The hard part will be figuring out how to combine the runs that span multiple blocks and threads. – user4581301 Aug 10 '22 at 04:03
  • The solution is very dependent on the maximum file size. Is there one? If the file fits in memory, read the whole file at once; otherwise use memory-mapped files. – Pepijn Kramer Aug 10 '22 at 04:14
  • @user4581301 I think one thread can even be faster, since the CPU will be able to do cache prediction and start preloading data. In many (PC) systems the memory access will be the limiting factor not the processing power. In the end performance is ALWAYS about measuring and determining the next bottleneck to remove. – Pepijn Kramer Aug 10 '22 at 04:17
  • Not going to argue that one. Always helpful to start with the dumbest, stupidest solution that you think can possibly work and see how fast it is as a baseline. It's surprising how often it turns out to be fast enough, and how hard you have to work to make a faster solution when it isn't fast enough. – user4581301 Aug 10 '22 at 05:55

2 Answers


The main bottleneck of this code is the byte-per-byte IO reads, which prevent the loop from saturating most SSDs (by a large margin). Indeed, calling myfile.get() for every byte is slow, and compilers do not optimize it away in practice. A solution is to read data chunk by chunk and use an inner loop operating on the current chunk. Chunks must be large enough to amortize the cost of the IO calls and small enough for the data to stay in cache (a buffer of 32 KiB–256 KiB should be good enough).
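For instance, a chunked rewrite of the question's function might look like this (the function name, the 64 KiB buffer size, and the handling of a run that reaches EOF are my choices, not taken from the question):

```cpp
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Same logic as the original, but one read() per 64 KiB chunk
// instead of one get() call per byte.
int get_max_zero_streak_chunked(const std::string& fname)
{
  std::ifstream myfile(fname, std::ios_base::binary);
  std::vector<char> buf(1 << 16);            // 64 KiB chunk
  int length = 0;
  int streak = 0;
  while (myfile)
  {
    myfile.read(buf.data(), buf.size());
    std::streamsize got = myfile.gcount();   // bytes actually read
    for (std::streamsize i = 0; i < got; ++i)
    {
      if (buf[i] == 0)
        streak += 1;
      else
      {
        length = std::max(length, streak);
        streak = 0;
      }
    }
  }
  return std::max(length, streak);           // a trailing run counts too
}
```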

Another big bottleneck is the computing loop itself: conditions are slow on modern processors unless they can be predicted (e.g. if most values are 0 or 1 and there are not many unpredictable switches between the two). Branchless code should make the loop significantly faster, but the loop-carried dependency then becomes the main issue. Clang is able to generate a branchless assembly loop, but GCC does not manage this optimization. Still, the loop stays serial and scalar, so the processor cannot process bytes faster than its own frequency; in fact, much slower than that due to instruction latency and the dependencies. An efficient solution to this problem is to operate on multiple bytes at the same time. One option is x86 SIMD instructions (x86-specific and fast). Another is to load 8 bytes and operate on a large integer using bit tricks, but this is tricky to do efficiently and I doubt it would be much faster in the end.
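As an illustration, here is a branchless scalar version of the inner loop (a hypothetical helper operating on an in-memory buffer): the ternary and the max typically compile to conditional moves rather than unpredictable branches, but the streak update remains a loop-carried dependency.

```cpp
#include <algorithm>
#include <cstddef>

// Branchless scalar scan: no data-dependent branch in the loop body,
// but still limited by the serial dependency on `streak`.
int max_zero_run_branchless(const unsigned char* buf, std::size_t n)
{
  int best = 0;
  int streak = 0;
  for (std::size_t i = 0; i < n; ++i)
  {
    streak = (buf[i] == 0) ? streak + 1 : 0;  // typically arithmetic/cmov
    best = std::max(best, streak);            // typically a cmov
  }
  return best;
}
```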

Regarding the SIMD solution, the idea is to load data in blocks of 16 bytes, compare all 16 bytes to 0 simultaneously, and set one bit of an integer mask per comparison result. You can then count the trailing set bits very quickly (with generic code, or with compiler built-ins like __builtin_ctz that map to a fast x86 instruction). Finally, you shift the result according to the number of trailing bits found. This solution is not very fast when there are many 0/non-zero switches, but it is very fast when there are large blocks of zeros (or of non-zeros). Note that with AVX2 it is even possible to process 32 bytes at a time. The SIMD intrinsics used for this operation are _mm_load_si128, _mm_cmpeq_epi8 and _mm_movemask_epi8.
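A sketch of that SSE2 approach (x86-only; for brevity it assumes the buffer length is a multiple of 16, with the tail handled separately in real code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Longest run of zero bytes in buf[0..n), 16 bytes per iteration.
// Assumes n is a multiple of 16; a real version would handle the tail.
int max_zero_run_sse2(const unsigned char* buf, std::size_t n)
{
  int best = 0;
  int streak = 0;  // run of zeros continuing from previous blocks
  const __m128i vzero = _mm_setzero_si128();
  for (std::size_t i = 0; i < n; i += 16)
  {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
    // Bit k of mask is set iff buf[i + k] == 0.
    uint32_t mask = static_cast<uint32_t>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(v, vzero)));
    if (mask == 0xFFFF) { streak += 16; continue; }  // whole block is zero
    streak += __builtin_ctz(~mask);                  // zeros at block start
    best = std::max(best, streak);
    uint32_t m = mask;                               // longest run inside
    int run = 0;
    while (m) { m &= m >> 1; ++run; }
    best = std::max(best, run);
    streak = __builtin_clz(~(mask << 16));           // zeros at block end
  }
  return std::max(best, streak);
}
```

The `m &= m >> 1` loop measures the longest run of set bits inside one mask; `__builtin_ctz`/`__builtin_clz` handle the runs that continue across block boundaries.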

Note that you can use a lookup table (LUT) to speed things up when there are a lot of switches. This is not so simple, though, because the zero counts of two consecutive masks need to be combined. The LUT can encode the number of pending zeros carried over from the previous mask, making it even faster: the idea is to find the maximum number of consecutive zero bits in an 8-bit mask (representing 8 bytes), given the number of zero bytes pending from before.

Another possible optimization is to compute the chunks in parallel. The idea is to reduce each chunk to three values: the number of trailing (pending) zero bytes, the number of leading zero bytes, and the maximum run of consecutive zero bytes found between the two. Note that the special case where all bytes of a chunk are zero must be handled. These per-chunk values are stored temporarily, and a final pass at the end of the file merges the results of all chunks. This can be implemented quite simply with OpenMP tasks (one per chunk).
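A sketch of that reduction, with hypothetical names: each chunk is summarized independently (this is the part that can run in parallel), and the summaries are then merged in order. The all-zero special case is handled by checking `lead == size`.

```cpp
#include <algorithm>
#include <cstddef>

// Per-chunk summary: enough information to merge results across chunks.
struct ZeroSummary
{
  long lead;   // zero bytes at the start of the chunk
  long best;   // longest run of zero bytes anywhere in the chunk
  long trail;  // zero bytes at the end of the chunk
  long size;   // chunk length; lead == size means the chunk is all zeros
};

// Sequential scan of one chunk (each chunk can be processed in parallel).
ZeroSummary summarize(const unsigned char* p, long n)
{
  ZeroSummary s{0, 0, 0, n};
  while (s.lead < n && p[s.lead] == 0) ++s.lead;
  while (s.trail < n && p[n - 1 - s.trail] == 0) ++s.trail;
  long streak = 0;
  for (long i = 0; i < n; ++i)
  {
    streak = (p[i] == 0) ? streak + 1 : 0;
    s.best = std::max(s.best, streak);
  }
  return s;
}

// Merge two adjacent chunks. The operation is associative, so the final
// pass can combine chunk summaries pairwise in any grouping.
ZeroSummary combine(const ZeroSummary& a, const ZeroSummary& b)
{
  ZeroSummary r;
  r.size = a.size + b.size;
  r.lead = (a.lead == a.size) ? a.size + b.lead : a.lead;     // a all zeros?
  r.trail = (b.trail == b.size) ? b.size + a.trail : b.trail; // b all zeros?
  r.best = std::max({a.best, b.best, a.trail + b.lead});      // boundary run
  return r;
}
```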

Jérôme Richard

As a starting point, I'd read a large chunk of data at once (or if you're on Linux memory map the file1).

Then note that we don't care about the length of every run of zero bytes. We only care about finding a run longer than the longest we've found yet. We can use that to improve speed considerably.

For example, let's assume that starting from byte 0, we find a run of 26 zero bytes (so byte 26 is the first non-zero byte).

So the next possible starting point is byte 27. But we don't start counting the length of a run from there. At this point, we only care about whether there is a run of at least 27 zeros starting at byte 27, and for that to be the case, bytes 27 through 53 all have to be zero. So we can jump straight to byte 53 and see if it's a zero. Then if (and only if) it's zero, we step backward through the preceding bytes to find the beginning of this run of zeros, compute the location of the earliest end that would give a new longest run starting from that point, and repeat.

But in the highly likely case (255/256, if the data is random) that byte 53 is non-zero, we can skip ahead another 27 bytes (to byte 80) and see if it's zero or not. Again, most likely it's not, in which case we skip ahead another 27 bytes.

As the current maximum length grows, the efficiency of this algorithm grows with it. Most of the time we're looking at only one out of every length bytes of the data. If length gets very large, there may easily be entire sectors we never need to read in from secondary storage at all.
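The skipping scheme above can be sketched as follows (a hypothetical helper over an in-memory buffer; probes are spaced so that any run longer than the current best must contain a probed byte):

```cpp
#include <algorithm>
#include <cstddef>

// Probe roughly one byte per current-best-length window; only when a
// probe hits a zero do we expand backward and forward to measure the run.
std::size_t max_zero_run_skip(const unsigned char* buf, std::size_t n)
{
  std::size_t best = 0;
  std::size_t i = 0;
  while (i < n)
  {
    if (buf[i] != 0)
    {
      i += best + 1;     // skip: a new record run must cover a later probe
      continue;
    }
    std::size_t lo = i;  // walk back to the start of this run of zeros
    while (lo > 0 && buf[lo - 1] == 0) --lo;
    std::size_t hi = i;  // and forward to its end
    while (hi + 1 < n && buf[hi + 1] == 0) ++hi;
    best = std::max(best, hi - lo + 1);
    i = hi + 2 + best;   // buf[hi + 1] is non-zero; next run starts at hi + 2
  }
  return best;
}
```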


  1. In theory you can memory map the file on Windows as well, but at least in my experience, memory mapping tends to be a big win on Linux, but rarely gives much improvement on Windows. If you're using Windows, you're generally better off calling CreateFile directly, and passing FILE_FLAG_NO_BUFFERING. See Optimizing an I/O bound Win32 application for more details.
Jerry Coffin