
I need to read two 1 MB+ binary files byte by byte and compare them. If they're not equal, I print the next 16 bytes starting at the first unequal byte. The requirement is that it all runs in just 5 ms. Currently, my program takes 19 ms when the first unequal byte is at the end of the two files. Are there any suggestions on how I can optimize it?

#include <stdio.h>  //printf
#include <unistd.h> //read(), access()
#include <fcntl.h>  //open()
#include <stdlib.h> //exit()
#include <time.h>   //clock

#define SIZE 4096

void compare_binary(int fd1, int fd2)
{   
    int cmpflag = 0;
    int errorbytes = 1;
    char c1[SIZE], c2[SIZE];
    int numberofbytesread = 1;

    while(read(fd1, &c1, SIZE) == SIZE && read(fd2, &c2, SIZE) == SIZE && errorbytes < 17){
        for (int i=0 ; i < SIZE ; i++) {
            if (c1[i] != c2[i] && cmpflag == 0){
                printf("Bytes not matching at offset %d\n",numberofbytesread);
                cmpflag = 1;
            }
            if (cmpflag == 1){
                printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
                errorbytes++;
            }
            if (errorbytes > 16){
                break;
            }
            numberofbytesread++;
        }
    }
}

int main(int argc, char *argv[])
{
    int fd[2];
    if (argc < 3){
        printf("Check the number of arguments passed.\n");
        printf("Usage: ./compare_binary <binaryfile1> <binaryfile2>\n");
        exit(0);
    }
    if (!((access(argv[1], F_OK) == 0) && (access(argv[2], F_OK) == 0))){
        printf("Please check if the files passed in the argument exist.\n");
        exit(0);
    }

    fd[0] = open(argv[1], O_RDONLY);
    fd[1] = open(argv[2], O_RDONLY);

    if (fd[0]< 0 && fd[1] < 0){
        printf("Can't open file.\n");
        exit(0);
    }
    clock_t t;
    t = clock();
    compare_binary(fd[0], fd[1]);
    t = clock() - t;
    double time_taken = ((double)t)/(CLOCKS_PER_SEC/1000);
    printf("compare_binary took %f milliseconds to execute \n", time_taken);
}

Basically, I need an optimized way to read and compare binary files over 1 MB so that the comparison finishes in under 5 ms.

  • Just mmap the files, compare, then munmap them (a sketch of this follows the comments). – technosaurus Apr 22 '18 at 19:33
  • 2+ MB in just 5 ms is over 400 MB/sec. What kind of disk are you reading from? – Andrew Henle Apr 22 '18 at 20:29
  • @technosaurus `mmap()` is usually one of the *slowest* ways to read a file sequentially. See https://marc.info/?l=linux-kernel&m=95496636207616&w=2 for a rather authoritative statement: "But ... (just copying the data once) is probably pessimal for mmap(). Linus" – Andrew Henle Apr 22 '18 at 20:33
  • And drop the calls to `access()`. First, they're unnecessary and you're wasting two system calls to check if the files are there. If they're not there, the `open()` call will fail. Second, "check X then do Y" is *always* [a TOCTOU bug](https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use). – Andrew Henle Apr 22 '18 at 20:37
  • @AndrewHenle: That post is from 2000. With `mmap(MAP_POPULATE)` you avoid the page faults, and modern CPUs might be different. Last time I tested `mmap` vs. `read`, it wasn't nearly that clear-cut, but I forget the exact results. IIRC I was testing a use-case that wanted a byte-swapped copy of a file, so only reading the actual file data once, but with not much work done per byte in the initial pass, so for `read` it was optimal to go in small chunks so data would still be hot in L2 or L1d after `read` copies from the pagecache into a user-space buffer. – Peter Cordes Apr 22 '18 at 21:06
  • FreeBSD has a similar mmap flag called `MAP_PREFAULT_READ` – technosaurus Apr 22 '18 at 21:24
  • In my tests, for sequential read of a cached file `mmap` is somewhat faster than `read()`, even without `MAP_POPULATE`. Recent kernels do "fault around", bringing in nearby pages when one faults, so you get far fewer than one page fault per page. – BeeOnRope Apr 23 '18 at 02:56
  • @PeterCordes No matter how much page faulting gets optimized, there's still more work involved in using `mmap()` than there is with `read()` calls. The big thing for this question, though, is the required throughput of 400 MB/sec. That's going to be hard to meet on most hardware no matter what method is used to get the data into userspace memory. – Andrew Henle Apr 23 '18 at 09:50
  • @AndrewHenle: Yes, IIRC I found that with a well-tuned `read` block size, it was faster than `mmap`. But I forget if doing one giant `read` of ~100MB and then looping over that was faster or not, even on a Skylake with fast DDR4-2666 (so memcpy bandwidth would be excellent, making the extra L3 cache misses from `read` less bad). And BTW, 400MB/sec is easy when your file data is hot in the pagecache. With small files, it's reasonable to expect them to stay hot in disk cache; I assumed that's what the OP meant, or they have a fast SSD. – Peter Cordes Apr 23 '18 at 10:03
  • @Andrew - how is there more work with `mmap`? There is different work for the two cases: the primary cost of `read` vs `mmap` is that it has to make a redundant copy from the page cache into the user-land buffer. So for the OPs question every byte is read twice and written once, and two thirds of that work is due to the read call. OTOH, `mmap` is zero copy so each byte is only read once. `mmap` comes at the cost of more page table manipulation, however. What's faster depends on the hardware costs and OS implementation. – BeeOnRope Apr 23 '18 at 14:32
  • BTW, I assume the disk speed is irrelevant here since the data will be cached as the OP repeatedly runs the benchmark - unless they are jumping through the hoops to drop the page cache. So 400 MB/s is well within reach: on my laptop I get about 10 GB/s for the `mmap` approach and a bit more than half that for `read`. Note that `read` has fallen further behind `mmap` on recent kernels due to the Meltdown and Spectre fixes, which increased the cost of system calls. – BeeOnRope Apr 23 '18 at 14:37
  • @Shantanu: what OS and hardware are you targeting? x86 Linux on a many-core Xeon? On a desktop? Or AMD? OR embedded ARM? What compiler? You are compiling with optimization enabled, right? `gcc -O3 -march=native` or whatever. – Peter Cordes Apr 24 '18 at 02:53
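
Putting the comment suggestions together, here is a minimal sketch of the mmap approach (Linux-specific; it uses MAP_POPULATE to prefault the mappings, drops the access() calls as Andrew Henle suggests, and assumes both files are regular files of equal size - the error handling and names are illustrative, not from the original post):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "Usage: %s <binaryfile1> <binaryfile2>\n", argv[0]);
        return 1;
    }

    /* No access() calls: if a file is missing, open() simply fails. */
    int fd1 = open(argv[1], O_RDONLY);
    int fd2 = open(argv[2], O_RDONLY);
    if (fd1 < 0 || fd2 < 0) {
        perror("open");
        return 1;
    }

    struct stat st1, st2;
    if (fstat(fd1, &st1) < 0 || fstat(fd2, &st2) < 0 || st1.st_size != st2.st_size) {
        fprintf(stderr, "fstat failed or file sizes differ\n");
        return 1;
    }
    size_t len = (size_t)st1.st_size;

    /* MAP_POPULATE (Linux) prefaults the mappings so the compare loop
       takes no page faults. */
    unsigned char *p1 = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd1, 0);
    unsigned char *p2 = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_POPULATE, fd2, 0);
    if (p1 == MAP_FAILED || p2 == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    if (memcmp(p1, p2, len) != 0) {      /* common case: equal, so this is skipped */
        size_t off = 0;
        while (p1[off] == p2[off])       /* memcmp says a difference exists */
            off++;
        printf("Bytes not matching at offset %zu\n", off);
        for (size_t i = off; i < len && i < off + 16; i++)
            printf("Byte Output %zu: 0x%02x 0x%02x\n", i - off + 1, p1[i], p2[i]);
    }

    munmap(p1, len);
    munmap(p2, len);
    close(fd1);
    close(fd2);
    return 0;
}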

2 Answers


First, try reading larger blocks. There's no point in performing so many read calls when you can read everything at once. Using 2 MB of memory is no big deal nowadays. Disk I/O calls are inherently expensive and their overhead is significant, but it can be reduced.

Second, try comparing integers (or even 64-bit words) instead of single bytes in each iteration; that significantly reduces the number of loop iterations. Once you find a mismatch, you can still switch to the byte-per-byte implementation. (Of course, some extra trickery is required if the file length is not a multiple of 4 or 8.)
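
For illustration, a minimal sketch of this word-at-a-time comparison (the helper name and buffer parameters are assumptions, not from the answer):

#include <stdint.h>
#include <string.h>

/* Returns the offset of the first differing byte in two equally sized
   buffers, or (size_t)-1 if they match.  memcpy into a uint64_t avoids
   alignment and strict-aliasing issues and compiles to a single load on
   mainstream compilers. */
static size_t first_diff(const unsigned char *buf1, const unsigned char *buf2, size_t len)
{
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t a, b;
        memcpy(&a, buf1 + i, 8);
        memcpy(&b, buf2 + i, 8);
        if (a != b)
            break;              /* the difference is within these 8 bytes */
    }
    for (; i < len; i++)        /* byte loop for the hit block and the tail */
        if (buf1[i] != buf2[i])
            return i;
    return (size_t)-1;
}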

PMF
  • I did try increasing the chunk size. I didn't notice a significant difference beyond 4096. How do you suggest I do the second point, comparing integers instead of chars? I'm not sure I got that. – Shantanu Mhapankar Apr 22 '18 at 20:12
  • @ShantanuMhapankar: Reading 4k or 8k at once is probably good; data from the first file can still be hot in L1d cache after reading a block from the 2nd. For point 2) use `memcmp` over your buffers. The common case is no differences, right? So use `memcmp` to see if that's true, and if not then find the position of the difference with a byte loop over that 4k block. (There's no version of `memcmp` that gives you the position of the first difference, unfortunately. But any good libc will have a hand-optimized asm `memcmp`, e.g. on x86 using AVX2 `vpcmpeqb` / `vpmovmskb` to check 32B chunks.) See the sketch after these comments. – Peter Cordes Apr 22 '18 at 21:11
  • @ShantanuMhapankar: [more about memcmp vs. byte loops](https://stackoverflow.com/questions/49950747/why-is-string-comparison-so-fast-in-python/49991933#49991933). Probably align both your buffers by 4k (or at least 64 bytes). Depending on your target platform, you might want to hand-vectorize your own `memcmp` that gives you the position if there is a difference, rather than just equal / not-equal, taking advantage of known alignment / size. Maybe even a sentinel difference byte at the end so you don't need a loop counter, just a vector compare. – Peter Cordes Apr 24 '18 at 02:32
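
To make the memcmp-per-block suggestion concrete, here is a sketch using the same read()-in-blocks structure as the question (the function name compare_binary_memcmp is an illustrative choice, not from the comments):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096

/* memcmp handles the common "blocks are equal" case with the library's
   optimized routine; the byte loop only runs on the single block that
   differs. */
static void compare_binary_memcmp(int fd1, int fd2)
{
    unsigned char b1[BLOCK], b2[BLOCK];
    long offset = 0;
    ssize_t n;

    while ((n = read(fd1, b1, BLOCK)) > 0 && read(fd2, b2, BLOCK) == n) {
        if (memcmp(b1, b2, (size_t)n) != 0) {
            int i = 0;
            while (b1[i] == b2[i])     /* memcmp says a difference exists */
                i++;
            printf("Bytes not matching at offset %ld\n", offset + i);
            /* like the original, this stops at the end of the block */
            for (int j = i, k = 1; j < n && k <= 16; j++, k++)
                printf("Byte Output %d: 0x%02x 0x%02x\n", k, b1[j], b2[j]);
            return;
        }
        offset += n;
    }
}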

The first thing that caught my eye is this:

if (cmpflag == 1){
    printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
    errorbytes++;
}
if (errorbytes > 16){
    break;
}

The cmpflag check is redundant; something like this may give a small optimization:

if (c1[i] != c2[i] && cmpflag == 0){
    printf("Bytes not matching at offset %d\n",numberofbytesread);
    printf("Byte Output %d: 0x%02x 0x%02x\n", errorbytes, c1[i], c2[i]);
    errorbytes++;
    if (errorbytes > 16){
        break;
    }
}

You can also use a built-in array-compare function such as memcmp, or increase your buffer size.

nima moradi