Compare file with archived file

Question

I want to check if file and its archive version is the same. I created something like this:

public static class FileUtils
{
    public static bool SameAsArchive(this FileInfo file, string archivedFile)
    {
        using (var ms = new MemoryStream())
        {
            GZip.Decompress(File.OpenRead(archivedFile), ms, true);
            return File.ReadAllBytes(file.FullName).SequenceEqual(ms.ToArray());
        }
    }
}

Is there any faster way of checking that instead of reading all bytes?

Edit

Thanks to @Stig I've created a new version:

public static bool SameAsArchive(this FileInfo file, string archive)
{
    var bytesToRead = 4096;

    var one = new byte[bytesToRead];
    var two = new byte[bytesToRead];

    using (var gs = new GZipStream(File.OpenRead(archive), CompressionMode.Decompress))
    using (var fs = File.OpenRead(file.FullName))
    {
        int file1byte;
        int file2byte;

        do
        {
            file1byte = fs.Read(one);
            file2byte = gs.Read(two);
        }
        while (one.SequenceEqual(two) && (file1byte != 0));

        return file1byte == file2byte && file1byte == 0;
    }
}

But it seems not work properly. For some reason, sometimes I do not read full 4096 bytes from GZipStream:

// This is log how many bytes are read in each `do while` loop iteration

read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 770
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 665
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 4096
read bytes from fs: 4096,   read bytes from gs: 853
read bytes from fs: 4096,   read bytes from gs: 4096

I noticed, that the problem exists only using .NET6. With .net core 3.1 this example works properly:

static string GenerateContent()
{
    var rnd = new Random();
    var sb = new StringBuilder();

    for (int i = 0; i < 10000000; i++)
    {
        sb.Append(rnd.Next(0, 100));
    }

    return sb.ToString();
}

static void Compress(string input, string output)
{
    using (var originalFileStream = File.OpenRead(input))
    using (var compressedFileStream = File.OpenWrite(output))
    using (var compressor = new GZipStream(compressedFileStream, CompressionMode.Compress))
        originalFileStream.CopyTo(compressor);
}

static bool AreFilesEqual(string input, string gzip)
{
    var bytesToRead = 4096;

    var one = new byte[bytesToRead];
    var two = new byte[bytesToRead];

    using (var gs = new GZipStream(File.OpenRead(gzip), CompressionMode.Decompress))
    using (var fs = File.OpenRead(input))
    {
        int file1byte;
        int file2byte;

        do
        {
            file1byte = fs.Read(one);
            file2byte = gs.Read(two);
        }
        while (one.SequenceEqual(two) && (file1byte != 0));

        return file1byte == file2byte && file1byte == 0;
    }
}

static void Main(string[] args)
{
    var input = @"c:\logs\input3.txt";
    var output = @"c:\logs\example3.gz";

    // create input
    File.WriteAllText(input, GenerateContent());

    // compress input
    Compress(input, output);

    // compare files
    var areFilesEqual = AreFilesEqual(input, output);

    // .NET 6.0 -> files aren't equal
    // .NET core 3.1 -> files are equal
}

Seems like Read does not always return requested amount of bytes. I created simple extension that forces missing bytes to be read:

public static class Extensions
{
    public static int ForceRead(this Stream fs, Span<byte> buffer)
    {
        var totalReadBytes = 0;

        do
        {
            var readBytes = fs.Read(buffer, totalReadBytes, buffer.Length - totalReadBytes);

            if (readBytes == 0)
                return totalReadBytes;

            totalReadBytes += readBytes;
        }
        while (totalReadBytes < buffer.Length);

        return totalReadBytes;
    }
}

You can avoid reading all bytes of the archive file, but you cant avoid having to read all bytes of the new file. — Jamiec, Jun 13 '22 at 12:42
Not as long as you use GZip. That makes ZipArchiveEntry.LastWriteTime attractive, although it is only accurate to 2 seconds. — Hans Passant, Jun 13 '22 at 12:42

Stig · Answer 1 · 2022-06-13T13:10:25.580

0

First check length, if they differ return false.

Then compare a chunk at a time e.g. 32Kb. Return false on first different chunk. Allocate 2 byte arrays for the chunks and reuse these arrays. So you implementation only has 2 chunks in memory at a time.

Use GZipStream to decompress a chunk at a time

edited Jun 13 '22 at 13:10

answered Jun 13 '22 at 12:38

Stig

1,974
2
23
50

@BartłomiejStasiak I assume the files are new every time. I. You assume that the files are compared again and again and comparison therefor can use a checksum. Also you assume that the files are not changed by another process, and that a checksum once calculated is static. – Stig Jun 13 '22 at 13:01
beware of using the gzip internal CRC-32 checksum. The collision rate with a 32bit checksum might be way to high for your use case. – Stig Jun 13 '22 at 13:30
Doesn't `SequenceEqual` works the same as reading in chunks? If files are different at beginning, we can skip checking rest data – dafie Jun 13 '22 at 13:42
@dafie no it is not the same, because both ReadAllBytes and Decompress have loaded the whole file into memory. – Stig Jun 13 '22 at 13:48
Use FileStream and GZipStream (and notice the word Stream) – Stig Jun 13 '22 at 13:49
I've updated question. I went with your suggestions and new version is 315x faster! – dafie Jun 13 '22 at 15:22
The only problem is that I cant check `Length` on `GZipStream` – dafie Jun 13 '22 at 15:29
excellent work. Have you tried new FileInfo(archive).Length? – Stig Jun 13 '22 at 15:53
You have a bug. BitConverter.ToInt64 only compare first 8 bytes. You should compare until file1byte (which you should call readBytesFromFile1) – Stig Jun 13 '22 at 16:07
you're right! Unfortunately I found another problem - sometimes reading from `GZipStream` does not return 4096 bytes but less. I edited question content. Do you know why that happen? – dafie Jun 13 '22 at 16:30
I noticed, that the problem occurs in .net 6.0, not with .net core 3.1. I edited question with proper example – dafie Jun 13 '22 at 16:47
This is by design, you should always check the returned read bytes. Create a helper method like in the accepted answer here https://stackoverflow.com/questions/221925/creating-a-byte-array-from-a-stream – Stig Jun 13 '22 at 19:39
Ok, I updated question. Do you think it is ok now? – dafie Jun 13 '22 at 21:30

score 0 · Answer 2 · answered Jun 14 '22 at 01:52

var one = new byte[bytesToRead];
var two = new byte[bytesToRead];
do
{
    file1byte = fs.Read(one);
    file2byte = gs.Read(two);
}
while (one.SequenceEqual(two) && (file1byte != 0));

For some reason, sometimes I do not read full 4096 bytes from GZipStream:

That's not a bug, that's a feature. .Read is supposed to return as much data as possible on every call, without blocking your program unnecessarily. That way your program can make some progress without delay.

The simple answer is to keep reading until the buffer is full, or EOF;


do
{
    file1byte = ReadAll(fs, one);
    file2byte = ReadAll(gs, two);
}
while (one.SequenceEqual(two) && (file1byte != 0));


int ReadAll(Stream s, byte[] buff)
{
    var offset = 0;
    var read = 0;
    do
    {
        read = s.Read(buff, offset, buff.Length - offset);
        offset += read;
    } while (read > 0 && offset < buff.Length);
    return offset;
}

It seems I had the same idea (look at the `ForceRead` method) — dafie, Jun 14 '22 at 07:08

score -3 · Answer 3 · answered Jun 13 '22 at 12:35

-3

Yes, you can compare checksums of these files. It involves keeping somewhere (I guess IMemoryCache will be nice place for it) these checksums.

answered Jun 13 '22 at 12:35

Bartłomiej Stasiak

408
2
10

Compare file with archived file

Edit

3 Answers3