
This is my code:

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

static string GenerateContent()
{
    var rnd = new Random();
    var sb = new StringBuilder();

    for (int i = 0; i < 10000000; i++)
    {
        sb.Append(rnd.Next(0, 100));
    }

    return sb.ToString();
}

static void Compress(string input, string output)
{
    using (var originalFileStream = File.OpenRead(input))
    using (var compressedFileStream = File.OpenWrite(output))
    using (var compressor = new GZipStream(compressedFileStream, CompressionMode.Compress))
        originalFileStream.CopyTo(compressor);
}

static bool AreFilesEqual_Chunk(string input, string gzip)
{
    var bytesToRead = 4096;

    var one = new byte[bytesToRead];
    var two = new byte[bytesToRead];

    using (var gs = new GZipStream(File.OpenRead(gzip), CompressionMode.Decompress))
    using (var fs = File.OpenRead(input))
    {
        int file1byte;
        int file2byte;

        do
        {
            file1byte = fs.ForceRead(one);
            file2byte = gs.ForceRead(two);
        }
        while (one.SequenceEqual(two) && (file1byte != 0));

        return file1byte == file2byte && file1byte == 0;
    }
}

static void Main(string[] args)
{
    var input = @"c:\logs\input.txt";
    var output = @"c:\logs\output.gz";

    // create input
    File.WriteAllText(input, GenerateContent());

    // compress input
    Compress(input, output);

    // compare files
    var areFilesEqual = AreFilesEqual_Chunk(input, output);

    Console.WriteLine(areFilesEqual);

    // .NET 6.0 -> files aren't equal
    // .NET core 3.1 -> files are equal
}

public static class Extensions
{
    public static int ForceRead(this Stream fs, byte[] buffer)
    {
        var totalReadBytes = 0;

        do
        {
            var readBytes = fs.Read(buffer, totalReadBytes, buffer.Length - totalReadBytes);

            if (readBytes == 0)
                return totalReadBytes;

            totalReadBytes += readBytes;
        }
        while (totalReadBytes < buffer.Length);

        return totalReadBytes;
    }
}

If I run this with .NET 6.0, areFilesEqual is false. If I run it with .NET Core 3.1, areFilesEqual is true. For some reason, when reading from GZipStream on .NET 6.0, I do not always get the requested number of bytes (4096). Why does that happen? I noticed that on .NET 6.0 a read from GZipStream sometimes returns fewer bytes than requested, for example:

4096
4096
4096
4096
770
4096
4096
4096
4096
4096
665
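
Short reads like the 770 and 665 above are allowed by the `Stream.Read` contract, so they can be reproduced without GZipStream at all. Here is a minimal sketch, assuming a hypothetical `ShortReadStream` wrapper (not part of .NET) that caps every read at 770 bytes, plus a hypothetical `ReadFully` helper that loops until the buffer is full or the stream ends:

```csharp
using System;
using System.IO;

// Hypothetical wrapper (not part of .NET): caps every Read at maxPerCall bytes
// to imitate a stream that legally returns fewer bytes than requested.
public sealed class ShortReadStream : Stream
{
    private readonly Stream _inner;
    private readonly int _maxPerCall;

    public ShortReadStream(Stream inner, int maxPerCall)
    {
        _inner = inner;
        _maxPerCall = maxPerCall;
    }

    public override int Read(byte[] buffer, int offset, int count)
        => _inner.Read(buffer, offset, Math.Min(count, _maxPerCall));

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

public static class ShortReadDemo
{
    // Keeps calling Read until the buffer is full or the stream ends.
    // Returns the number of bytes actually placed in the buffer.
    public static int ReadFully(Stream s, byte[] buffer)
    {
        var total = 0;
        while (total < buffer.Length)
        {
            var n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    public static void Main()
    {
        using (var s = new ShortReadStream(new MemoryStream(new byte[10000]), 770))
        {
            var buffer = new byte[4096];
            Console.WriteLine(s.Read(buffer, 0, buffer.Length)); // 770: a single Read may come up short
            Console.WriteLine(ReadFully(s, buffer));             // 4096: looping fills the buffer
        }
    }
}
```

A single `Read` returns whatever the stream has ready, so only the looping version can be relied on to fill the buffer. On .NET 7 and later, `Stream.ReadAtLeast` and `Stream.ReadExactly` cover the same need without a hand-written loop.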

Edit

I finally got this working. The code above now uses a ForceRead extension method, which keeps calling Read until the buffer is full or the stream ends.

dafie
  • Your comparison algorithm is flawed. It assumes that `Read` will always return the requested number of bytes every time until the end. There's no guarantee that it will do that. – madreflection Jun 13 '22 at 17:01
  • @madreflection Shouldn't we be getting as many bytes as requested? This seems to be how it works with `.net core 3.1` – dafie Jun 13 '22 at 17:03
  • Nope. You get what it can give you at the time of that call, which can be whatever it decides at that point in time. – madreflection Jun 13 '22 at 17:04
  • Why don't you combine the entire file content and compare once, to avoid the `buffered read` behaviour!? – Anand Sowmithiran Jun 13 '22 at 17:07
  • @AnandSowmithiran you suggest to decompress whole file and then compare with input? This seems to be much slower than comparing chunks of bytes – dafie Jun 13 '22 at 17:09
  • Reading the whole file could be prohibitive for large files. The alternative is to do your own buffering. That's really what you should've done in the first place since you're dealing with a *stream*. – madreflection Jun 13 '22 at 17:11
  • @dafie that's funny how you prefer code you wrote for speed ignoring correctness to obviously correct code (read whole thing and compare)... (Indeed one can compare sequences without reading whole file, you just need to write it correctly) – Alexei Levenkov Jun 13 '22 at 17:11
  • @AlexeiLevenkov I am not ignoring correctness anywhere. That's why I asked a question to find out where the bug was – dafie Jun 13 '22 at 17:16
  • Requiring, for the algorithm to work, that it always reads the number of bytes requested except at the end of the stream is, in fact, ignoring correctness. You expected too much of the `Read` method's documented contract, so when it behaved in a way that's still consistent with the contract (but not how you've previously *observed*), it revealed the bug in your algorithm. It's also possible that .NET Core 3.1 might have occasionally not behaved the way you observed. – madreflection Jun 13 '22 at 17:18
  • @madreflection So what's wrong with reading the data in chunks? If it's faster, and it only needs to be corrected, it looks like I should code it like this, rather than read the entire file at once, which may (I don't know this yet) be much slower. – dafie Jun 13 '22 at 17:22
  • @dafie I don't know what *you* are asking, but the question written here claims that Read works differently between frameworks and requires an explanation of "why". The question does not ask how to compare streams (or fix code). Unfortunately it is very hard to prove (and the question doesn't do that) because `Read` does not have to always fill the buffer ("This can be less than the number of bytes allocated in the buffer if that many bytes are not currently available, or zero (0) if the end of the stream has been reached."). Showing that one implementation *always* fills the buffer is not practical. – Alexei Levenkov Jun 13 '22 at 17:23
  • @AlexeiLevenkov I already know that `Read` works differently than I assumed. What I want to figure out now is how to change the current code so that I can compare the two files in chunks. – dafie Jun 13 '22 at 17:25
  • *"how to change the current code"* - I refer you back to my [earlier comment](https://stackoverflow.com/questions/72606502/read-bytes-from-gzipstream-on-net-6-works-different-than-on-net-core-3-1#comment128255650_72606502). – madreflection Jun 13 '22 at 17:27
  • @madreflection What you wrote is just as helpful as saying: "just do it differently" – dafie Jun 13 '22 at 17:28
  • @dafie what you *want* and what you *wrote in the question* seem to be quite different. Consider either [edit]ing the question or asking a new one... (Note that we already have https://stackoverflow.com/questions/1358510/how-to-compare-2-files-fast-using-net, which will be a duplicate even if it does not have a correct solution with reading buffers, so you may want to preemptively note that in your updated post.) – Alexei Levenkov Jun 13 '22 at 17:31
  • Only if you don't know how to manage buffers and have no intention of learning how to do so. It wasn't a block of code, but it's a reference to a specific technique that you can look up. – madreflection Jun 13 '22 at 17:32
  • @madreflection I finally get this working (and updated question) – dafie Jun 13 '22 at 19:34
  • The second call doesn't necessarily have to return that same number of bytes. You're making the same assumption as before, this time with a run-time request count, relying on the fact that *right now*, it's returning that much. Again, at any point, it doesn't have to return as many bytes as you requested, regardless of where you got that number. So while it did what you wanted, a later version (even the smallest patch, not just a full version) would reveal the same mistake in your code in the exact same way as it did this time. – madreflection Jun 13 '22 at 19:37
  • @madreflection I updated my code - now I force the missing bytes to be read – dafie Jun 13 '22 at 21:17
  • So what exactly is your question now? – Charlieface Jun 13 '22 at 23:37
  • @Charlieface at this point I want to know if there is a more efficient solution – dafie Jun 14 '22 at 09:14
  • Is there a performance problem? It looks pretty efficient. The only thing I can think of is a bug: you need to check ` && (file1byte != 0 && file2byte != 0)` and perhaps it's more performant to use spans. So you want `while (new ReadOnlySpan<byte>(one).SequenceEqual(new ReadOnlySpan<byte>(two)) && (file1byte != 0 || file2byte != 0));`. Or you could write a `for` loop in a function, that will probably be the fastest. Also if you think it's likely that they will be different lengths then it's probably faster to swap the `&&` conditions. – Charlieface Jun 14 '22 at 12:28
  • "The only thing I can think of is a bug: you need to check `&& (file1byte != 0 && file2byte != 0)`". Hmm, but if `one.SequenceEqual(two)` is `true`, both arrays are the same, which means that `file1byte` is the same as `file2byte`, so we don't have to check `file2byte != 0`, because it would give the same result as `file1byte != 0`. – dafie Jun 14 '22 at 12:48
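
The last two comments turn on one detail: `one.SequenceEqual(two)` compares the whole buffers, including bytes left over from an earlier iteration, so equal buffers do not strictly prove that both reads returned the same count. Here is a minimal sketch of one way to sidestep that, assuming hypothetical names `StreamComparer`, `AreStreamsEqual` and `ReadFully` (none of them are from the question): compare the read counts first, then compare only the bytes that were actually read, using `ReadOnlySpan<byte>`.

```csharp
using System;
using System.IO;

public static class StreamComparer
{
    // Keeps calling Read until the buffer is full or the stream ends.
    private static int ReadFully(Stream s, byte[] buffer)
    {
        var total = 0;
        while (total < buffer.Length)
        {
            var n = s.Read(buffer, total, buffer.Length - total);
            if (n == 0) break; // end of stream
            total += n;
        }
        return total;
    }

    // Compares two streams chunk by chunk without loading either one fully.
    public static bool AreStreamsEqual(Stream a, Stream b, int chunkSize = 4096)
    {
        var one = new byte[chunkSize];
        var two = new byte[chunkSize];

        while (true)
        {
            var n1 = ReadFully(a, one);
            var n2 = ReadFully(b, two);

            if (n1 != n2) return false; // one stream ended before the other
            if (n1 == 0) return true;   // both ended with no difference found

            // Slice to the bytes read so stale buffer contents are never compared.
            var left = new ReadOnlySpan<byte>(one, 0, n1);
            var right = new ReadOnlySpan<byte>(two, 0, n2);
            if (!left.SequenceEqual(right)) return false;
        }
    }

    public static void Main()
    {
        var data = new byte[10000];
        for (var i = 0; i < data.Length; i++) data[i] = (byte)(i % 251);

        var changed = (byte[])data.Clone();
        changed[9999] ^= 1; // flip one bit in the last byte

        var shorter = new byte[9000];
        Array.Copy(data, shorter, shorter.Length);

        Console.WriteLine(AreStreamsEqual(new MemoryStream(data), new MemoryStream((byte[])data.Clone()))); // True
        Console.WriteLine(AreStreamsEqual(new MemoryStream(data), new MemoryStream(changed)));              // False
        Console.WriteLine(AreStreamsEqual(new MemoryStream(data), new MemoryStream(shorter)));              // False
    }
}
```

For the files in the question, the call would presumably wrap `File.OpenRead(input)` and a decompressing `GZipStream` over `File.OpenRead(gzip)`. Checking the counts before the contents also means a length mismatch exits without a byte-by-byte comparison.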
