
I'm working on the Google Cloud Storage .NET client library. There are three features (between .NET, my client library, and the Storage service) that are combining in an unpleasant way:

  • When downloading files (objects in Google Cloud Storage terminology), the server includes a hash of the stored data. My client code then validates that hash against the data it's downloaded.

  • A separate feature of Google Cloud Storage is that the user can set the Content-Encoding of the object, and that's included as a header when downloading, when the request contains a matching Accept-Encoding. (For the moment, let's ignore the behavior when the request doesn't include that...)

  • HttpClientHandler can decompress gzip (or deflate) content automatically and transparently.

When all three of these are combined, we get into trouble. Here's a short but complete program demonstrating that, but without using my client library (and hitting a publicly accessible file):

using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        string url = "https://www.googleapis.com/download/storage/v1/b/"
            + "storage-library-test-bucket/o/gzipped-text.txt?alt=media";
        // Setting AutomaticDecompression makes the handler both send
        // "Accept-Encoding: gzip" and transparently decompress the response.
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip
        };
        var client = new HttpClient(handler);

        var response = await client.GetAsync(url);
        byte[] content = await response.Content.ReadAsByteArrayAsync();
        string text = Encoding.UTF8.GetString(content);
        Console.WriteLine($"Content: {text}");

        var hashHeader = response.Headers.GetValues("X-Goog-Hash").FirstOrDefault();
        Console.WriteLine($"Hash header: {hashHeader}");

        using (var md5 = MD5.Create())
        {
            // This hashes the bytes *after* decompression, so it can't match
            // the header, which describes the stored (compressed) object.
            var md5Hash = md5.ComputeHash(content);
            var md5HashBase64 = Convert.ToBase64String(md5Hash);
            Console.WriteLine($"MD5 of content: {md5HashBase64}");
        }
    }
}

.NET Core project file:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
    <LangVersion>7.1</LangVersion>
  </PropertyGroup>
</Project>

Output:

Content: hello world
Hash header: crc32c=T1s5RQ==,md5=xhF4M6pNFRDQnvaRRNVnkA==
MD5 of content: XrY7u+Ae7tCTyyK7j1rNww==

As you can see, the MD5 of the content isn't the same as the MD5 part of the X-Goog-Hash header. (In my client library I'm using the crc32c hash, but that shows the same behavior.)
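As an aside, the header packs several checksums into one value, and the Base64 payloads themselves end in `=`, so extracting the md5 part takes slightly more care than a naive split on `=`. A minimal parsing sketch (the helper name is mine, purely for illustration):

using System;
using System.Linq;

static class GoogHashParser
{
    // Parses "crc32c=T1s5RQ==,md5=xhF4M6pNFRDQnvaRRNVnkA==" and returns the
    // value for one algorithm. The split on '=' is limited to 2 parts because
    // the Base64 payload can itself end in '=' padding.
    public static string GetValue(string header, string algorithm) =>
        header.Split(',')
              .Select(part => part.Split(new[] { '=' }, 2))
              .Where(pair => pair[0] == algorithm)
              .Select(pair => pair[1])
              .FirstOrDefault();
}

With the header above, GoogHashParser.GetValue(hashHeader, "md5") returns "xhF4M6pNFRDQnvaRRNVnkA==", which is what I need to compare against.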

This isn't a bug in HttpClientHandler - it's expected behavior, but a pain when I want to validate the hash. Basically, I need to get at the content both before and after decompression. And I can't find any way of doing that.

To clarify my requirements somewhat, I know how to prevent the decompression in HttpClient and instead decompress afterwards when reading from the stream - but I need to be able to do this without changing any of the code that uses the resulting HttpResponseMessage from the HttpClient. (There's a lot of code that deals with responses, and I want to only make the change in one central place.)

I have a plan, which I've prototyped and which works as far as I've found so far, but is a bit ugly. It involves creating a three-layer handler:

  • HttpClientHandler with automatic decompression disabled.
  • A new handler which replaces the content stream with a new Stream subclass which delegates to the original content stream, but hashes the data as it's read. (A rough sketch of this layer follows the list.)
  • A decompression-only handler, based on the Microsoft DecompressionHandler code.
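
To make that middle layer concrete, here's a rough sketch of what I have in mind, assuming that overriding Read is enough (the class names HashingHandler and HashingStream are mine, and MD5 is hard-coded purely for illustration; a real version would also override ReadAsync, expose the finished hash, and handle disposal):

using System;
using System.IO;
using System.Net.Http;
using System.Security.Cryptography;
using System.Threading;
using System.Threading.Tasks;

public class HashingHandler : DelegatingHandler
{
    public HashingHandler(HttpMessageHandler innerHandler) : base(innerHandler)
    {
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = await base.SendAsync(request, cancellationToken);
        // Wrap the still-compressed stream so each byte is hashed as it's
        // read, copying the content headers across to the replacement content.
        var original = await response.Content.ReadAsStreamAsync();
        var newContent = new StreamContent(new HashingStream(original, MD5.Create()));
        foreach (var header in response.Content.Headers)
        {
            newContent.Headers.TryAddWithoutValidation(header.Key, header.Value);
        }
        response.Content = newContent;
        return response;
    }
}

// Delegates to the original content stream, feeding every byte read
// through the hash algorithm on the way past.
public class HashingStream : Stream
{
    private readonly Stream inner;
    private readonly HashAlgorithm hash;
    private bool finished;

    public HashingStream(Stream inner, HashAlgorithm hash)
    {
        this.inner = inner;
        this.hash = hash;
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int bytesRead = inner.Read(buffer, offset, count);
        if (bytesRead > 0)
        {
            hash.TransformBlock(buffer, offset, bytesRead, null, 0);
        }
        else if (!finished)
        {
            // End of stream: finalize so that hash.Hash becomes available.
            hash.TransformFinalBlock(Array.Empty<byte>(), 0, 0);
            finished = true;
        }
        return bytesRead;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}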

While this works, it has some disadvantages:

  • Open source licensing: checking exactly what I need to do in order to create a new file in my repo based on the MIT-licensed Microsoft code
  • Effectively forking the MS code, which means I should probably make a regular check to see if any bugs have been found in it
  • The Microsoft code uses internal members of the assembly, so it doesn't port as cleanly as it might.

If Microsoft made DecompressionHandler public, that would help a lot - but that's likely to be in a longer timeframe than I need.

What I'm looking for is an alternative approach if possible - something I've missed that lets me get at the content before decompression. I don't want to reinvent HttpClient - the response is often chunked for example, and I don't want to have to get into that side of things. It's a pretty specific interception point that I'm looking for.

Jon Skeet
  • It sounds to me as if this compression part, in terms of the storage side, is sort of like "I really have an uncompressed file, but it would be nice if I can store it compressed and have the decompression part of the browser decompress it automatically". If so, wouldn't it make sense to store/serve the hash of the decompressed content instead? It sounds like this is just a server space and CPU optimization, avoiding the compression step on the server side. What am I missing here? Won't a lot of client libraries have the exact same problem due to this? – Lasse V. Karlsen Nov 16 '17 at 08:17
  • @LasseVågsætherKarlsen: It would be nice if the response could contain both the hash of the compressed data *and* the uncompressed data (you wouldn't want clients to have to decompress it just for hashing if they wanted to keep it compressed otherwise) but I doubt that I'll be able to get that change through. And yes, some other client libraries probably do have the same problem - but I'm in touch with the maintainers of the official Google ones, and they're checking it :) – Jon Skeet Nov 16 '17 at 08:19
  • Another question (that I throw out there, this is more of a question against the client handler, not towards your code) is why the handler implementation disregards setting the automatic decompression to none, it decompresses just the same. – Lasse V. Karlsen Nov 16 '17 at 08:22
  • @LasseVågsætherKarlsen: If you were still fetching from GCS, that's not `HttpClientHandler` doing it - that's GCS. If you ask for a file with a Content-Encoding of gzip but you don't specify Accept-Encoding: gzip, it decompresses it for you, serving the decompressed content with no Content-Encoding header. (And still includes the hash of the compressed file. I know, it's problematic... I didn't want to get into *all* the possible quirks in this question, but let me know if you think I should mention that.) – Jon Skeet Nov 16 '17 at 08:24
  • Well, the point of my questions was that if the server goes to this much trouble to make it difficult, wouldn't this whole problem be better served with an issue/request towards the server instead? It seems like this process is doomed to fail: if you're given a decompressed file and have to guess at the compression parameters (or *worse*, use undocumented information) to try to compress it client-side in the hope of getting the same original content just to verify the hash, this sounds like a badly designed situation to begin with. – Lasse V. Karlsen Nov 16 '17 at 08:28
  • Put simply, it seems this hash is **designed** to be **unverifiable**, which sounds kinda pointless to me. – Lasse V. Karlsen Nov 16 '17 at 08:29
  • Is the file compressed by the server when storing it, so that at least the compression parameters are known and fixed? Or is the compressed file provided by the uploader, in compressed form, compressed by whatever favorite tool that person is using? If so, then it seems like this is a no-win situation. – Lasse V. Karlsen Nov 16 '17 at 08:31
  • @LasseVågsætherKarlsen: I think you have a point for the "server-side decompression" part, and I may be able to get that changed, but I think it's not entirely unreasonable for the hash to be "the hash of the content as it's served" - so the case I'm looking at (when we *are* specifying Accept-Encoding) is one that I think can be handled client-side. (Note that just changing the hash to be the hash of the decompressed content would cause some existing working clients to fail.) – Jon Skeet Nov 16 '17 at 08:31
  • @LasseVågsætherKarlsen: The file is compressed by the uploading user (and the Content-Encoding specified by them too). – Jon Skeet Nov 16 '17 at 08:31
  • Well, then your question will only be viable in the context of asking for the compressed file, as you've stated, and then know that if you ask for the already-decompressed version you're gambling on the compression parameters. – Lasse V. Karlsen Nov 16 '17 at 08:32
  • Are you running the code on Windows? It seems that with .NET Core 2.0 the team decided to drop the ability to use the managed handler on Windows, so you're always using this interop class: https://github.com/dotnet/corefx/blob/93ee4ba40c82d5aca978447cb3e14c4ef7e7fd53/src/Common/src/System/Net/Http/HttpHandlerDefaults.cs And while I am uncertain, it seems that any decompressing happens down there. – zaitsman Nov 16 '17 at 08:54
  • According to MSDN here: https://msdn.microsoft.com/en-us/library/windows/desktop/aa384066(v=vs.85).aspx Winhttp.dll supports only three options: Gzip, Deflate and All. And when you set `None` in your c# code on windows, it seems that effectively means `All`. – zaitsman Nov 16 '17 at 08:56
  • @LasseVågsætherKarlsen: Yes, I have a plan for situations where users explicitly turn off client-side decompression, but I didn't want to go into too much detail here. – Jon Skeet Nov 16 '17 at 09:04
  • @zaitsman: The defaults have definitely been changing, and there may be no way of explicitly opting into the managed handler, but the handler code itself should still work fine. I haven't seen any evidence that "none" means "all" on Windows... but it's easy to be confused due to the server-side decompression. – Jon Skeet Nov 16 '17 at 09:05
  • @JonSkeet I wasn't able to find the source code of winhttp.dll, that'd be the definitive answer. The way the code in `WinHttpHandler` is supplied (see here: https://github.com/dotnet/corefx/blob/a92474e2f5282fc2ac81c4f6d703b6d2f5248bac/src/System.Net.Http.WinHttpHandler/src/System/Net/Http/WinHttpHandler.cs) is that it will call the `SetWinHttpOption` even if you didn't specify the value for this. The enumeration (https://fossies.org/linux/ldc/runtime/druntime/src/core/sys/windows/winhttp.d) seems to only allow values 1,2 or 3. It's unclear what happens when they pass 0 there. – zaitsman Nov 16 '17 at 09:09
  • @zaitsman: I'd generally trust what I see on the wire even more than the source code :) I've been running most of my tests on .NET Core, but on Windows - and that's definitely able to disable compression. – Jon Skeet Nov 16 '17 at 09:10
  • I don't think it's possible in all cases, because on windows for example that will use native WinHttp calls, and when you set `AutomaticDecompression = DecompressionMethods.GZip` - decompression will be performed by winhttp itself, preventing you from somehow intercepting raw stream – Evk Nov 16 '17 at 09:32
  • Have you checked how it is done in the Azure Storage SDK? Maybe they did the same. I quickly checked the code and there is something that handles the decompression + MD5 - https://github.com/Azure/azure-storage-net/blob/72c4cb3d7deff16ecc355a848a8476218b8e0555/Lib/ClassLibraryCommon/Blob/CloudBlob.cs#L3143 – Ondra Nov 16 '17 at 13:20
  • @Ondra: That line is about decryption rather than decompression. I suspect that the hash is propagated in a different way in Azure Storage. (It's a very specific situation in this case.) – Jon Skeet Nov 16 '17 at 13:26
  • Yes, DecompressionHandler being internal is a bummer. I guess you could always create an instance of it via reflection :/ – Eren Ersönmez Nov 16 '17 at 20:57
  • I don't know too much about this, just suggesting: How about sniffing the network like [this](https://stackoverflow.com/a/12437794/5976576) – MotKohn Nov 22 '17 at 15:59
  • @MotKohn: That would involve rewriting the complete HTTP stack. (Even if I could just intercept without handling the data, I'd need to *understand* all the data, e.g. where the headers ended, how chunked encoding was handled.) – Jon Skeet Nov 22 '17 at 16:04
  • Why can't you just "sniff", in order to get the compressed data, and use the regular HTTP at the same time? Oh, you edited - I didn't see that. – MotKohn Nov 22 '17 at 16:06

3 Answers

16

Looking at what @Michael did gave me the hint I was missing. After getting the compressed content, you can use CryptoStream, GZipStream, and StreamReader to read the response without loading it into memory more than needed. CryptoStream hashes the compressed content as it is decompressed and read. Replace the StreamReader with a FileStream and you can write the data to a file with minimal memory usage :)

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        string url = "https://www.googleapis.com/download/storage/v1/b/"
            + "storage-library-test-bucket/o/gzipped-text.txt?alt=media";
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.None
        };
        var client = new HttpClient(handler);
        client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip");

        var response = await client.GetAsync(url);
        var hashHeader = response.Headers.GetValues("X-Goog-Hash").FirstOrDefault();
        Console.WriteLine($"Hash header: {hashHeader}");
        string text = null;
        using (var md5 = MD5.Create())
        {
            // CryptoStream hashes the compressed bytes as they're read...
            using (var cryptoStream = new CryptoStream(await response.Content.ReadAsStreamAsync(), md5, CryptoStreamMode.Read))
            {
                // ...while GZipStream decompresses those same bytes for the reader.
                using (var gzipStream = new GZipStream(cryptoStream, CompressionMode.Decompress))
                {
                    using (var streamReader = new StreamReader(gzipStream, Encoding.UTF8))
                    {
                        text = streamReader.ReadToEnd();
                    }
                }
                Console.WriteLine($"Content: {text}");
                var md5HashBase64 = Convert.ToBase64String(md5.Hash);
                Console.WriteLine($"MD5 of content: {md5HashBase64}");
            }
        }
    }
}

Output:

Hash header: crc32c=T1s5RQ==,md5=xhF4M6pNFRDQnvaRRNVnkA==
Content: hello world
MD5 of content: xhF4M6pNFRDQnvaRRNVnkA==
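
For the file-writing variant mentioned at the top of this answer, the inner using block can be replaced with something like the following sketch ("output.txt" is just an example path); md5.Hash is valid once the streams have been drained and disposed:

using (var md5 = MD5.Create())
{
    using (var cryptoStream = new CryptoStream(await response.Content.ReadAsStreamAsync(), md5, CryptoStreamMode.Read))
    using (var gzipStream = new GZipStream(cryptoStream, CompressionMode.Decompress))
    using (var fileStream = File.Create("output.txt"))
    {
        // Decompressed bytes go straight to disk; the compressed bytes
        // are hashed in transit by the CryptoStream.
        await gzipStream.CopyToAsync(fileStream);
    }
    Console.WriteLine($"MD5 of content: {Convert.ToBase64String(md5.Hash)}");
}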

V2 of Answer

After reading Jon's response, I have the following updated version. It's pretty much the same idea, but I've moved the streaming into a special HttpContent that I inject. Not exactly pretty, but the idea is there.

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        string url = "https://www.googleapis.com/download/storage/v1/b/"
            + "storage-library-test-bucket/o/gzipped-text.txt?alt=media";
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.None
        };
        var client = new HttpClient(new Intercepter(handler));
        client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip");

        var response = await client.GetAsync(url);
        var hashHeader = response.Headers.GetValues("X-Goog-Hash").FirstOrDefault();
        Console.WriteLine($"Hash header: {hashHeader}");
        HttpContent content1 = response.Content;
        byte[] content = await content1.ReadAsByteArrayAsync();
        string text = Encoding.UTF8.GetString(content);
        Console.WriteLine($"Content: {text}");
        var md5Hash = ((HashingContent)content1).Hash;
        var md5HashBase64 = Convert.ToBase64String(md5Hash);
        Console.WriteLine($"MD5 of content: {md5HashBase64}");
    }

    public class Intercepter : DelegatingHandler
    {
        public Intercepter(HttpMessageHandler innerHandler) : base(innerHandler)
        {
        }

        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            var response = await base.SendAsync(request, cancellationToken);
            response.Content = new HashingContent(await response.Content.ReadAsStreamAsync());
            return response;
        }
    }

    public sealed class HashingContent : HttpContent
    {
        private readonly StreamContent streamContent;
        private readonly MD5 mD5;
        private readonly CryptoStream cryptoStream;
        private readonly GZipStream gZipStream;

        public HashingContent(Stream content)
        {
            // Layer the streams: network -> MD5 hashing -> gzip decompression,
            // so the hash sees the compressed bytes while callers get plain text.
            mD5 = MD5.Create();
            cryptoStream = new CryptoStream(content, mD5, CryptoStreamMode.Read);
            gZipStream = new GZipStream(cryptoStream, CompressionMode.Decompress);
            streamContent = new StreamContent(gZipStream);
        }

        protected override Task SerializeToStreamAsync(Stream stream, TransportContext context) => streamContent.CopyToAsync(stream, context);
        protected override bool TryComputeLength(out long length)
        {
            length = 0;
            return false;
        }

        protected override Task<Stream> CreateContentReadStreamAsync() => streamContent.ReadAsStreamAsync();

        protected override void Dispose(bool disposing)
        {
            try
            {
                if (disposing)
                {
                    streamContent.Dispose();
                    gZipStream.Dispose();
                    cryptoStream.Dispose();
                    mD5.Dispose();
                }
            }
            finally
            {
                base.Dispose(disposing);
            }
        }

        public byte[] Hash => mD5.Hash;
    }
}
shmuelie
  • That would be fine if my code were all reading the data - but it's not. (Or at least, it's doing so in very different places.) I really need to keep the API the same, using HttpClient and just intercepting the data as it's read :( I'll edit the question when I get the chance to make the requirements clearer. – Jon Skeet Nov 21 '17 at 21:32
  • @JonSkeet you are a tricky customer! Think I got it this time :) – shmuelie Nov 21 '17 at 22:21
  • Right, this is now effectively the workaround I described, except without the separation between hashing and decompression - and without the header copying that DecompressionHandler does. I'm glad we ended up at roughly the same place, even if it isn't as uninvasive as I hoped. – Jon Skeet Nov 22 '17 at 10:28
  • Important difference is I don't use anything internal :) – shmuelie Nov 22 '17 at 18:56
  • @shmulie: Yes, but by reimplementing bits - just as I'm planning to do. (With headers etc too.) – Jon Skeet Nov 22 '17 at 19:34
5

I managed to get the hash to match the header by:

  • creating a custom handler that inherits from HttpClientHandler
  • overriding SendAsync
  • reading the response as a byte array via base.SendAsync
  • recompressing it using GZipStream
  • hashing the recompressed bytes to Base64 MD5 (using your code)

The issue is, as you said, that "before decompression" is not really respected here.

The idea is to get this `if` to behave the way you would like: https://github.com/dotnet/corefx/blob/master/src/System.Net.Http.WinHttpHandler/src/System/Net/Http/WinHttpResponseParser.cs#L80-L91

And the hash matches:

using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    const string url = "https://www.googleapis.com/download/storage/v1/b/storage-library-test-bucket/o/gzipped-text.txt?alt=media";

    static async Task Main()
    {
        //await HashResponseContent(CreateHandler(DecompressionMethods.None));
        //await HashResponseContent(CreateHandler(DecompressionMethods.GZip));
        await HashResponseContent(new MyHandler());

        Console.ReadLine();
    }

    private static HttpClientHandler CreateHandler(DecompressionMethods decompressionMethods)
    {
        return new HttpClientHandler { AutomaticDecompression = decompressionMethods };
    }

    public static async Task HashResponseContent(HttpClientHandler handler)
    {
        //Console.WriteLine($"Using AutomaticDecompression : '{handler.AutomaticDecompression}'");
        //Console.WriteLine($"Using SupportsAutomaticDecompression : '{handler.SupportsAutomaticDecompression}'");
        //Console.WriteLine($"Using Properties : '{string.Join('\n', handler.Properties.Keys.ToArray())}'");

        var client = new HttpClient(handler);

        var response = await client.GetAsync(url);
        byte[] content = await response.Content.ReadAsByteArrayAsync();
        string text = Encoding.UTF8.GetString(content);
        Console.WriteLine($"Content: {text}");

        var hashHeader = response.Headers.GetValues("X-Goog-Hash").FirstOrDefault();
        Console.WriteLine($"Hash header: {hashHeader}");
        Console.WriteLine($"MD5 of content: {byteArrayToMd5(content)}");

        Console.WriteLine($"=====================================================================");
    }

    public static string byteArrayToMd5(byte[] content)
    {
        using (var md5 = MD5.Create())
        {
            var md5Hash = md5.ComputeHash(content);
            return Convert.ToBase64String(md5Hash);
        }
    }

    public static byte[] Compress(byte[] contentToGzip)
    {
        using (MemoryStream resultStream = new MemoryStream())
        {
            using (MemoryStream contentStreamToGzip = new MemoryStream(contentToGzip))
            {
                using (GZipStream compressionStream = new GZipStream(resultStream, CompressionMode.Compress))
                {
                    contentStreamToGzip.CopyTo(compressionStream);
                }
            }

            return resultStream.ToArray();
        }
    }
}

public class MyHandler : HttpClientHandler
{
    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = await base.SendAsync(request, cancellationToken);
        var responseContent = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);

        Console.WriteLine($"decompressed response md5 : {Program.byteArrayToMd5(responseContent)}");

        var compressedResponse = Program.Compress(responseContent);
        var compressedResponseMd5 = Program.byteArrayToMd5(compressedResponse);

        Console.WriteLine($"recompressed response to md5 : {compressedResponseMd5}");

        return response;
    }
}
  • That's worked because I *happened* to use .NET to gzip the content of that file to start with, using the default settings. But there are several different ways of gzipping content, which would end up creating different hashes. If gzip were stable (i.e. compressing the same input always gave the same output) this would be feasible - but it won't work for this case :( – Jon Skeet Nov 16 '17 at 16:23
  • This is pretty weird since default value (and snooping while debugging) seems to call that `if` statement with `false` so it should NOT decompress in fact https://user-images.githubusercontent.com/2266487/32904816-8d42e9c6-caf8-11e7-8d48-0dae061a3772.png – Alexandre Hgs Nov 16 '17 at 17:00
  • If the library doesn't send the Accept-Encoding, the server decompresses the content on the fly. I suspect that's what's happening in this case - you're then recompressing it using the same settings as the original compression, so you end up with the same hash. – Jon Skeet Nov 16 '17 at 17:05
5

What about disabling automatic decompression, manually adding the Accept-Encoding header(s) and then decompressing after hash verification?

private static async Task Test2()
{
    var url = @"https://www.googleapis.com/download/storage/v1/b/storage-library-test-bucket/o/gzipped-text.txt?alt=media";
    var handler = new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.None
    };
    var client = new HttpClient(handler);
    client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip");

    var response = await client.GetAsync(url);
    var raw = await response.Content.ReadAsByteArrayAsync();

    var hashHeader = response.Headers.GetValues("X-Goog-Hash").FirstOrDefault();
    Debug.WriteLine($"Hash header: {hashHeader}");

    bool match = false;
    using (var md5 = MD5.Create())
    {
        var md5Hash = md5.ComputeHash(raw);
        var md5HashBase64 = Convert.ToBase64String(md5Hash);
        match = hashHeader.EndsWith(md5HashBase64);
        Debug.WriteLine($"MD5 of content: {md5HashBase64}");
    }

    if (match)
    {
        using (var memInput = new MemoryStream(raw))
        using (var gz = new GZipStream(memInput, CompressionMode.Decompress))
        using (var memOutput = new MemoryStream())
        {
            gz.CopyTo(memOutput);
            var text = Encoding.UTF8.GetString(memOutput.ToArray());
            Console.WriteLine($"Content: {text}");
        }
    }
}
Michael
  • This is basically a simpler but less efficient version of my prototype. The problem is that it keeps the whole stream in memory - when these files can be multiple gigabytes. I need to insert the hashing within the stream returned from the content :( – Jon Skeet Nov 18 '17 at 10:30
  • If we're talking about gigabytes then this approach is unusable, sorry :( – Michael Nov 18 '17 at 11:52