
I followed Microsoft's recommended way to decompress a .gz file:

https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.gzipstream?view=netcore-3.1

I am trying to download and parse files from the CommonCrawl. I can download them successfully and unzip them with 7-zip.

However, in C# I get:

System.IO.InvalidDataException: 'The archive entry was compressed using an unsupported compression method.'

using System;
using System.IO;
using System.IO.Compression;

public static void Decompress(FileInfo fileToDecompress)
{
    using (FileStream originalFileStream = fileToDecompress.OpenRead())
    {
        // Strip the trailing ".gz" to get the output file name.
        string currentFileName = fileToDecompress.FullName;
        string newFileName = currentFileName.Remove(currentFileName.Length - fileToDecompress.Extension.Length);

        using (FileStream decompressedFileStream = File.Create(newFileName))
        {
            using (GZipStream decompressionStream = new GZipStream(originalFileStream, CompressionMode.Decompress))
            {
                // Copy the decompressed bytes to the output file.
                decompressionStream.CopyTo(decompressedFileStream);
                Console.WriteLine($"Decompressed: {fileToDecompress.Name}");
            }
        }
    }
}

The file comes from here:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-16/segments/1585370490497.6/wet/CC-MAIN-20200328074047-20200328104047-00010.warc.wet.gz

Does anyone know what the problem could be? Do I need a special library?

Burf2000
  • I was able to decompress the file using your code, but only got a dozen or so lines of text, whereas 7-zip gave me ~500 MB of data starting with the same dozen lines. Not sure why. – Retired Ninja Apr 26 '20 at 20:24
  • See https://stackoverflow.com/questions/47743788/gzipstream-from-memorystream-only-returns-a-few-hundred-bytes - seems to be a bug. – Sebastian Nagel Apr 29 '20 at 13:52

2 Answers


I am not sure what the issue is, but after reading this post:

Decompressing using GZipStream returns only the first line

I switched to SharpZipLib (http://www.icsharpcode.net/opensource/sharpziplib/) and it worked.
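For anyone wanting to try the same switch, here is a minimal sketch of what the replacement could look like. It assumes the SharpZipLib NuGet package is referenced and uses its GZipInputStream class in place of System.IO.Compression.GZipStream; the class name WetDecompressor is only a placeholder for illustration, and this is not necessarily the exact code used above.

```csharp
using System.IO;
using ICSharpCode.SharpZipLib.GZip; // assumption: SharpZipLib NuGet package is installed

public static class WetDecompressor // hypothetical name, for illustration only
{
    public static string Decompress(FileInfo fileToDecompress)
    {
        // Passing null removes the last extension (".gz"), e.g. "x.warc.wet.gz" -> "x.warc.wet".
        string newFileName = Path.ChangeExtension(fileToDecompress.FullName, null);

        using (FileStream input = fileToDecompress.OpenRead())
        using (var gzip = new GZipInputStream(input))
        using (FileStream output = File.Create(newFileName))
        {
            // Stream the decompressed bytes to disk without loading the file into memory.
            gzip.CopyTo(output);
        }

        return newFileName;
    }
}
```

A call such as WetDecompressor.Decompress(new FileInfo(path)) writes the decompressed file next to the original and returns its path.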

Burf2000

I took another look at that source file, and it appears to be a large number (52,593) of gzip streams concatenated together. That is apparently legal according to the spec, but it would seem GZipStream stops after the first stream instead of reading them all. Glad you got it working!
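To check how a given runtime behaves with this kind of file, a small reproduction along these lines might help. It is only a sketch: it builds a tiny two-member gzip buffer in memory (the class and method names are made up for illustration) and reports whether a single GZipStream reads past the first member. The outcome depends on the .NET version in use, so no expected output is claimed here.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class MultiMemberGzipDemo // hypothetical name, for illustration only
{
    // Writes each string as its own gzip stream, back to back, into one buffer.
    public static byte[] BuildConcatenated(params string[] parts)
    {
        using (var buffer = new MemoryStream())
        {
            foreach (string part in parts)
            {
                // leaveOpen: true keeps the buffer usable for the next gzip stream.
                using (var gzip = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
                {
                    byte[] bytes = Encoding.UTF8.GetBytes(part);
                    gzip.Write(bytes, 0, bytes.Length);
                }
            }
            return buffer.ToArray();
        }
    }

    // Decompresses with a single GZipStream and returns whatever text comes out.
    public static string DecompressWithGZipStream(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }

    public static void Main()
    {
        byte[] data = BuildConcatenated("first record\n", "second record\n");
        string text = DecompressWithGZipStream(data);
        Console.WriteLine($"Compressed bytes: {data.Length}, decompressed chars: {text.Length}");
        Console.WriteLine(text.Contains("second record")
            ? "Both gzip streams were read."
            : "Only the first gzip stream was read.");
    }
}
```

If the second message is printed, the runtime shows the same truncation described in the question, and a library that walks through every concatenated stream is the simpler way out.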

Retired Ninja