
I'm working on a C# program to read the RDF data in the Google Freebase data dump. To start out, I've written a simple loop that reads the file and counts the triples. However, instead of the 1.9 billion triples stated on the documentation page (referenced above), my program counts only about 11.5 million and then exits. The relevant portion of the source code is given below (it takes about 30 seconds to run).

What am I missing here?

// Simple reading through the gz file
// (needs: using System; using System.IO; using System.IO.Compression;)
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
    using (StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true))
    {
        int tupleCount = 0;
        string readLine;

        // Count lines until ReadLine() reports end of stream.
        while ((readLine = sr.ReadLine()) != null)
        {
            tupleCount++;
            if (tupleCount % 1000000 == 0)
            { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString()); }
        }
        Console.WriteLine("Tuples: " + tupleCount.ToString());
    }
}
catch (Exception ex)
{ Console.WriteLine(ex.Message); }

(I tried using the GZippedNTriplesParser in dotNetRDF to read the data, building on this recommendation, but it chokes on an RdfParseException right at the beginning (tab delimiters? UTF-8?). So, for the moment, I'm trying to roll my own.)

– Krishna Gupta
  • A bug report to the dotNetRDF mailing list or the issue tracker wrt the parser choking on the Freebase output would be appreciated – RobV Feb 19 '14 at 09:41

3 Answers


The Freebase RDF dumps are built by a map/reduce job that outputs 200 individual Gzip files; those 200 files are then concatenated into one final Gzip file. According to the Gzip spec, concatenating the raw bytes of multiple Gzip files produces a valid Gzip file, and a library that adheres to the spec should, when uncompressing such a file, produce the concatenated content of all the input files.

Based on the number of triples that you're seeing, I'm guessing that your code is only uncompressing the first chunk of the file and ignoring the other 199. I'm not much of a C# programmer, but from reading another Stack Overflow answer it seems like switching to DotNetZip will solve this problem.
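
To see the concatenation property in miniature, here is a self-contained sketch (my own illustration; the class name and demo strings are made up). As far as I can tell, a spec-adherent reader, including GZipStream on newer .NET runtimes, prints both lines, while a reader that stops after the first member prints only the first, which is exactly the truncation you're seeing.

// Assumed demo: two gzip members concatenated byte-for-byte,
// like the 200 map/reduce outputs in the Freebase dump.
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GzipConcatDemo
{
    static byte[] GzipMember(string text)
    {
        using (var ms = new MemoryStream())
        {
            using (var gz = new GZipStream(ms, CompressionMode.Compress))
            {
                byte[] payload = Encoding.UTF8.GetBytes(text);
                gz.Write(payload, 0, payload.Length);
            }
            return ms.ToArray(); // valid even after the MemoryStream is closed
        }
    }

    static void Main()
    {
        var combined = new MemoryStream();
        foreach (byte[] member in new[] { GzipMember("first\n"), GzipMember("second\n") })
        {
            combined.Write(member, 0, member.Length);
        }
        combined.Position = 0;

        using (var gz = new GZipStream(combined, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz))
        {
            // A spec-adherent reader prints both lines; one that stops
            // at the first member prints only "first".
            Console.Write(reader.ReadToEnd());
        }
    }
}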

– Shawn Simister
  • Shawn, thanks for pointing out what I was missing! The 200 files seem to explain it. Following your suggestion I experimented with DotNetZip (and read its documentation), but so far I haven't been able to make it produce a single combined stream over all the files from which I can do things such as ReadLine(). It seems to read only the first file, just like .NET's GZipStream; apparently it supports this more readily for zip (not gzip) archives. I'll continue experimenting; if anybody can post relevant code samples, that would be super helpful! – Krishna Gupta Feb 20 '14 at 05:47

I use DotNetZip and wrote the decorator class GzipDecorator below as a workaround for the "gzipped chunks" problem.

// GZipStream below is Ionic.Zlib.GZipStream from DotNetZip, which
// exposes TotalIn/TotalOut (System.IO.Compression's version does not).
using System;
using System.IO;
using Ionic.Zlib;

sealed class GzipDecorator : Stream
{
    private readonly Stream _readStream;
    private GZipStream _gzip;
    private long _totalIn;
    private long _totalOut;

    public GzipDecorator(Stream readStream)
    {
        if (readStream == null)
            throw new ArgumentNullException("readStream");
        _readStream = readStream;
        _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        var bytesRead = _gzip.Read(buffer, offset, count);

        // Current member exhausted but raw bytes remain: seek to the start
        // of the next gzip member and reopen. The extra 18 bytes are the
        // 10-byte header plus 8-byte trailer that TotalIn does not include.
        if (bytesRead <= 0 && _readStream.Position < _readStream.Length)
        {
            _totalIn += _gzip.TotalIn + 18;
            _totalOut += _gzip.TotalOut;
            _gzip.Dispose();
            _readStream.Position = _totalIn;
            _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
            bytesRead = _gzip.Read(buffer, offset, count);
        }
        return bytesRead;
    }

    // Minimal Stream plumbing so this compiles as a forward-only read stream.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position { get { throw new NotSupportedException(); } set { throw new NotSupportedException(); } }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
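
A hypothetical usage sketch (the file path and counting loop are borrowed from the question, not part of the original answer): wrap the raw FileStream in the decorator so a single StreamReader sees all 200 members.

using (var file = File.OpenRead(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz"))
using (var multiGzip = new GzipDecorator(file))
using (var reader = new StreamReader(multiGzip))
{
    long tripleCount = 0;
    while (reader.ReadLine() != null)
    {
        tripleCount++;
    }
    Console.WriteLine("Triples: " + tripleCount);
}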
– Andrew Ivanov
  • You should explain how the above answer addresses the question – Heshan Sandeepa Apr 06 '15 at 17:22
  • Libraries that ungzip concatenated gzip files don't work: they ungzip only the first file in the sequence. See the answer and comment above mine for an explanation of the problem's source. Simply switching to DotNetZip doesn't work. – Andrew Ivanov Apr 08 '15 at 16:59
  • I had the same problem reading gzipped nginx logs: the common libraries read only a small piece of the log. My searches led me to questions without answers, and to bug reports from 2002 resolved as "this is not a bug". – Andrew Ivanov Apr 08 '15 at 17:17
  • I tried this example, but got a "Bad GZIP header" exception – Ivan Marinin Oct 28 '15 at 08:00

I managed to solve the problem by repacking the dump with the 7-Zip archiver. Maybe it helps you.
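
(Presumably this works because 7-Zip reads every gzip member when extracting and writes a single member when compressing. Something along the lines of "7z e freebase-rdf-2014-02-16-00-00.gz" followed by "7z a -tgzip freebase-repacked.gz freebase-rdf-2014-02-16-00-00" should do it, though the exact commands are my assumption rather than part of this answer.)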

– Ivan Marinin