36

I'm trying to read and parse a large JSON file that cannot fit in memory with the new JSON reader System.Text.Json in .NET Core 3.0.

The example code from Microsoft takes a ReadOnlySpan<byte> as input:

    public static void Utf8JsonReaderLoop(ReadOnlySpan<byte> dataUtf8)
    {
        var json = new Utf8JsonReader(dataUtf8, isFinalBlock: true, state: default);

        while (json.Read())
        {
            JsonTokenType tokenType = json.TokenType;
            ReadOnlySpan<byte> valueSpan = json.ValueSpan;
            switch (tokenType)
            {
                case JsonTokenType.StartObject:
                case JsonTokenType.EndObject:
                    break;
                case JsonTokenType.StartArray:
                case JsonTokenType.EndArray:
                    break;
                case JsonTokenType.PropertyName:
                    break;
                case JsonTokenType.String:
                    string valueString = json.GetString();
                    break;
                case JsonTokenType.Number:
                    if (!json.TryGetInt32(out int valueInteger))
                    {
                        throw new FormatException();
                    }
                    break;
                case JsonTokenType.True:
                case JsonTokenType.False:
                    bool valueBool = json.GetBoolean();
                    break;
                case JsonTokenType.Null:
                    break;
                default:
                    throw new ArgumentException();
            }
        }

        dataUtf8 = dataUtf8.Slice((int)json.BytesConsumed);
        JsonReaderState state = json.CurrentState;
    }

What I can't work out is how to use this code with a FileStream, that is, how to get the contents of a FileStream into a ReadOnlySpan<byte>.

I tried reading the file with the following code, called as ReadAndProcessLargeFile("latest-all.json"):

    const int megabyte = 1024 * 1024;
    public static void ReadAndProcessLargeFile(string theFilename, long whereToStartReading = 0)
    {
        FileStream fileStream = new FileStream(theFilename, FileMode.Open, FileAccess.Read);
        using (fileStream)
        {
            byte[] buffer = new byte[megabyte];
            fileStream.Seek(whereToStartReading, SeekOrigin.Begin);
            int bytesRead = fileStream.Read(buffer, 0, megabyte);
            while (bytesRead > 0)
            {
                ProcessChunk(buffer, bytesRead);
                bytesRead = fileStream.Read(buffer, 0, megabyte);
            }

        }
    }

    private static void ProcessChunk(byte[] buffer, int bytesRead)
    {
        var span = new ReadOnlySpan<byte>(buffer);
        Utf8JsonReaderLoop(span);
    }

It crashes with the error message:

System.Text.Json.JsonReaderException: 'Expected end of string, but instead reached end of data. LineNumber: 8 | BytePositionInLine: 123335.'

For reference, here is my working code that uses Newtonsoft.Json:

        dynamic o;
        var serializer = new Newtonsoft.Json.JsonSerializer();
        using (FileStream s = File.Open("latest-all.json", FileMode.Open))
        using (StreamReader sr = new StreamReader(s))
        using (JsonReader reader = new JsonTextReader(sr))
        {
            while (reader.Read())
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    o = serializer.Deserialize(reader);
                }
            }
        }
dbc
J. Margarine
  • `ProcessChunk` doesn't use `bytesRead`. I think you also need to pass `state` from the previous `Utf8JsonReader` into the `Utf8JsonReader` ctor, and *correctly* indicate whether you're giving it the final block. – canton7 Mar 04 '19 at 12:47
  • Also, `Stream.Read` can take a `Span` as well as a `byte[]` – canton7 Mar 04 '19 at 13:06
  • So... why don't you use `Utf8JsonReader.Parse(Stream,JsonReaderOptions)`? I suppose, regardless of how you _feed_ the data, the question is whether the final resulting object fits in your memory. And if it does, the stream parser should work, too. – Mike Makarov Mar 04 '19 at 14:15
  • The JSON file is a dump of WikiData and is about 800GB. Each entity that I want to parse is small though, as described here https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON. I can't seem to find Utf8JsonReader.Parse though? – J. Margarine Mar 04 '19 at 14:31
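A minimal sketch of what the first comment above describes, offered as an illustration rather than as code from the question or from Microsoft's sample: only the bytes actually read are handed to the reader, the JsonReaderState is carried from one Utf8JsonReader to the next, isFinalBlock is set only when the stream is exhausted, and any unconsumed tail of a chunk is moved to the front of the buffer before the next read. The names ChunkedJsonReading and CountTokens and the default buffer size are hypothetical, and counting tokens stands in for real processing.

```csharp
using System;
using System.IO;
using System.Text.Json;

public static class ChunkedJsonReading
{
    // Counts the JSON tokens in a UTF-8 stream while holding only one buffer in memory.
    public static long CountTokens(Stream stream, int bufferSize = 64 * 1024)
    {
        byte[] buffer = new byte[bufferSize];
        int leftover = 0;                 // unconsumed bytes kept at the start of the buffer
        long tokens = 0;
        JsonReaderState state = default;  // carries depth and token context between chunks

        while (true)
        {
            int bytesRead = stream.Read(buffer, leftover, buffer.Length - leftover);
            bool isFinalBlock = bytesRead == 0;   // Read returns 0 only at end of stream
            int available = leftover + bytesRead;

            // Only the valid part of the buffer is passed to the reader.
            var reader = new Utf8JsonReader(buffer.AsSpan(0, available), isFinalBlock, state);
            while (reader.Read())
                tokens++;

            if (isFinalBlock)
                return tokens;

            state = reader.CurrentState;
            int consumed = (int)reader.BytesConsumed;
            leftover = available - consumed;

            if (leftover == buffer.Length)
            {
                // A single token is larger than the buffer: grow it (contents are preserved).
                Array.Resize(ref buffer, buffer.Length * 2);
            }
            else if (consumed > 0)
            {
                // Move the incomplete tail to the front so the next read appends to it.
                Buffer.BlockCopy(buffer, consumed, buffer, 0, leftover);
            }
        }
    }
}
```

Called as `ChunkedJsonReading.CountTokens(File.OpenRead("latest-all.json"))`, this keeps memory bounded by the largest single token rather than by the file size. A truncated file surfaces as a JsonException on the final block.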

3 Answers

39

Update 2019-10-13: Rewrote Utf8JsonStreamReader to use ReadOnlySequence<byte> internally and added a wrapper for the JsonSerializer.Deserialize method.


I have created a wrapper around Utf8JsonReader for exactly this purpose:

public ref struct Utf8JsonStreamReader
{
    private readonly Stream _stream;
    private readonly int _bufferSize;

    private SequenceSegment? _firstSegment;
    private int _firstSegmentStartIndex;
    private SequenceSegment? _lastSegment;
    private int _lastSegmentEndIndex;

    private Utf8JsonReader _jsonReader;
    private bool _keepBuffers;
    private bool _isFinalBlock;

    public Utf8JsonStreamReader(Stream stream, int bufferSize)
    {
        _stream = stream;
        _bufferSize = bufferSize;

        _firstSegment = null;
        _firstSegmentStartIndex = 0;
        _lastSegment = null;
        _lastSegmentEndIndex = -1;

        _jsonReader = default;
        _keepBuffers = false;
        _isFinalBlock = false;
    }

    public bool Read()
    {
        // a read can fail because the buffer is too small; retry in a loop with additional buffer segments
        while (!_jsonReader.Read())
        {
            if (_isFinalBlock)
                return false;

            MoveNext();
        }

        return true;
    }

    private void MoveNext()
    {
        var firstSegment = _firstSegment;
        _firstSegmentStartIndex += (int)_jsonReader.BytesConsumed;

        // release previous segments if possible
        if (!_keepBuffers)
        {
            while (firstSegment?.Memory.Length <= _firstSegmentStartIndex)
            {
                _firstSegmentStartIndex -= firstSegment.Memory.Length;
                firstSegment.Dispose();
                firstSegment = (SequenceSegment?)firstSegment.Next;
            }
        }

        // create new segment
        var newSegment = new SequenceSegment(_bufferSize, _lastSegment);

        if (firstSegment != null)
        {
            _firstSegment = firstSegment;
            newSegment.Previous = _lastSegment;
            _lastSegment?.SetNext(newSegment);
            _lastSegment = newSegment;
        }
        else
        {
            _firstSegment = _lastSegment = newSegment;
            _firstSegmentStartIndex = 0;
        }

        // read data from stream
        _lastSegmentEndIndex = _stream.Read(newSegment.Buffer.Memory.Span);
        _isFinalBlock = _lastSegmentEndIndex < newSegment.Buffer.Memory.Length;
        _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment, _firstSegmentStartIndex, _lastSegment, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
    }

    public T Deserialize<T>(JsonSerializerOptions? options = null)
    {
        // JsonSerializer.Deserialize can read only a single object. We have to extract
        // the object to be deserialized into a separate Utf8JsonReader. This incurs one additional
        // pass through the data (but the data is only scanned for its end, not deserialized).
        var tokenStartIndex = _jsonReader.TokenStartIndex;
        var firstSegment = _firstSegment;
        var firstSegmentStartIndex = _firstSegmentStartIndex;

        // loop through data until end of object is found
        _keepBuffers = true;
        int depth = 0;

        if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
            depth++;

        while (depth > 0 && Read())
        {
            if (TokenType == JsonTokenType.StartObject || TokenType == JsonTokenType.StartArray)
                depth++;
            else if (TokenType == JsonTokenType.EndObject || TokenType == JsonTokenType.EndArray)
                depth--;
        }

        _keepBuffers = false;

        // end of object found, extract json reader for deserializer
        var newJsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(firstSegment!, firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex).Slice(tokenStartIndex, _jsonReader.Position), true, default);

        // deserialize value
        var result = JsonSerializer.Deserialize<T>(ref newJsonReader, options);

        // release memory if possible
        firstSegmentStartIndex = _firstSegmentStartIndex + (int)_jsonReader.BytesConsumed;

        while (firstSegment?.Memory.Length < firstSegmentStartIndex)
        {
            firstSegmentStartIndex -= firstSegment.Memory.Length;
            firstSegment.Dispose();
            firstSegment = (SequenceSegment?)firstSegment.Next;
        }

        if (firstSegment != _firstSegment)
        {
            _firstSegment = firstSegment;
            _firstSegmentStartIndex = firstSegmentStartIndex;
            _jsonReader = new Utf8JsonReader(new ReadOnlySequence<byte>(_firstSegment!, _firstSegmentStartIndex, _lastSegment!, _lastSegmentEndIndex), _isFinalBlock, _jsonReader.CurrentState);
        }

        return result;
    }

    public void Dispose() => _lastSegment?.Dispose();

    public int CurrentDepth => _jsonReader.CurrentDepth;
    public bool HasValueSequence => _jsonReader.HasValueSequence;
    public long TokenStartIndex => _jsonReader.TokenStartIndex;
    public JsonTokenType TokenType => _jsonReader.TokenType;
    public ReadOnlySequence<byte> ValueSequence => _jsonReader.ValueSequence;
    public ReadOnlySpan<byte> ValueSpan => _jsonReader.ValueSpan;

    public bool GetBoolean() => _jsonReader.GetBoolean();
    public byte GetByte() => _jsonReader.GetByte();
    public byte[] GetBytesFromBase64() => _jsonReader.GetBytesFromBase64();
    public string GetComment() => _jsonReader.GetComment();
    public DateTime GetDateTime() => _jsonReader.GetDateTime();
    public DateTimeOffset GetDateTimeOffset() => _jsonReader.GetDateTimeOffset();
    public decimal GetDecimal() => _jsonReader.GetDecimal();
    public double GetDouble() => _jsonReader.GetDouble();
    public Guid GetGuid() => _jsonReader.GetGuid();
    public short GetInt16() => _jsonReader.GetInt16();
    public int GetInt32() => _jsonReader.GetInt32();
    public long GetInt64() => _jsonReader.GetInt64();
    public sbyte GetSByte() => _jsonReader.GetSByte();
    public float GetSingle() => _jsonReader.GetSingle();
    public string GetString() => _jsonReader.GetString();
    public uint GetUInt32() => _jsonReader.GetUInt32();
    public ulong GetUInt64() => _jsonReader.GetUInt64();
    public bool TryGetByte(out byte value) => _jsonReader.TryGetByte(out value);
    public bool TryGetBytesFromBase64(out byte[] value) => _jsonReader.TryGetBytesFromBase64(out value);
    public bool TryGetDateTime(out DateTime value) => _jsonReader.TryGetDateTime(out value);
    public bool TryGetDateTimeOffset(out DateTimeOffset value) => _jsonReader.TryGetDateTimeOffset(out value);
    public bool TryGetDecimal(out decimal value) => _jsonReader.TryGetDecimal(out value);
    public bool TryGetDouble(out double value) => _jsonReader.TryGetDouble(out value);
    public bool TryGetGuid(out Guid value) => _jsonReader.TryGetGuid(out value);
    public bool TryGetInt16(out short value) => _jsonReader.TryGetInt16(out value);
    public bool TryGetInt32(out int value) => _jsonReader.TryGetInt32(out value);
    public bool TryGetInt64(out long value) => _jsonReader.TryGetInt64(out value);
    public bool TryGetSByte(out sbyte value) => _jsonReader.TryGetSByte(out value);
    public bool TryGetSingle(out float value) => _jsonReader.TryGetSingle(out value);
    public bool TryGetUInt16(out ushort value) => _jsonReader.TryGetUInt16(out value);
    public bool TryGetUInt32(out uint value) => _jsonReader.TryGetUInt32(out value);
    public bool TryGetUInt64(out ulong value) => _jsonReader.TryGetUInt64(out value);

    private sealed class SequenceSegment : ReadOnlySequenceSegment<byte>, IDisposable
    {
        internal IMemoryOwner<byte> Buffer { get; }
        internal SequenceSegment? Previous { get; set; }
        private bool _disposed;

        public SequenceSegment(int size, SequenceSegment? previous)
        {
            Buffer = MemoryPool<byte>.Shared.Rent(size);
            Previous = previous;

            Memory = Buffer.Memory;
            RunningIndex = previous?.RunningIndex + previous?.Memory.Length ?? 0;
        }

        public void SetNext(SequenceSegment next) => Next = next;

        public void Dispose()
        {
            if (!_disposed)
            {
                _disposed = true;
                Buffer.Dispose();
                Previous?.Dispose();
            }
        }
    }
}

You can use it as a replacement for Utf8JsonReader, or to deserialize JSON into typed objects (as a wrapper around System.Text.Json.JsonSerializer.Deserialize).

Example usage, deserializing objects from a huge JSON array:

using var stream = new FileStream("LargeData.json", FileMode.Open, FileAccess.Read);
using var jsonStreamReader = new Utf8JsonStreamReader(stream, 32 * 1024);

jsonStreamReader.Read(); // move to array start
jsonStreamReader.Read(); // move to start of the object

while (jsonStreamReader.TokenType != JsonTokenType.EndArray)
{
    // deserialize object
    var obj = jsonStreamReader.Deserialize<TestData>();

    // JsonSerializer.Deserialize ends on last token of the object parsed,
    // move to the first token of next object
    jsonStreamReader.Read();
}

The Deserialize method reads data from the stream until it finds the end of the current object. It then constructs a new Utf8JsonReader over the data read so far and calls JsonSerializer.Deserialize.
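The depth-counting idea can be tried in isolation on data that already sits in memory. The following is only a sketch of that idea, not part of the wrapper: it walks a top-level array, tracks nesting depth to find where each element ends, and slices out the raw JSON of that element. The names DepthSlicing and SplitArrayElements are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class DepthSlicing
{
    // Returns the raw JSON text of each element of a top-level array.
    public static List<string> SplitArrayElements(byte[] utf8Json)
    {
        var elements = new List<string>();
        var reader = new Utf8JsonReader(utf8Json);

        if (!reader.Read() || reader.TokenType != JsonTokenType.StartArray)
            throw new JsonException("Expected a top-level array.");

        while (reader.Read() && reader.TokenType != JsonTokenType.EndArray)
        {
            int start = (int)reader.TokenStartIndex;   // first byte of this element
            int depth = 0;

            // Advance until the element that starts at 'start' is closed.
            do
            {
                if (reader.TokenType == JsonTokenType.StartObject || reader.TokenType == JsonTokenType.StartArray)
                    depth++;
                else if (reader.TokenType == JsonTokenType.EndObject || reader.TokenType == JsonTokenType.EndArray)
                    depth--;
            }
            while (depth > 0 && reader.Read());

            int end = (int)reader.BytesConsumed;       // one past the last byte of the element
            elements.Add(Encoding.UTF8.GetString(utf8Json, start, end - start));
        }

        return elements;
    }
}
```

Each returned slice is a complete JSON value, so it could be passed to JsonSerializer.Deserialize on its own, which is the same reason the wrapper has to locate the end of the object before handing data to the serializer.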

Other methods are passed through to Utf8JsonReader.

And, as always, don't forget to dispose your objects at the end.

riQQ
mtosh
  • Ref structs are a C# 7.2 feature, are they not? – Timo May 07 '19 at 11:05
  • 2
    Ref struct **dispose** is a C# 8 feature. In C# 8, above struct can be used in using statement, while in C# 7.2 it cannot be. You would have to dispose it manually in C# 7.2. – mtosh May 09 '19 at 15:33
  • 1
    How do you use it with `JsonSerializer.Deserialize`? I am trying to deserialize full complex types one by one from an array. – Kalten Sep 28 '19 at 14:58
  • @Skami It's a struct since I tried to keep it structurally as similar as possible to Utf8JsonReader (I use it interchangeably depending on need). And since it has Utf8JsonReader (which is ref struct) as a property, it is also required to be ref struct as well. – mtosh Oct 08 '19 at 18:50
  • @Kalten This wrapper is not intended for use in combination with Json.NET. Json.NET already has a decent API for parsing large typed arrays, check https://stackoverflow.com/questions/43747477/how-to-parse-huge-json-file-as-stream-in-json-net – mtosh Oct 08 '19 at 18:55
  • 1
    @mtosh I was talking about `System.Text.Json.JsonSerializer.Deserialize(...)`. I want to completely replace Json.Net by system.text.json but seem like I need to rewrite the type deserializer myself – Kalten Oct 08 '19 at 20:02
  • 1
    @Kalten I have added wrapper around JsonSerializer.Deserialize and example of usage to the answer. Please note that JsonSerializer.Deserialize can deserialize only a single object at time, thus we have to read the data (but not parse) until we find the end of the current object, and then construct Utf8JsonReader on given segment of data. – mtosh Oct 13 '19 at 15:56
  • @mtosh Thanks for the update. I mostly managed to use it and the code works well: it reads until the first StartArray token and then starts the deserialize loop. But after 6 rows it fails with the error `Expected end of string, but instead reached end of data. LineNumber: 156 | BytePositionInLine: 12.` I tried many buffer sizes with the same result. Not sure if this is the right place to discuss that. – Kalten Oct 14 '19 at 22:04
  • 7
    @mtosh Thanks for this sample, it made a great starting point. I've been using it with some large json samples and found a few bugs. I've posted a fixed up version at https://github.com/evil-dr-nick/utf8jsonstreamreader/blob/master/Utf8JsonStreamReader/Utf8JsonStreamReader.cs – evil-dr-nick Nov 25 '19 at 14:22
  • 1
    Thanks for the helpful class. Why hasn't Microsoft provided a class to do this? – Guerrilla Jan 11 '20 at 00:47
  • **mtosh** thanks for providing the initial version. And @evil-dr-nick thanks for sharing the fixed version. Apparently this is an ongoing issue, you can find it here. [Enable Utf8JsonReader to read json from stream](https://github.com/dotnet/runtime/issues/30328). – om-ha Feb 18 '20 at 09:39
  • @mtosh Could you please elaborate on what you mean with **don't forget to dispose your objects at the end**? Isn't the [using statement](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/using-statement) already self-disposing at the end of the local scope? i.e. `Dispose` is invoked in a deferred manner. – om-ha Feb 18 '20 at 10:03
3

With .NET 6 or later, we can use the DeserializeAsyncEnumerable method to read in streaming fashion over a large JSON file that has an array of items. I've used this to process a 5 GB JSON file with >100,000 items.

using var file = File.OpenRead(path);
var items = JsonSerializer.DeserializeAsyncEnumerable<JsonElement>(file);
await foreach (var item in items)
{
    // Process JSON object
}
Noah Stahl
  • 3
    This is nice, but it only works if the large array is at the "root"/first level. It does not work if you have something like `{ "items": ["one", "two", "three"] }` which I find is very common. I would love to have something along the lines of the good old XmlReader for "SAX" like parsing. https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmlreader?f1url=%3FappId%3DDev16IDEF1%26l%3DEN-US%26k%3Dk(System.Xml.XmlReader)%3Bk(DevLang-csharp)%26rd%3Dtrue&view=net-7.0 – Thomas Olsson Mar 30 '23 at 13:56
1

If you use async, there is a method that takes a stream (plus a generic version):

DeserializeAsync(Stream utf8Json, Type returnType, JsonSerializerOptions options = null, CancellationToken cancellationToken = default);
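A short sketch of how the generic overload might be called. The Entity type and the AsyncDeserializeExample.LoadAsync helper are hypothetical names used only for illustration. Note that although DeserializeAsync reads the stream incrementally, it still materializes the whole result object, so on its own it does not help when the complete document cannot fit in memory.

```csharp
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical target type; property names match the JSON exactly because
// System.Text.Json is case-sensitive by default.
public sealed class Entity
{
    public string Id { get; set; } = "";
    public int Count { get; set; }
}

public static class AsyncDeserializeExample
{
    // Reads one JSON document from the stream and binds it to Entity.
    public static ValueTask<Entity?> LoadAsync(Stream utf8Json)
        => JsonSerializer.DeserializeAsync<Entity>(utf8Json);
}
```

Usage would look like `Entity? e = await AsyncDeserializeExample.LoadAsync(File.OpenRead(path));`.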
user1664043