56

I have a very large JSON file (1000+ MB) containing a list of identically structured JSON objects. For example:

[
    {
        "id": 1,
        "value": "hello",
        "another_value": "world",
        "value_obj": {
            "name": "obj1"
        },
        "value_list": [
            1,
            2,
            3
        ]
    },
    {
        "id": 2,
        "value": "foo",
        "another_value": "bar",
        "value_obj": {
            "name": "obj2"
        },
        "value_list": [
            4,
            5,
            6
        ]
    },
    {
        "id": 3,
        "value": "a",
        "another_value": "b",
        "value_obj": {
            "name": "obj3"
        },
        "value_list": [
            7,
            8,
            9
        ]
    },
    ...
]

Every single item in the root JSON list follows the same structure and thus would be individually deserializable. I already have the C# classes written to receive this data, and deserializing a JSON file containing a single object without the list works as expected.

At first, I tried to just directly deserialize my objects in a loop:

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<MyObject>(reader);
    }
}

This didn't work: it threw an exception clearly stating that an object was expected, not a list. My understanding is that this call reads a single object at the root level of the JSON file, and since the root here is a list of objects, the request is invalid.

My next idea was to deserialize as a C# List of objects:

JsonSerializer serializer = new JsonSerializer();
List<MyObject> o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (!sr.EndOfStream)
    {
        o = serializer.Deserialize<List<MyObject>>(reader);
    }
}

This does succeed. However, it only somewhat reduces the issue of high RAM usage. In this case it does look like the application is deserializing items one at a time, and so is not reading the entire JSON file into RAM, but we still end up with a lot of RAM usage because the C# List object now contains all of the data from the JSON file in RAM. This has only displaced the problem.

I then decided to simply try taking a single character off the beginning of the stream (to eliminate the [) by doing sr.Read() before going into the loop. The first object then does read successfully, but subsequent ones do not, with an exception of "unexpected token". My guess is this is the comma and space between the objects throwing the reader off.

Simply removing square brackets won't work since the objects do contain a primitive list of their own, as you can see in the sample. Even trying to use }, as a separator won't work since, as you can see, there are sub-objects within the objects.

My goal is to read the objects from the stream one at a time: read an object, do something with it, discard it from RAM, then read the next object, and so on. This would eliminate the need to load either the entire JSON string or the entire contents of the data into RAM as C# objects.

What am I missing?

fdmillion
  • 2
    This should give you idea.. basically use JsonTextReader and parse each object individually. http://stackoverflow.com/questions/32227436/parsing-large-json-file-in-net https://dotnetfiddle.net/2TQa8p – loneshark99 May 02 '17 at 21:31
  • 3
    Possible duplicate of [Parsing large json file in .NET](http://stackoverflow.com/questions/32227436/parsing-large-json-file-in-net) – Heretic Monkey May 02 '17 at 21:39
  • 1
    Why not use `yield return` keyword to stream each object one by one ? – Kalten May 02 '17 at 21:43

5 Answers

77

This should resolve your problem. It works just like your initial code, except that it deserializes an object only when the reader reaches a { character (a StartObject token); otherwise it keeps reading until it finds the next start-object token.

JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // Deserialize only when the reader is positioned on a "{" (StartObject) token
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
            // Process o here; it can be garbage-collected once the next item replaces it
        }
    }
}
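The loop above can also be packaged as an iterator, along the lines of the yield return suggestion in the comments on the question. The following is only a minimal sketch, assuming Json.NET (Newtonsoft.Json) is referenced: the helper name StreamArrayItems is hypothetical, JObject is used because the MyObject class is not shown, and a small in-memory string stands in for bigfile.json.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Hypothetical helper: lazily yields each object of a root-level JSON array.
static IEnumerable<T> StreamArrayItems<T>(TextReader text)
{
    var serializer = new JsonSerializer();
    using (var reader = new JsonTextReader(text))
    {
        while (reader.Read())
        {
            // Deserialize consumes the whole object, nested values included,
            // so the next Read() moves on to the following array item.
            if (reader.TokenType == JsonToken.StartObject)
            {
                yield return serializer.Deserialize<T>(reader);
            }
        }
    }
}

// Small in-memory sample standing in for bigfile.json.
string json = @"[
    { ""id"": 1, ""value_obj"": { ""name"": ""obj1"" }, ""value_list"": [1, 2, 3] },
    { ""id"": 2, ""value_obj"": { ""name"": ""obj2"" }, ""value_list"": [4, 5, 6] }
]";

var ids = new List<int>();
foreach (JObject item in StreamArrayItems<JObject>(new StringReader(json)))
{
    // Only the current item is held here; earlier items are eligible for collection.
    ids.Add((int)item["id"]);
}
Console.WriteLine(string.Join(",", ids)); // prints 1,2
```

For a real file, passing new StreamReader(File.OpenRead("bigfile.json")) in place of the StringReader should behave the same way, since the reader only pulls tokens as the foreach advances.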
Leniel Maccaferri
nocodename
  • 4
    Essentially, this approach skips forward until it sees the beginning of an object (skipping the 'start array' token), then reads the object. Presumably you'd then process it before going on to the next iteration of the loop. Since you don't have to keep all the objects in memory at once, only the one being operated on (`o`), it's much more memory-efficient. – Jonathan Rupp May 02 '17 at 21:48
51

I think we can do better than the accepted answer, using more features of JsonReader to make a more generalized solution.

As a JsonReader consumes tokens from a JSON document, the path to the current token is recorded in the JsonReader.Path property.

We can use this to precisely select deeply nested data from a JSON file, using regex to ensure that we're on the right path.

So, using the following extension method:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using Newtonsoft.Json;

public static class JsonReaderExtensions
{
    public static IEnumerable<T> SelectTokensWithRegex<T>(
        this JsonReader jsonReader, Regex regex)
    {
        JsonSerializer serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            if (regex.IsMatch(jsonReader.Path) 
                && jsonReader.TokenType != JsonToken.PropertyName)
            {
                yield return serializer.Deserialize<T>(jsonReader);
            }
        }
    }
}

The data you are concerned with lies on paths:

[0]
[1]
[2]
... etc

We can construct the following regex to precisely match this path:

var regex = new Regex(@"^\[\d+\]$");

It now becomes possible to stream objects out of your data (without fully loading or parsing the entire JSON) as follows:

IEnumerable<MyObject> objects = jsonReader.SelectTokensWithRegex<MyObject>(regex);

Or, if we want to dig even deeper into the structure, we can be even more precise with our regex:

var regex = new Regex(@"^\[\d+\]\.value$");
IEnumerable<string> objects = jsonReader.SelectTokensWithRegex<string>(regex);

to only extract value properties from the items in the array.

I've found this technique extremely useful for extracting specific data from huge (100 GiB) JSON dumps, directly from HTTP using a network stream (with low memory requirements and no intermediate storage required).
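The same idea should carry over to arrays that sit under a property rather than at the root, such as the { "property1": 1, "property2": 2, "payload": [...] } shape raised in the comments. This is a minimal sketch, not part of the original answer: it assumes Json.NET is referenced, restates the method as a plain local function so the snippet stands alone, and uses a made-up in-memory document with JObject items.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

// Same logic as SelectTokensWithRegex, restated as a local function
// (not an extension method) so that this snippet is self-contained.
static IEnumerable<T> SelectTokensWithRegex<T>(JsonReader jsonReader, Regex regex)
{
    var serializer = new JsonSerializer();
    while (jsonReader.Read())
    {
        if (regex.IsMatch(jsonReader.Path)
            && jsonReader.TokenType != JsonToken.PropertyName)
        {
            yield return serializer.Deserialize<T>(jsonReader);
        }
    }
}

// Hypothetical wrapper document: the items of interest sit under "payload".
string json = @"{ ""property1"": 1, ""property2"": 2,
    ""payload"": [ { ""id"": 1 }, { ""id"": 2 }, { ""id"": 3 } ] }";

// Items of the payload array have the paths payload[0], payload[1], ...
var regex = new Regex(@"^payload\[\d+\]$");

var ids = new List<int>();
using (var reader = new JsonTextReader(new StringReader(json)))
{
    foreach (JObject item in SelectTokensWithRegex<JObject>(reader, regex))
    {
        ids.Add((int)item["id"]);
    }
}
Console.WriteLine(string.Join(",", ids)); // prints 1,2,3
```

Anchoring the pattern with ^ and $ keeps it from matching arrays nested deeper inside each item.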

spender
  • 9
    I actually like that more than my own answer :P – nocodename Feb 12 '20 at 12:09
  • I ran into an issue where my json file was missing the opening [ and closing ] so the jsonreader threw an error when parsing the first element. One way to fix that is with a CompsiteStream class that lets you prepend and append strings on your stream. Here's one I used with good results: https://stackoverflow.com/a/15800603/10221 – Steve Hiner Aug 05 '20 at 00:26
  • 1
    @SteveHiner Actually, it's possible to switch JSON.net into a mode that handles JSON streams (i.e. more than one root element). See https://www.newtonsoft.com/json/help/html/ReadMultipleContentWithJsonReader.htm – spender Mar 12 '21 at 17:59
  • 1
    I am having trouble creating regex for the nested main json in which I am interested. My file looks like `{ "property1": 1, "property2": 2, "payload": [{}, {},...,{}] }` I am interested in fetching only objects stored in payload array. – Nitin Kt Jul 13 '21 at 06:11
  • I'm also interested in how you would modify the regex here for a single nested json object - where the object has a property which is an array, and I would like to parse each object in the array one at a time. – mchristos Jan 12 '22 at 18:04
  • This regex gets me objects inside a list under any property: `\[\s*([^\[\]]*?)\s*\]` However like you I'm only after a single property like "payload" – mchristos Jan 12 '22 at 18:18
  • spender when we use this Regex search method doesn't it mean that we still load the entire stream into memory for the Regex to be able to search it? I am not sure of it, that is why I'm asking. On the other side, the solution that @nocodename presented seems to load in memory just the stream that is specified by the searched token. Please correct me if I'm wrong. In any case, thank you for presenting the solution. I just faced this issue and this post makes it make sense :) – SimpForJS Apr 21 '22 at 12:49
  • 1
    @SimpForJS When JsonReader steps through the JSON, it does so one token at a time and doesn't retain any history (other than the `Path` property). The `Path` property is maintained as the reader drops into and out of objects and arrays, but this doesn't require the entire JSON to be parsed in one go. – spender Apr 21 '22 at 14:16
11

.NET 6

This is easily done with the System.Text.Json.JsonSerializer in .NET 6:

using (FileStream fileStream = new FileStream("hugefile.json", FileMode.Open))
{
    IAsyncEnumerable<Person?> people = JsonSerializer.DeserializeAsyncEnumerable<Person?>(fileStream);
    await foreach (Person? person in people)
    {
        // An array element may be a JSON null, so guard before dereferencing
        Console.WriteLine($"Hello, my name is {person?.Name}!");
    }
}
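For trying this out without a large file or a Person class, here is a minimal self-contained sketch under a few assumptions: it targets .NET 6 or later, reads from an in-memory stream that stands in for hugefile.json, and deserializes each item to JsonElement, using field names borrowed from the question's sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;

// Small in-memory sample standing in for hugefile.json.
string json = @"[
    { ""id"": 1, ""value"": ""hello"", ""value_list"": [1, 2, 3] },
    { ""id"": 2, ""value"": ""foo"", ""value_list"": [4, 5, 6] }
]";

var values = new List<string>();
using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(json)))
{
    // Items are produced as the enumerator advances; the stream is read in
    // buffered chunks and the root array is never built up as one list.
    await foreach (JsonElement item in JsonSerializer.DeserializeAsyncEnumerable<JsonElement>(stream))
    {
        int id = item.GetProperty("id").GetInt32();
        string? value = item.GetProperty("value").GetString();
        values.Add($"{id}:{value}");
    }
}
Console.WriteLine(string.Join(",", values)); // prints 1:hello,2:foo
```

When binding to a class instead of JsonElement, keep in mind that System.Text.Json matches property names case-sensitively by default, so lower-case JSON keys need either PropertyNameCaseInsensitive = true in JsonSerializerOptions or [JsonPropertyName] attributes on the class.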
TFisicaro
Josh Withee
0

Here is another easy way to parse a large JSON file: Cinchoo ETL, an open-source library that uses JSON.NET under the hood to parse the JSON in a streaming manner.

using (var r = ChoJSONReader<MyObject>.LoadText(json))
{
    foreach (var rec in r)
        Console.WriteLine(rec.Dump());
}

Sample fiddle: https://dotnetfiddle.net/i5qJ5R

Cinchoo
-5

Is this what you're looking for? Found on a previous question

The current version of Json.net does not allow you to use the accepted answer code. A current alternative is:

public static object DeserializeFromStream(Stream stream)
{
    var serializer = new JsonSerializer();

    using (var sr = new StreamReader(stream))
    using (var jsonTextReader = new JsonTextReader(sr))
    {
        return serializer.Deserialize(jsonTextReader);
    }
}

Documentation: Deserialize JSON from a file stream

AKTheKnight