
I have a huge (approx. 50GB) JSON file to deserialize. The JSON file consists of 14 arrays, and a short example of it can be found here.

I wrote my POCO file, declaring 15 classes (one for each array, plus a root class), and now I am trying to get my data in. Since the original data are huge and come in a zip file, I am trying not to unpack the whole thing; hence the use of System.IO.Compression in the following code.

using System.IO.Compression;
using System.Text.Json;
using System.Text.Json.Nodes;

namespace read_and_parse
{
    internal class Program
    {
        static void Main() 
        {
            var fc = new Program();

            string zip_path = @"C:\Projects\BBR\Download_Total\example_json.zip";
            using FileStream file = File.OpenRead(zip_path);
            using (var zip = new ZipArchive(file, ZipArchiveMode.Read))
            {
                foreach (ZipArchiveEntry entry in zip.Entries)
                {

                    string[] name_split = entry.Name.Split('_');
                    string name = name_split.Last().Substring(0, name_split.Last().Length - 5);
                    bool canConvert = long.TryParse(name, out long number1);
                    if (canConvert == true)
                    {
                        Task task = fc.ParseJsonFromZippedFile(entry);
                    }
                }
            }
        }

        private async Task ParseJsonFromZippedFile(ZipArchiveEntry entry)
        {
            JsonSerializerOptions options = new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };
            await using Stream entryStream = entry.Open();

            IAsyncEnumerable<JsonNode?> enumerable = JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(entryStream, options);
            await foreach (JsonNode? obj in enumerable) 
            {
                // Parse only subset of the object
                JsonNode? bbrSagNode = obj?["BBRSaglist"];
                if (bbrSagNode is null) continue;
                else
                {
                    var bbrSag = bbrSagNode.Deserialize<BBRSagList>();                    
                }
            }

        }

    }
}

Unfortunately I do not get anything out of it; it fails in the foreach loop of the task with a System.Threading.Tasks.VoidTaskResult.

How do I get the data deserialized?

TomGeo
    _"I have a huge (approx. 50GB) JSON file to deserialize."_ - why?? This is madness! – Fildor Jul 12 '23 at 14:10
  • Try the vertx streaming json parser – Asad Awadia Jul 12 '23 at 14:12
    Ah, and you need to make it a `async Task Main()` and `await fc.ParseJsonFromZippedFile(entry);` – Fildor Jul 12 '23 at 14:12
  • Well, the why is something that I am asking myself everytime... but I have no other choice... it's either JSON or XML -> pest or cholera! – TomGeo Jul 12 '23 at 14:14
    I think he meant "how did you end with this" ? Is it due to a migration of some sort ? wouldn't it be much easier to automate that through a script ? this seems like a serious design flaw, then again technical debt is a thing so i can see how it might happen. – N.K Jul 12 '23 at 14:15
  • `var bbrSag = bbrSagNode.Deserialize<BBRSagList>();` - let's say you solved everything else... you don't seem to do anything with the value outside of letting it go out of scope and have GC devour it? – Fildor Jul 12 '23 at 14:15
  • @Fildor-standswithMods This is not the end of it. :) I will have to write the data into some SQL tables. – TomGeo Jul 12 '23 at 14:18
  • @N.K It's the only way I get the data from the distributor. – TomGeo Jul 12 '23 at 14:18
    Wow. Yeah that's ridiculous (not you - the distributor). Even CSV would make more sense. At least you could feed that to DB Tooling directly ... – Fildor Jul 12 '23 at 14:19
  • @N.K My biggest issue is, that basically every language I know is attempting to read the entire file into RAM. Well obviously, that's also where that attempt fails. – TomGeo Jul 12 '23 at 14:20
  • Please [edit] your question to share a JSON sample. A Zip file sample (e.g. as a base64 encoded string) would be even more helpful. It doesn't need to be huge obviously, just enough entries to demonstrate the problem -- i.e. a [mcve]. – dbc Jul 12 '23 at 14:24
  • Yes, that's completely out of question, of course. That's exactly why I would slap the distributor around with a large trout. – Fildor Jul 12 '23 at 14:24
    @Fildor-standswithMods Could use FastMember and get a DbDataReader over the enumerable, then just pass it to `SqlBulkCopy`. The whole thing would be fully streaming. – Charlieface Jul 12 '23 at 14:25
  • @Charlieface sounds like a plan. I am still flabbergasted by the audacity to just drop off a 50GB zipped json ... – Fildor Jul 12 '23 at 14:28
  • @dbc I agree it would be more convenient to have a zip file... The linked JSON file contains 2 items in each array of the original JSON file. That there are two items in each array is on purpose for this minimal example. – TomGeo Jul 12 '23 at 14:32
  • Either way, your problem is solved for the moment: use `await fc.ParseJsonFromZippedFile(entry);`. Do you have any further problems? – Charlieface Jul 12 '23 at 14:34
    [this comment](https://stackoverflow.com/questions/76671482/how-to-deserialize-a-huge-complex-json-file#comment135175964_76671482) by @Fildor-standswithMods is correct, you need an `async Main()` and an `await fc.ParseJsonFromZippedFile(entry);` to actually stream through the file. Maybe doing that will fix your problem, but if not, we need a [mcve] to help you. – dbc Jul 12 '23 at 14:35
  • @TomGeo figured it would be something like that, and yes, after thinking about what i wrote you would still need to preparse that file into smaller bits or have some extreme levels of real time reading trickery going on to make it work, while 1) may be valid, 2) is IMO not even worth trying and both are absolutely a waste of time. No reason why the distributor couldn't give you a .csv instead, i'm willing to bet this is a migration type task and they just don't want to bother. Fight them on that is my recommendation. – N.K Jul 12 '23 at 14:37
  • @Charlieface Well, the line `IAsyncEnumerable<JsonNode?> enumerable = JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(entryStream, options);` returns a 'null' for the *enumerable*... – TomGeo Jul 12 '23 at 14:38
  • Most DB management tools provide some sort of multi format exporting options, they have no excuse and this is bullshit. – N.K Jul 12 '23 at 14:38
    @TomGeo - then we will need to see a [mcve]. Maybe it's not actually JSON, maybe it's [NDJSON](http://ndjson.org/) and so can't be deserialized with System.Text.Json at all. Or maybe the root container is a `{ "data" : [ /* contents */ ] }` object and not an array. – dbc Jul 12 '23 at 14:39
  • I have uploaded the zip here: https://drive.google.com/file/d/1WIq5rTkj-6bLDJvVa3YzGW5cfkXsvk-_/view?usp=drive_link – TomGeo Jul 12 '23 at 14:41
    Returns a null object while enumerating, or returns an actual enumerator that's null? The latter seems unlikely, it's against the spec which indicates it's a non-nullable value https://learn.microsoft.com/en-us/dotnet/api/system.text.json.jsonserializer.deserializeasyncenumerable?view=net-7.0#system-text-json-jsonserializer-deserializeasyncenumerable-1(system-io-stream-system-text-json-jsonserializeroptions-system-threading-cancellationtoken) – Charlieface Jul 12 '23 at 14:41
  • Can't access it, says I have to sign in. Also, is that the 50GB zip file? – dbc Jul 12 '23 at 14:43
  • @TomGeo That's an example, right? You didn't just post your _actual_ data into the internet? Right? R i g h t ? – Fildor Jul 12 '23 at 14:43
  • @Fildor-standswithMods yes, these are just sample data – TomGeo Jul 12 '23 at 14:46
  • @dbc access is set to public... – TomGeo Jul 12 '23 at 14:47
  • You need to use a parser with a visitor-style API, not one that tries to deserialize a DTO. – Ben Voigt Jul 12 '23 at 14:51
  • Got it. Your root container is indeed an object, not an array: `{ "BBRSagList": [] }`. You can't use `JsonSerializer.DeserializeAsyncEnumerable<>()` with that, it's designed for arrays only. In fact System.Text.Json does not provide support for streaming in general, it supports pipelining. If you want to do streaming in general you have to write lots of custom code, see [this answer](https://stackoverflow.com/a/55429664/3744182) by mtosh to [Parsing a JSON file with .NET core 3.0/System.text.Json](https://stackoverflow.com/q/54983533/3744182). Or switch to Json.NET which does do streaming. – dbc Jul 12 '23 at 14:51
  • Are you only interested in the contents of the `"BBRSagList"` array? There are other arrays in addition. – dbc Jul 12 '23 at 14:53
  • Sorry guys, I adjusted the code a bit, since all the name handling for the input file is not needed in the example. https://pastebin.com/kgkZ92sA – TomGeo Jul 12 '23 at 14:53
  • @dbc Thanks for looking at it and your answer! In general I am interested in all the arrays in the JSON object. But this can happen in an iterative way... it does not need to be all done in one perfect run. – TomGeo Jul 12 '23 at 14:56
  • @dbc I do not have a problem using Newtonsoft... – TomGeo Jul 12 '23 at 14:57

2 Answers


Your root JSON container is not an array, it's an object:

{
    "BBRSagList": [ /* Contents of BBRSagList */ ],
    "BygningList": [ /* Contents of BygningList*/ ]
}

You will not be able to use JsonSerializer.DeserializeAsyncEnumerable<T> to deserialize such JSON, because this method only supports async streaming deserialization of JSON arrays, not objects. Unfortunately, System.Text.Json does not directly support streaming deserialization of objects, or even streaming in general; it supports pipelining. If you need to stream through a file using System.Text.Json, you will need to build on this answer by mtosh to Parsing a JSON file with .NET core 3.0/System.text.Json.
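To illustrate the limitation (a minimal sketch with made-up data, not from the original post): DeserializeAsyncEnumerable<T>() streams the elements of a top-level JSON array one at a time, but rejects an object root such as the one in the question.

```csharp
using System.Text;
using System.Text.Json;

// Works: the root value is an array, so elements stream one at a time.
var arrayJson = "[ { \"Id\": 1 }, { \"Id\": 2 } ]";
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(arrayJson));
await foreach (var item in JsonSerializer.DeserializeAsyncEnumerable<JsonElement>(stream))
{
    Console.WriteLine(item); // each array element arrives individually
}

// Fails: with an object root such as { "BBRSagList": [ ... ] } the same
// call throws a JsonException, because the method expects StartArray as
// the first token -- which is exactly the situation in the question.
```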

As an alternative, you could use Json.NET which is designed for streaming via JsonTextReader. Your JSON object consists of multiple array-valued properties, and using Json.NET you will be able to stream through your entryStream asynchronously, load each array value into a JToken, then call some callback for each token.
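For context before the full extension methods, here is a minimal sketch of the underlying Json.NET pattern (using a small in-memory sample; with the real 50GB entry, the stream from entry.Open() would replace the StringReader): JsonTextReader walks the document token by token, and JToken.Load() materializes exactly one array element at a time.

```csharp
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

var json = "{ \"BBRSagList\": [ { \"id\": 1 }, { \"id\": 2 } ], \"Other\": [] }";
using var reader = new JsonTextReader(new StringReader(json));

while (reader.Read())
{
    // When we reach the property we care about, load each array element
    // individually; only one element is ever in memory at a time.
    if (reader.TokenType == JsonToken.PropertyName && (string?)reader.Value == "BBRSagList")
    {
        reader.Read(); // advance onto StartArray
        while (reader.Read() && reader.TokenType != JsonToken.EndArray)
        {
            var item = JToken.Load(reader); // loads exactly one element
            Console.WriteLine(item.ToString(Formatting.None));
        }
    }
}
```

The extension methods below wrap this same idea in an asynchronous, callback-driven form.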

First, introduce the following extension methods:

public static partial class JsonExtensions
{
    /// <summary>
    /// Asynchronously stream through a stream containing a JSON object whose properties have array values, calling the callback registered for each property name.
    /// The root JSON value must be an object or an exception will be thrown.
    /// </summary>
    public static async Task StreamJsonObjectArrayPropertyValues(Stream stream, Dictionary<string, Action<JToken>> itemActions, FloatParseHandling? floatParseHandling = default, DateParseHandling? dateParseHandling = default, CancellationToken cancellationToken = default)
    {
        // StreamReader and JsonTextReader do not implement IAsyncDisposable so let the caller dispose the stream.
        using (var textReader = new StreamReader(stream, leaveOpen : true))
        using (var reader = new JsonTextReader(textReader) { CloseInput = false })
        {
            if (floatParseHandling != null)
                reader.FloatParseHandling = floatParseHandling.Value;
            if (dateParseHandling != null)
                reader.DateParseHandling = dateParseHandling.Value;
            await StreamJsonObjectArrayPropertyValues(reader, itemActions, cancellationToken).ConfigureAwait(false);
        }
    }

    /// <summary>
    /// Asynchronously stream through a JSON object whose properties have array values, calling the callback registered for each property name.
    /// The reader must be positioned on an object or an exception will be thrown.
    /// </summary>
    public static async Task StreamJsonObjectArrayPropertyValues(JsonReader reader, Dictionary<string, Action<JToken>> actions, CancellationToken cancellationToken = default)
    {
        var loadSettings = new JsonLoadSettings { LineInfoHandling = LineInfoHandling.Ignore }; // For performance do not load line info.
        (await reader.MoveToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).AssertTokenType(JsonToken.StartObject);
        while ((await reader.ReadToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).TokenType != JsonToken.EndObject)
        {
            if (reader.TokenType != JsonToken.PropertyName)
                throw new JsonReaderException();
            var name = (string)reader.Value!;
            await reader.ReadToContentAndAssertAsync(cancellationToken).ConfigureAwait(false);
            if (actions.TryGetValue(name, out var action) && reader.TokenType == JsonToken.StartArray)
            {
                await foreach (var token in reader.LoadAsyncEnumerable(loadSettings, cancellationToken).ConfigureAwait(false))
                {
                    action(token);
                }
            }
            else
            {
                await reader.SkipAsync().ConfigureAwait(false);
            }
        }
    }
    
    /// <summary>
    /// Asynchronously load and return JToken values from a stream containing a JSON array.  
    /// The reader must be positioned on an array or an exception will be thrown.
    /// </summary>
    public static async IAsyncEnumerable<JToken> LoadAsyncEnumerable(this JsonReader reader, JsonLoadSettings? settings = default, [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        (await reader.MoveToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).AssertTokenType(JsonToken.StartArray);
        cancellationToken.ThrowIfCancellationRequested();
        while ((await reader.ReadToContentAndAssertAsync(cancellationToken).ConfigureAwait(false)).TokenType != JsonToken.EndArray)
        {
            cancellationToken.ThrowIfCancellationRequested();
            yield return await JToken.LoadAsync(reader, settings, cancellationToken).ConfigureAwait(false);
        }
        cancellationToken.ThrowIfCancellationRequested();
    }

    public static JsonReader AssertTokenType(this JsonReader reader, JsonToken tokenType) => 
        reader.TokenType == tokenType ? reader : throw new JsonSerializationException(string.Format("Unexpected token {0}, expected {1}", reader.TokenType, tokenType));

    public static async Task<JsonReader> ReadToContentAndAssertAsync(this JsonReader reader, CancellationToken cancellationToken = default) =>
        await (await reader.ReadAndAssertAsync(cancellationToken).ConfigureAwait(false)).MoveToContentAndAssertAsync(cancellationToken).ConfigureAwait(false);

    public static async Task<JsonReader> MoveToContentAndAssertAsync(this JsonReader reader, CancellationToken cancellationToken = default)
    {
        if (reader == null)
            throw new ArgumentNullException();
        if (reader.TokenType == JsonToken.None)       // Skip past beginning of stream.
            await reader.ReadAndAssertAsync(cancellationToken).ConfigureAwait(false);
        while (reader.TokenType == JsonToken.Comment) // Skip past comments.
            await reader.ReadAndAssertAsync(cancellationToken).ConfigureAwait(false);
        return reader;
    }

    public static async Task<JsonReader> ReadAndAssertAsync(this JsonReader reader, CancellationToken cancellationToken = default)
    {
        if (reader == null)
            throw new ArgumentNullException();
        if (!await reader.ReadAsync(cancellationToken).ConfigureAwait(false))
            throw new JsonReaderException("Unexpected end of JSON stream.");
        return reader;
    }
}

And now you will be able to do the following, to process the entries in the "BBRSagList" array:

private static async Task ParseJsonFromZippedFile(ZipArchiveEntry entry)
{
    await using Stream entryStream = entry.Open();
    Dictionary<string, Action<JToken>> actions = new ()
    {
        ["BBRSagList"] = ProcessBBRSagList,
    };
    // Let each individual action recognize dates and times.
    await JsonExtensions.StreamJsonObjectArrayPropertyValues(entryStream, actions, dateParseHandling: DateParseHandling.None);
}

static void ProcessBBRSagList(JToken token)
{
    var brsagList = token.ToObject<BBRSagList>();
    
    // Handle each BBRSagList however you want.
    Console.WriteLine("Deserialized {0}, result = {1}", brsagList, JsonConvert.SerializeObject(brsagList));
}

Notes:

  • As observed by Fildor-standswithMods in comments, you must declare your Main() method as public static async Task Main() and await the call to ParseJsonFromZippedFile(entry):

    public static async Task Main()
    {
        string zip_path = @"C:\Projects\BBR\Download_Total\example_json.zip";
        using FileStream file = File.OpenRead(zip_path);
        using (var zip = new ZipArchive(file, ZipArchiveMode.Read))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                string[] name_split = entry.Name.Split('_');
                string name = name_split.Last().Substring(0, name_split.Last().Length - 5);
                bool canConvert = long.TryParse(name, out long number1);
                if (canConvert)
                {
                    await ParseJsonFromZippedFile(entry);
                }
            }
        }
    }
    

    (I made ParseJsonFromZippedFile() a static method so there is no reason to allocate a Program instance.)

Demo fiddle here.

dbc
    Thx for the mention :) – Fildor Jul 12 '23 at 16:03
  • `Utf8JsonReader` supports forward reading of a JSON document – Charlieface Jul 12 '23 at 19:57
  • @Charlieface - it does but not directly, and it's very tricky to use correctly because it's based on pipelines not streams and doesn't support an async `ReadAsync()` method. What you need to do is, if `Read()` returns false, check the state to see whether more is expected. If more is expected, unwind the stack and grab the next pipeline segment, then jump back down into what you were doing before. (MSFT's decision not to use its own `async` mechanism is quite mysterious to me.) – dbc Jul 12 '23 at 20:22
  • @Charlieface - [mtosh's answer](https://stackoverflow.com/a/55429664/3744182) has a wrapper that handles all that but it's quite complex and I haven't thoroughly digested it. – dbc Jul 12 '23 at 20:22
  • @Charlieface - none of that matters inside `JsonConverter.Read()` by the way because System.Text.Json preloads the entire contents of the incoming token before calling `JsonConverter.Read()`. It's only when outside a converter that it matters. – dbc Jul 12 '23 at 21:32
  • Point was you could loop the `Utf8JsonReader` yourself, then on each object call `serializer.Deserialize` – Charlieface Jul 12 '23 at 21:33
  • `Utf8JsonReader` constructors don't take a stream, they take a `ReadOnlySequence<byte>` or `ReadOnlySpan<byte>`. So I think you'd need to preload the entire file into memory to do that. But that's exactly what OP cannot do. So you would need to read part of the file, then go back to the stream and grab another sequence of bytes, then continue. – dbc Jul 12 '23 at 21:43
  • That's how it's supposed to be used. You are supposed to read a section of bytes from the stream, construct a reader over it and parse it, then take the `CurrentState` and store it so you can construct another reader for the next section of bytes. But might be complicated to make a `yield return` out of that, would probably need a custom `struct` iterator. – Charlieface Jul 12 '23 at 22:45
  • Sorry guys for my late reply, but after a full workday breaking my head on this stuff I clearly heard my garden saying "You need a break!". Back in the game now, and wow what an awesome work you did! Thank you so much guys, and you @dbc in particular! I will work on getting it running in my system. :-) – TomGeo Jul 13 '23 at 09:48
  • This is a great answer, i've learnt from it and i'm glad i decided to check on this thread. Thanks. – N.K Jul 13 '23 at 12:36

Due to my company's security policy I can't access the data example. Please check whether you have no root element, i.e. just a JSON array/list.

I made an example. Try the ToListAsync extension below; you can deserialize each object and add it to a main list, etc.

void Main()
{
    JsonSerializerOptions options = new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };
    var jobj = JsonNode.Parse("[{\"name\":\"Tom Cruise\",\"age\":56,\"Born At\":\"Syracuse, NY\",\"Birthdate\":\"July 3, 1962\",\"photo\":\"https://jsonformatter.org/img/tom-cruise.jpg\"},{\"name\":\"Robert Downey Jr.\",\"age\":53,\"Born At\":\"New York City, NY\",\"Birthdate\":\"April 4, 1965\",\"photo\":\"https://jsonformatter.org/img/Robert-Downey-Jr.jpg\"}]");
    
    var jStream = new System.IO.MemoryStream(System.Text.Encoding.UTF8.GetBytes(jobj.ToJsonString()));
    
    var _enumerable = Task.Run(() => System.Text.Json.JsonSerializer.DeserializeAsyncEnumerable<JsonNode>(jStream, options).ToListAsync());
    foreach (JsonNode obj in _enumerable.Result)
    {
        // Dump() is a LINQPad extension method; outside LINQPad, use Console.WriteLine instead.
        obj.Dump(obj["name"].ToString());
    }
}


public static class AsyncEnumerableExtensions
{
    public static async Task<List<T>> ToListAsync<T>(this IAsyncEnumerable<T> items,
        CancellationToken cancellationToken = default)
    {
        var results = new List<T>();
        await foreach (var item in items.WithCancellation(cancellationToken)
                                        .ConfigureAwait(false))
            results.Add(item);
        return results;
    }
}


Power Mouse