2

I'm creating a migration tool from SQL Server to MongoDB. As part of the SQL queries I'm using `for json auto`, which makes SQL Server return a JSON response. Using Dapper it looks like this...

// buffered: false so Dapper streams the rows rather than materializing them all first
var jsonResults = _SqlDb.Query<string>(usersWithDynamicDataQuery, buffered: false, commandTimeout: _SqlCommandTimeoutSeconds);
// SQL Server splits the FOR JSON output across multiple rows, so the chunks
// have to be stitched back together before deserializing
var jsonResult = string.Concat(jsonResults);
var userDocuments = JsonSerializer.Deserialize<List<UserData>>(jsonResult);

So SQL Server is returning a list of chunks of the full JSON response. I need to find some more "memory flexible" way other than just string.Concat(..), as I'm hitting the CLR limit for a single string allocation :\

I can always limit and page over the queries with ROW_NUMBER().. but I really want to utilize as much memory as I can (I have 128GB on the machine) and make the migration swift with large chunks of data...
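
For reference, the paging fallback I'd rather avoid would look roughly like this (just a sketch; the page size, table and ordering column are placeholders):

var pageSize = 50_000; // placeholder page size
for (var page = 0; ; page++)
{
    // OFFSET/FETCH paging (equivalent to a ROW_NUMBER() filter); each page stays
    // small enough that string.Concat is well below the CLR allocation limit.
    var pagedQuery = $@"
        SELECT *
        FROM Users
        ORDER BY Id
        OFFSET {page * pageSize} ROWS FETCH NEXT {pageSize} ROWS ONLY
        FOR JSON AUTO";

    var pageJson = string.Concat(_SqlDb.Query<string>(pagedQuery, buffered: false, commandTimeout: _SqlCommandTimeoutSeconds));
    if (string.IsNullOrEmpty(pageJson))
        break; // no more rows

    var pageDocuments = JsonSerializer.Deserialize<List<UserData>>(pageJson);
    // ... write pageDocuments to MongoDB ...
}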

  • Have you tried using a StringBuilder? – carlos chourio Apr 07 '20 at 12:13
  • That's what Concat is doing .. I'm gonna hit the same limit, won't I? – Mortalus Apr 07 '20 at 12:14
  • It's not actually the same: since strings are immutable objects, string.Concat has to create an entirely new object in memory combining the previous string and the next one, whereas StringBuilder reuses the same object in memory and appends the new value. Here you can see a simple explanation: https://www.tutorialsteacher.com/csharp/csharp-stringbuilder. I don't know if StringBuilder solves your problem, but you could give it a try. – carlos chourio Apr 07 '20 at 12:39
  • Is each element of the enumerable a single value? Can you construct each UserData from one string? – Jeremy Lakeman Apr 07 '20 at 13:07
  • @JeremyLakeman Unfortunately no .. they are chunks of 2033 characters each, per SQL Server's response to `for json auto` .. that's the real issue I'm having .. I need a full valid JSON stream to start deserialization into an array. As for StringBuilder, it didn't help .. we are talking GBs of memory here for a single string .. it's too much for the CLR – Mortalus Apr 07 '20 at 13:36
  • Couldn't you output to a file row by row? When dealing with large chunks of data, streaming is your best friend. I don't understand why you need to have everything as JSON at once. What are you going to import into Mongo? A file/stream/what? – Alberto Chiesa Apr 07 '20 at 14:36
  • I'd still suggest working out how to use `Utf8JsonReader` & `JsonReaderState` to parse chunks of data, and enumerate the top level json elements, then deserialise each element. No idea how to do that though... – Jeremy Lakeman Apr 07 '20 at 15:22

2 Answers

3

It is possible to deserialize a single JSON payload from a list of strings representing chunked JSON by constructing a ReadOnlySequence<byte> from the list, then constructing a Utf8JsonReader from the sequence, and finally deserializing using the reader via JsonSerializer.Deserialize<TValue>(Utf8JsonReader, JsonSerializerOptions).

The following is a minimal implementation:

public static partial class JsonExtensions
{
    public static TValue Deserialize<TValue>(IEnumerable<string> buffers, JsonSerializerOptions options = null)
    {
        // Encode the string chunks to UTF-8 lazily and forward the serializer options.
        return Deserialize<TValue>(ToByteArrayChunks(buffers), options);
    }

    public static TValue Deserialize<TValue>(IEnumerable<byte []> buffers, JsonSerializerOptions options = null)
    {
        var sequence = ReadOnlySequenceFactory.Create(buffers);
        var reader = new Utf8JsonReader(sequence, options.GetReaderOptions());                      
        return JsonSerializer.Deserialize<TValue>(ref reader, options);
    }

    public static JsonReaderOptions GetReaderOptions(this JsonSerializerOptions options)
    {
        if (options == null)
            return new JsonReaderOptions();
        else
            return new JsonReaderOptions
            {
                AllowTrailingCommas = options.AllowTrailingCommas,
                CommentHandling = options.ReadCommentHandling,
                MaxDepth = options.MaxDepth
            };          
    }

    static readonly Encoding encoding = new UTF8Encoding(false);

    static IEnumerable<byte []> ToByteArrayChunks(IEnumerable<string> buffers)
    {
        // By using an encoder we can handle the situation in which surrogate pairs embedded
        // in JSON string literals are split between chunks.
        var encoder = encoding.GetEncoder();
        foreach (var s in buffers)
        {
            ReadOnlySpan<char> charSpan = s;
            var count = encoder.GetByteCount(charSpan, false);
            var bytes = new byte[count];
            Span<byte> byteSpan = bytes;
            encoder.GetBytes(charSpan, byteSpan, false);
            yield return bytes;
        }
    }
}

public static class ReadOnlySequenceFactory
{
    // There is no public concrete implementation of ReadOnlySequenceSegment<T> so we must create one ourselves.
    // This is modeled on https://github.com/dotnet/runtime/blob/master/src/libraries/System.Text.Json/tests/BufferFactory.cs
    // by https://github.com/ahsonkhan
    class ReadOnlyMemorySegment<T> : ReadOnlySequenceSegment<T>
    {
        public static ReadOnlySequence<T> Create(IEnumerable<ReadOnlyMemory<T>> buffers)
        {
            ReadOnlyMemorySegment<T> first = null;
            ReadOnlyMemorySegment<T> current = null;
            foreach (var buffer in buffers)
            {
                var next = new ReadOnlyMemorySegment<T> { Memory = buffer };
                if (first == null)
                {
                    first = next;
                }
                else
                {
                    current.Next = next;
                    next.RunningIndex = current.RunningIndex + current.Memory.Length;
                }
                current = next;
            }
            if (first == null)
            {
                first = current = new ReadOnlyMemorySegment<T>();
            }

            return new ReadOnlySequence<T>(first, 0, current, current.Memory.Length);
        }
    }

    public static ReadOnlySequence<T> Create<T>(IEnumerable<T []> buffers)
    {
        return ReadOnlyMemorySegment<T>.Create(buffers.Select(b => new ReadOnlyMemory<T>(b)));
    }
}
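
Hooked up to the Dapper query from the question, usage could look something like this (a sketch; it keeps buffered: false so the string chunks are only held transiently while they are encoded into the ReadOnlySequence<byte>):

var jsonChunks = _SqlDb.Query<string>(usersWithDynamicDataQuery, buffered: false, commandTimeout: _SqlCommandTimeoutSeconds);
// No string.Concat: the chunks are encoded to UTF-8 one by one and deserialized directly.
var userDocuments = JsonExtensions.Deserialize<List<UserData>>(jsonChunks);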

Notes:

  • A ReadOnlySequence<T> is constructed from a linked list of ReadOnlySequenceSegment<T> objects -- but this type is abstract and .NET Core 3.1 doesn't seem to include a concrete public implementation. I modeled the implementation above on this one by Ahson Khan.

  • JsonSerializer is designed to deserialize from UTF-8 encoded byte sequences rather than from strings or character arrays, so if you can make your database access layer return a list of UTF-8 byte arrays rather than strings, you will get better performance and avoid the step of encoding each string chunk into bytes.

    If this isn't possible and your input is definitely a long list of smallish strings (2033 characters each), it might be worthwhile to investigate using memory or array pooling to allocate the necessary UTF-8 byte sequences; a rough sketch of that idea is shown after these notes.

  • While this approach avoids allocating a single, huge string or byte [], the entire JSON payload is nevertheless loaded into memory all at once as a sequence of chunks. Thus this is not a true streaming solution.

  • If you are interested in a true streaming solution and can access your JSON data directly as a Stream, you might look at this answer to Parsing a JSON file with .NET core 3.0/System.text.Json by mtosh.
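
As a rough illustration of the pooling idea, here is a sketch of a method that could sit alongside the extensions above. It is not part of this answer's tested code: the DeserializePooled name is made up, and it assumes a public Create(IEnumerable<ReadOnlyMemory<T>>) overload is added to ReadOnlySequenceFactory (forwarding to the existing ReadOnlyMemorySegment<T>.Create builder).

// Requires: using System; using System.Buffers; using System.Collections.Generic;
//           using System.Text; using System.Text.Json;
public static TValue DeserializePooled<TValue>(IReadOnlyCollection<string> chunks, JsonSerializerOptions options = null)
{
    var rented = new List<byte[]>(chunks.Count);
    try
    {
        var encoder = new UTF8Encoding(false).GetEncoder();
        var memories = new List<ReadOnlyMemory<byte>>(chunks.Count);
        foreach (var s in chunks)
        {
            ReadOnlySpan<char> chars = s;
            var count = encoder.GetByteCount(chars, false);
            var buffer = ArrayPool<byte>.Shared.Rent(count); // may be longer than count
            rented.Add(buffer);
            var written = encoder.GetBytes(chars, buffer, false);
            memories.Add(new ReadOnlyMemory<byte>(buffer, 0, written));
        }
        // Hypothetical overload taking ReadOnlyMemory<byte> chunks directly.
        var sequence = ReadOnlySequenceFactory.Create(memories);
        var reader = new Utf8JsonReader(sequence, options.GetReaderOptions());
        return JsonSerializer.Deserialize<TValue>(ref reader, options);
    }
    finally
    {
        // The deserialized object graph no longer references the buffers,
        // so they can all go back to the pool.
        foreach (var buffer in rented)
            ArrayPool<byte>.Shared.Return(buffer);
    }
}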

Demo fiddle here.

  • A true streaming solution would look similar to https://github.com/dotnet/runtime/blob/0b02d280a9d80b0ecdc858a948eb202e4be92653/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializer.Read.Stream.cs#L89. No idea how easy it would be to re-use parts of this solution. – Jeremy Lakeman Apr 09 '20 at 01:47
  • I think the trick would be to call `reader.TrySkip()` to ensure we have merged enough `ReadOnlyMemory` buffers for a whole object, then call `JsonSerializer.Deserialize`. – Jeremy Lakeman Apr 09 '20 at 02:47
  • After more investigation, the key trick will be to transform the `IEnumerable<...>` of buffers into the `IEnumerable<(ReadOnlySequence<byte> sequence, ReadOnlySpan<byte> span)>` of each array element, similar to: https://github.com/dotnet/runtime/blob/0b02d280a9d80b0ecdc858a948eb202e4be92653/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/JsonSerializer.Read.Utf8JsonReader.cs#L165 so the value can be deserialized. – Jeremy Lakeman Apr 09 '20 at 03:15
0

This is not the answer, but it may lead to it (apologies). I had a similar issue recently with HTTP, and the trick was not to build the intermediary string, as you have identified. I found that if I used a stream instead I could cut out the middle man entirely. Kev Dockx did some work in this area and has a useful NuGet package called Marvin.StreamExtensions for processing JSON. You need to produce a stream from your Query to make it work, however.....

var userDocuments = stream.ReadAndDeserializeFromJson<List<UserData>>();

Check out these links for foreach-based solutions. Is this Dapper? I've never used it, but the following might be helpful. Explanation of Dapper buffer/cache:

Is it possible to stream a large SQL Server database result set using Dapper?

GitHub stuff (produces a stream from Query), but as you are using `buffered: false`, pretty sure you can just foreach it (?): https://github.com/JocaPC/Dapper.Stream
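
For what it's worth, one way to "produce a stream from your Query" without ever building the intermediate string could be a small adapter like this (my own sketch, not something from Marvin or Dapper.Stream; the ChunkedStringStream name is made up):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;

// Read-only Stream over a lazy sequence of string chunks, encoding each chunk
// to UTF-8 on demand so only one chunk is held in memory at a time.
public sealed class ChunkedStringStream : Stream
{
    readonly IEnumerator<string> _chunks;
    readonly Encoder _encoder = new UTF8Encoding(false).GetEncoder();
    byte[] _current = Array.Empty<byte>();
    int _offset;

    public ChunkedStringStream(IEnumerable<string> chunks) => _chunks = chunks.GetEnumerator();

    public override int Read(byte[] buffer, int offset, int count)
    {
        // Refill the local buffer from the next chunk whenever it runs dry.
        while (_offset == _current.Length)
        {
            if (!_chunks.MoveNext())
                return 0; // end of stream
            ReadOnlySpan<char> chars = _chunks.Current;
            _current = new byte[_encoder.GetByteCount(chars, false)];
            _encoder.GetBytes(chars, _current, false);
            _offset = 0;
        }
        var n = Math.Min(count, _current.Length - _offset);
        Array.Copy(_current, _offset, buffer, offset, n);
        _offset += n;
        return n;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _chunks.Dispose();
        base.Dispose(disposing);
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}

// Usage sketch: keep buffered: false so Dapper yields rows lazily.
// var jsonChunks = _SqlDb.Query<string>(usersWithDynamicDataQuery, buffered: false);
// using var stream = new ChunkedStringStream(jsonChunks);
// var users = await JsonSerializer.DeserializeAsync<List<UserData>>(stream);
// or, with the NuGet package above: var users = stream.ReadAndDeserializeFromJson<List<UserData>>();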