
I'm calling an API that returns its responses as JSON objects. One of the members of the JSON objects can have a really long (10MiB to 3GiB+) base-64 encoded value. For example:

{
    "name0": "value0",
    "name1": "value1",
    "data": "(very very long base-64 value here)",
    "name2": "value2"
}

I need the data and the other names/values from the body. How do I get this data?

I'm currently using Newtonsoft.Json to (de)serialize JSON data in this application, and for smaller chunks of data, I would usually have a Data property of type byte[], but this data can be more than 2GiB and even if it's smaller than that, there may be so many responses coming back that we could run out of memory.

I'm hoping there is a way to write a custom JsonConverter or something to serialize/deserialize the data gradually as a System.IO.Stream, but I'm not sure how to read a single string "token" that cannot itself fit into memory. Any suggestions?

Rand Random
  • What do you do with this data? – Etienne de Martel Jul 20 '23 at 14:12
  • It sounds like the wrong storage format was used. Even if the objects are stored as JSON-per-line, a 10MB Base64 string is already too big. You'll have to use JsonReader in JSON.NET or, better yet, the equivalent in System.Text.Json to consume JSON elements as they're read. I mention STJ because it tries to reduce allocations to a minimum, something really important in this case. – Panagiotis Kanavos Jul 20 '23 at 14:14
  • Yeah ... sending binary data via JSON ... – Selvin Jul 20 '23 at 14:14
  • Sending binary as JSON isn't great. HTTP has supported binary bodies since the 1980s, or whenever FORM POST was invented, and headers allow sending extra information along with the response body. All browsers support this, and you can even retrieve just part of a large file by requesting a range in the headers. – Panagiotis Kanavos Jul 20 '23 at 14:18
  • Deserializing to a Stream doesn't make a difference, as the data has to be stored somewhere. You may try some streaming API with a (probably custom) JsonTextReader, parsing the data as it arrives (that's why Etienne de Martel asked his question) and consuming it as soon as possible, so that the only storage is the TCP buffer. But that requires writing a tool that translates the Base64 stream to a byte stream on the fly. – Selvin Jul 20 '23 at 14:25
  • You may want to use XML instead of JSON. XML can work with huge XML files. – jdweng Jul 20 '23 at 14:25
  • @EtiennedeMartel Ultimately, I need to pass the data to another API as a `System.IO.Stream`. This other API is proprietary and I access it through a .NET library, so I don't know how that transfer actually works at the HTTP level. Even if I could just save the response's content to a file, that would be sufficient. – Sean Killian Jul 20 '23 at 14:27
  • Unfortunately, neither JSON.NET's JsonTextReader nor System.Text.Json's Utf8JsonReader has a method that retrieves a node as a stream. All the byte-related methods return the entire content at once. – Panagiotis Kanavos Jul 20 '23 at 14:30
  • `pass the data to another API as a System.IO.Stream`: how much of the original request do you have to pass? All of it? In that case you could copy from the request stream from which you receive the request directly to the request stream that you make to that other API. You won't be able to read the content yourself, though, without deserializing it. If you only want to create a reverse proxy for that other service, you can use YARP configured for [direct forwarding](https://microsoft.github.io/reverse-proxy/articles/direct-forwarding.html). – Panagiotis Kanavos Jul 20 '23 at 14:33
  • @Selvin I understand that the data has to be stored somewhere. My hope is that if I could get the content as some sort of stream, then I could read it in chunks with `Stream.Read` and store them to a file or forward them directly to the API that I'm ultimately calling so I didn't have to load the entire content into memory all at once. – Sean Killian Jul 20 '23 at 14:35
  • Thanks, @PanagiotisKanavos, Unfortunately the JSON request that I am receiving is in a different format than the other API that I'm passing data to. I have to interact with this other API via a .NET library, so I'm not sure what underlying content type it's using. – Sean Killian Jul 20 '23 at 14:37
  • https://github.com/apache/commons-codec/blob/master/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java ... yeah, Java code, but it gives you the idea: you are wrapping another stream. Unfortunately you would have to go deeper under the Newtonsoft implementation to get at its private buffers. Also, the stream passed to the Base64 stream would have to search for the closing `"` and signal EOF there (and anything already read past it would have to be stored, probably in the original JsonReader buffer, so it could parse the next token). Not sure it is even possible ... – Selvin Jul 20 '23 at 14:55
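The incremental Base64 decoding Selvin describes does not need to be written from scratch in .NET: FromBase64Transform wrapped in a read-mode CryptoStream already decodes on the fly, chunk by chunk. A minimal sketch follows, using a MemoryStream as a stand-in for the network stream; as the comment notes, carving the quoted string value out of the surrounding JSON before handing it to the decoder remains the hard part.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Sketch: decode Base64 incrementally with the BCL's built-in stream decoder.
// A MemoryStream stands in for the network stream here; in the real scenario
// the decoder would need to be fed only the bytes between the JSON quotes.
var original = new byte[100_000];
new Random(42).NextBytes(original);
var base64Bytes = Encoding.ASCII.GetBytes(Convert.ToBase64String(original));

using var source = new MemoryStream(base64Bytes);
using var decoder = new CryptoStream(source, new FromBase64Transform(), CryptoStreamMode.Read);

using var output = new MemoryStream();
decoder.CopyTo(output, 8192); // reads and decodes in 8 KiB chunks

Console.WriteLine(output.ToArray().AsSpan().SequenceEqual(original)); // True
```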

1 Answer


A 3GiB+ string value is too large to fit in a .NET string, as it exceeds the maximum .NET string length. Thus you cannot use Json.NET to read your JSON response, because Json.NET's JsonTextReader always fully materializes property values as it reads, even when skipping them.

As for deserializing to a Stream or byte[] array, as noted in the comments by Panagiotis Kanavos:

Neither JSON.NET's JsonTextReader nor System.Text.Json's Utf8JsonReader has a method that retrieves a node as a stream. All the byte-related methods return the entire content at once.

Thus for sufficiently large data values you will exceed the maximum .NET array length.

So what are your options?

Firstly, I would encourage you to try to change the response format. JSON isn't an ideal format for huge Base64-encoded property values because, in general, most JSON serializers will fully materialize each property. Instead, as suggested by Panagiotis Kanavos, send the binary data in the response body and the remaining properties as custom headers. Or see HTTP response with both binary data and JSON for additional options. If you do that, you will be able to copy directly from the response body stream to some intermediate stream.
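That first option might look like the following sketch. The header names (X-Name0, X-Name1) and the locally constructed HttpResponseMessage are invented stand-ins; a real client would obtain the response from HttpClient.SendAsync with HttpCompletionOption.ResponseHeadersRead so the body is never buffered in memory.

```csharp
using System;
using System.IO;
using System.Net.Http;

// Sketch of the "binary body + metadata headers" format. In production the
// response would come from:
//   await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead)
// Here a local HttpResponseMessage stands in for that response.
var payload = new byte[256 * 1024];
new Random(1).NextBytes(payload);

using var response = new HttpResponseMessage
{
    Content = new ByteArrayContent(payload) // stand-in for the streamed body
};
response.Headers.Add("X-Name0", "value0"); // hypothetical metadata headers
response.Headers.Add("X-Name1", "value1");

// Small values come from headers; the large value is copied straight to disk.
string name0 = string.Join("", response.Headers.GetValues("X-Name0"));
string tempFile = Path.GetTempFileName();
using (var body = response.Content.ReadAsStream())
using (var file = File.Create(tempFile))
    body.CopyTo(file);

Console.WriteLine($"{name0}, {new FileInfo(tempFile).Length} bytes"); // value0, 262144 bytes
File.Delete(tempFile);
```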

Secondly, you could attempt to generalize the code from this answer by mtosh to Parsing a JSON file with .NET core 3.0/System.text.Json. That answer shows how to iterate through a stream token-by-token using Utf8JsonReader from System.Text.Json. You could attempt to rewrite that answer to support reading of individual string values incrementally -- however I must admit that I do not know whether Utf8JsonReader actually supports reading portions of a property value in chunks without loading the entire value. As such, I can't recommend this approach.
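For reference, the token-by-token pattern from that answer looks roughly like the sketch below: feed Utf8JsonReader from the stream in small chunks and resume with JsonReaderState whenever a token spans a chunk boundary. It also illustrates the limitation just described, since GetString() still materializes each string value whole.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;

var json = Encoding.UTF8.GetBytes("""{"name0":"value0","data":"AAAA","name2":"value2"}""");
using var stream = new MemoryStream(json);

var names = new List<string>();
var state = new JsonReaderState();
var buffer = new byte[16]; // deliberately tiny to force resumption;
                           // must still hold the largest single token
int leftover = 0;
while (true)
{
    int read = stream.Read(buffer, leftover, buffer.Length - leftover);
    bool final = read == 0;
    var reader = new Utf8JsonReader(buffer.AsSpan(0, leftover + read), final, state);
    while (reader.Read())
        if (reader.TokenType == JsonTokenType.PropertyName)
            names.Add(reader.GetString());
    state = reader.CurrentState;
    // Shift the bytes of any partially read token to the front of the buffer.
    leftover = leftover + read - (int)reader.BytesConsumed;
    buffer.AsSpan((int)reader.BytesConsumed, leftover).CopyTo(buffer);
    if (final) break;
}
Console.WriteLine(string.Join(",", names)); // name0,data,name2
```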

Thirdly, you could adopt the approach from this answer to JsonConvert Deserialize Object out of memory exception and use the reader returned by JsonReaderWriterFactory.CreateJsonReader() to manually parse your JSON. This factory returns an XmlDictionaryReader that transcodes from JSON to XML on the fly, and thus supports incremental reading of Base64 properties via XmlReader.ReadContentAsBase64(Byte[], Int32, Int32). This is the reader used by WCF's DataContractJsonSerializer which is not recommended for new development, but has been ported to .NET Core, so can be used when no other options present themselves.
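Before wiring this into a full model, the key API can be exercised in isolation. The sketch below (with a small generated payload standing in for the huge one) positions the transcoding reader on the "data" property and drains it with ReadElementContentAsBase64 in 1 KiB chunks:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;

var original = new byte[10_000];
new Random(7).NextBytes(original);
var json = Encoding.UTF8.GetBytes(
    $$"""{"name0":"value0","data":"{{Convert.ToBase64String(original)}}"}""");

using var input = new MemoryStream(json);
using var reader = JsonReaderWriterFactory.CreateJsonReader(input, XmlDictionaryReaderQuotas.Max);
using var output = new MemoryStream();

reader.ReadToFollowing("data"); // position on the JSON "data" property
var buffer = new byte[1024];
int read;
while ((read = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
    output.Write(buffer, 0, read); // only 1 KiB is decoded at a time

Console.WriteLine(output.ToArray().AsSpan().SequenceEqual(original)); // True
```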

So, how would this work? First define a model corresponding to your JSON as follows, with your Data property represented as a Stream:

public partial class Model : IDisposable
{
    Stream data;

    public string Name0 { get; set; }
    public string Name1 { get; set; }
    [System.Text.Json.Serialization.JsonIgnore] // Added for debugging purposes
    public Stream Data { get => data; set => this.data = value; }
    public string Name2 { get; set; }
    
    public virtual void Dispose() => Interlocked.Exchange(ref data, null)?.Dispose();
}

Next, define the following helper methods:

public static class JsonReaderWriterExtensions
{
    const int BufferSize = 8192;
    private static readonly Microsoft.IO.RecyclableMemoryStreamManager manager = new ();

    public static Stream CreateTemporaryStream() => 
        // Create some temporary stream to hold the deserialized binary data.  
        // Could be a FileStream created with FileOptions.DeleteOnClose or a Microsoft.IO.RecyclableMemoryStream
        // File.Create(Path.GetTempFileName(), BufferSize, FileOptions.DeleteOnClose);
        manager.GetStream();
    
    public static T DeserializeModelWithStreams<T>(Stream inputStream) where T : new() =>
        PopulateModelWithStreams(inputStream, new T());

    public static T PopulateModelWithStreams<T>(Stream inputStream, T model)
    {
        ArgumentNullException.ThrowIfNull(inputStream);
        ArgumentNullException.ThrowIfNull(model);

        var type = model.GetType();
        
        using (var reader = JsonReaderWriterFactory.CreateJsonReader(inputStream, XmlDictionaryReaderQuotas.Max))
        {
            // TODO: Stream-valued properties not at the root level.
            if (reader.MoveToContent() != XmlNodeType.Element)
                throw new XmlException();
            while (reader.Read() && reader.NodeType != XmlNodeType.EndElement)
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        var name = reader.LocalName;
                        // TODO:
                        // Here we could use DataMemberAttribute.Name or other attributes to build a contract mapping the type to the JSON.
                        var property = type.GetProperty(name, BindingFlags.IgnoreCase | BindingFlags.Public | BindingFlags.Instance);
                        if (property == null || !property.CanWrite || property.GetIndexParameters().Length > 0 || Attribute.IsDefined(property, typeof(IgnoreDataMemberAttribute)))
                            continue;
                        // Deserialize the value
                        using (var subReader = reader.ReadSubtree())
                        {
                            subReader.MoveToContent();
                            if (typeof(Stream).IsAssignableFrom(property.PropertyType))
                            {
                                var streamValue = CreateTemporaryStream();  
                                byte[] buffer = new byte[BufferSize];
                                int readBytes = 0;
                                while ((readBytes = subReader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
                                    streamValue.Write(buffer, 0, readBytes);
                                if (streamValue.CanSeek)
                                    streamValue.Position = 0;
                                property.SetValue(model, streamValue);
                            }
                            else
                            {
                                var settings = new DataContractJsonSerializerSettings
                                {
                                    RootName = name,
                                    // Modify other settings as required e.g. DateTimeFormat.
                                };
                                var serializer = new DataContractJsonSerializer(property.PropertyType, settings);
                                var value = serializer.ReadObject(subReader);
                                if (value != null)
                                    property.SetValue(model, value);
                            }
                        }
                        Debug.Assert(reader.NodeType == XmlNodeType.EndElement);
                        break;
                    default:
                        reader.Skip();
                        break;
                }
            }
        }

        return model;
    }
}

And now you could deserialize your model as follows:

using var model = JsonReaderWriterExtensions.DeserializeModelWithStreams<Model>(responseStream);

Notes:

  1. Since the value of data may be arbitrarily large, you cannot deserialize its contents into a MemoryStream.

    The demo code above uses RecyclableMemoryStream, but you could change CreateTemporaryStream() to return a FileStream instead (for example, one created with FileOptions.DeleteOnClose). Either way, you will need to dispose of the stream after you are done.

  2. I am using reflection to bind C# properties to JSON properties by name, ignoring case. For properties whose type is not Stream, I am using DataContractJsonSerializer to deserialize their values. This serializer has many quirks, such as a funky default DateTime format, so you may need to adjust your DataContractJsonSerializerSettings or deserialize certain properties manually.
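The DateTime quirk is easy to demonstrate: out of the box, DataContractJsonSerializer uses the legacy "\/Date(...)\/" representation rather than ISO 8601, which is one reason the settings may need tuning for your payloads.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;

// Serialize a DateTime with default settings to show the legacy format.
using var stream = new MemoryStream();
new DataContractJsonSerializer(typeof(DateTime))
    .WriteObject(stream, new DateTime(2023, 7, 20, 0, 0, 0, DateTimeKind.Utc));

string json = Encoding.UTF8.GetString(stream.ToArray());
Console.WriteLine(json); // "\/Date(1689811200000)\/"
```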

  3. My method JsonReaderWriterExtensions.DeserializeModelWithStreams() only supports Stream-valued properties at the root level. If you have nested huge Base64-valued properties, you will need to rewrite JsonReaderWriterExtensions.PopulateModelWithStreams() to be recursive (which would basically amount to writing your own serializer).

  4. For a discussion of how the reader returned by JsonReaderWriterFactory transcodes from JSON to XML, see Efficiently replacing properties of a large JSON using System.Text.Json and Mapping Between JSON and XML.
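A quick probe of that JSON-to-XML mapping: reading the transcoded document back out as XML text shows how object members become elements carrying "type" attributes, which is the structure PopulateModelWithStreams() walks. (The exact XML shown is what I would expect from the documented mapping; verify against your runtime.)

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;

var json = Encoding.UTF8.GetBytes("""{"name0":"value0","data":"AAAA"}""");
using var reader = JsonReaderWriterFactory.CreateJsonReader(
    new MemoryStream(json), XmlDictionaryReaderQuotas.Max);
reader.MoveToContent();
string xml = reader.ReadOuterXml();
Console.WriteLine(xml);
// Roughly: <root type="object"><name0 type="string">value0</name0>...</root>
```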

Demo fiddle here.

dbc