
I'm trying to convert a huge JSON file (2 GB) to an XML file, and I'm having trouble reading the huge JSON file.

I've been researching how I can read huge JSON files.

I found this:

Out of memory exception while loading large json file from disk

How to parse huge JSON file as stream in Json.NET?

Parsing large json file in .NET

It may seem that I'm duplicating my question, but I have some problems that aren't solved in those posts.

So, I need to load the huge JSON file, and the community proposes something like this:

using (StreamReader sr = new StreamReader("foo.json"))
using (JsonTextReader reader = new JsonTextReader(sr))
{
    var serializer = new JsonSerializer();
    reader.SupportMultipleContent = true;

    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartObject)
        {
            // Deserialize each object from the stream individually and process it
            var o = serializer.Deserialize<MyObject>(reader);

            //Do something with the object
        }
    }
}

So we can read the file in parts and deserialize the objects one by one.

Here is my code:

JsonSerializer serializer = new JsonSerializer();

string hugeJson = "hugJSON.json";
using (FileStream s = File.Open(hugeJson, FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    reader.SupportMultipleContent = true;
    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartObject)
        {
            var jsonObject = serializer.Deserialize(reader);
            string xmlString = "";

            XmlDocument doc = JsonConvert.DeserializeXmlNode(jsonObject.ToString(), "json");

            using (var stringWriter = new StringWriter())
            using (var xmlTextWriter = XmlWriter.Create(stringWriter))
            {
                doc.WriteTo(xmlTextWriter);
                xmlTextWriter.Flush();
                xmlString = stringWriter.GetStringBuilder().ToString();
            }
        }
    }
}


But when I call doc.WriteTo(xmlTextWriter), an exception of type System.OutOfMemoryException is thrown.

I've also been trying BufferedStream. That class lets me handle big files, but I have another problem.

I'm reading into a byte[] buffer. When I convert it to a string, the JSON gets split and I can't parse it to an XML file because characters are missing,

for example:

{ foo:[{
   foo:something,
   foo1:something,
   foo2:something
},
{
   foo:something,
   foo:som 

it is cut off.
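The truncation above is exactly what happens when a fixed-size byte buffer is decoded chunk by chunk: a multi-byte UTF-8 sequence can be cut at the buffer boundary. A minimal illustration of the effect and of the fix (my own sketch, not from the original post), using `System.Text.Decoder`, which carries an incomplete sequence over to the next call:

```csharp
using System;
using System.Text;

class Utf8SplitDemo
{
    public static void Main()
    {
        // "é" is two bytes in UTF-8, so "café" is 5 bytes in total.
        byte[] bytes = Encoding.UTF8.GetBytes("café");

        // Decoding a window that ends mid-character mangles the last character.
        string broken = Encoding.UTF8.GetString(bytes, 0, bytes.Length - 1);
        Console.WriteLine(broken); // final character is lost/replaced

        // A Decoder is stateful: it buffers the partial sequence between calls,
        // so chunked reads reassemble the character correctly.
        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[8];
        int n1 = decoder.GetChars(bytes, 0, bytes.Length - 1, chars, 0);
        int n2 = decoder.GetChars(bytes, bytes.Length - 1, 1, chars, n1);
        Console.WriteLine(new string(chars, 0, n1 + n2)); // "café"
    }
}
```

A `StreamReader` uses a `Decoder` internally, which is one reason reading text through `StreamReader` instead of raw `byte[]` chunks avoids this problem.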

Is there any way to read a huge JSON file and convert it to XML without loading the JSON in parts? Or I could load and convert in parts, but I don't know how to do that.

Any ideas?

UPDATE:

I have been trying with this code:

static void Main(string[] args)
{
    string json = "";
    string xmlString = "";
    string pathJson = "foo.json";
    // Read file
    string temp = "";
    using (FileStream fs = new FileStream(pathJson, FileMode.Open))
    using (BufferedStream bf = new BufferedStream(fs))
    {
        byte[] array = new byte[70000];
        int read;
        while ((read = bf.Read(array, 0, 70000)) != 0)
        {
            json = Encoding.UTF8.GetString(array, 0, read);
            temp = String.Concat(temp, json);
        }
    }

    XmlDocument doc = JsonConvert.DeserializeXmlNode(temp, "json");

    using (var stringWriter = new StringWriter())
    using (var xmlTextWriter = XmlWriter.Create(stringWriter))
    {
        doc.WriteTo(xmlTextWriter);
        xmlTextWriter.Flush();
        xmlString = stringWriter.GetStringBuilder().ToString();
    }

    File.WriteAllText("outputPath", xmlString);
}

This code converts a JSON file to an XML file, but when I try to convert a big JSON file (2 GB), it fails. The process takes a lot of time, and a string doesn't have the capacity to hold all the JSON. How can I store it? Is there any way to do this conversion without using the string data type?

UPDATE: The JSON format is:

[{
    'key':[some things],
    'data': [some things],
    'data1':[A LOT OF ENTRIES],
    'data2':[A LOT OF ENTRIES],
    'data3':[some things],
    'data4':[some things]
}]
Maverick94
  • Try avoiding in-memory I/O, such as `StringWriter` and output all the chunks to a file stream. You can keep appending to that file stream, no need of a new one for every chunk. If you can avoid deserializing altogether and instead read tokens and output elements that would help big time too – Sten Petrov May 22 '19 at 15:41
  • 1) What are you going to do with `xmlString` after generating it? You already have the `XmlDocument doc` representation, why do you need `xmlString` as well? 2) Can you please [edit] your question to share a JSON sample? – dbc May 22 '19 at 17:09
  • @dbc `xmlString` is worthless in this code. 2) Why did you need a JSON sample? i can't use a data model. The program must read any big JSON. – Maverick94 May 23 '19 at 08:55
  • @StenPetrov Is any way to do the conversion without use the datatype string? – Maverick94 May 23 '19 at 09:28
  • @Maverick94 - it would be helpful to have a JSON sample to ensure that our proposed answers work with your actual JSON. Now that you've simplified your code in a recent edit, that's less necessary. – dbc May 23 '19 at 10:10
  • For instance, you might run into the problem described in [XmlNodeConverter can only convert JSON that begins with an object](https://stackoverflow.com/q/48786123). – dbc May 23 '19 at 10:24
  • @dbc Finally, I attached a Json Format. – Maverick94 May 23 '19 at 11:20
  • Maybe you should consider Cinchoo ETL; it may help. Can you paste sample input JSON and expected XML? – Cinchoo May 23 '19 at 13:23
  • 1) What are `data1` and `data2`? Are they primitives like strings or numbers, or are they objects? 2) The outer JSON container is an array which has a single entry. Could it have multiple entries? Is each entry in the outer array 2 GB? 3) Does your JSON file have a fixed schema or is it completely variable? Json.NET doesn't have a built-in method to do a purely streaming transformation from JSON to XML so you may need to code something specific for your data format. – dbc May 23 '19 at 17:55

1 Answer


Out-of-memory exceptions in .Net can be caused by several problems including:

  1. Allocating too much total memory.

    If this might be happening, check whether you are running in 64-bit mode as described here. If not, rebuild in 64-bit mode as described here and re-test.

  2. Allocating too many objects on the large object heap causing memory fragmentation.

  3. Allocating a single object that is larger than the .Net object size limit.

  4. Failing to dispose of unmanaged memory (not applicable here).

In your case, you may be trying to allocate too much total memory but are definitely allocating three very large objects: the in-memory temp JSON string, the in-memory xmlString XML string and the in-memory stringWriter.

You can substantially reduce your memory footprint and completely eliminate these objects by constructing an XDocument or XmlDocument directly via a streaming translation from the JSON file. Then afterward, write the document directly to the XML file using XDocument.Save() or XmlDocument.Save().

To do this, you will need to allocate your own XmlNodeConverter, then construct a JsonSerializer using it and deserialize as shown in Deserialize JSON from a file. The following method(s) do the trick:

public static partial class JsonExtensions
{
    public static XDocument LoadXNode(string pathJson, string deserializeRootElementName)
    {
        using (var stream = File.OpenRead(pathJson))
            return LoadXNode(stream, deserializeRootElementName);
    }

    public static XDocument LoadXNode(Stream stream, string deserializeRootElementName)
    {
        // Let caller dispose the underlying streams.
        using (var textReader = new StreamReader(stream, Encoding.UTF8, true, 1024, true))
            return LoadXNode(textReader, deserializeRootElementName);
    }

    public static XDocument LoadXNode(TextReader textReader, string deserializeRootElementName)
    {
        var settings = new JsonSerializerSettings 
        { 
            Converters = { new XmlNodeConverter { DeserializeRootElementName = deserializeRootElementName } },
        };
        using (var jsonReader = new JsonTextReader(textReader) { CloseInput = false })
            return JsonSerializer.CreateDefault(settings).Deserialize<XDocument>(jsonReader);
    }

    public static void StreamJsonToXml(string pathJson, string pathXml, string deserializeRootElementName, SaveOptions saveOptions = SaveOptions.None)
    {
        var doc = LoadXNode(pathJson, deserializeRootElementName);
        doc.Save(pathXml, saveOptions);
    }
}

Then use them as follows:

JsonExtensions.StreamJsonToXml(pathJson, outputPath, "json");

Here I am using XDocument instead of XmlDocument because I believe (but have not checked personally) that it uses less memory, e.g. as reported in Some hard numbers about XmlDocument, XDocument and XmlReader (x86 versus x64) by Ken Lassesen.

This approach eliminates the three large objects mentioned previously and substantially reduces the chance of running out of memory due to problems #2 or #3.

Demo fiddle here.


If you are still running out of memory even after ensuring you are running in 64-bit mode and streaming directly from and to your file(s) using the methods above, then it may simply be that your XML is too large to fit in your computer's virtual memory space using XDocument or XmlDocument. If that is so, you will need to adopt a pure streaming solution that transforms from JSON to XML on the fly as it streams. Unfortunately, Json.NET does not provide this functionality out of the box, so you will need a more complex solution.

So, what are your options?

  1. You could fork your own version of XmlNodeConverter.cs and rewrite ReadElement(JsonReader reader, IXmlDocument document, IXmlNode currentNode, string propertyName, XmlNamespaceManager manager) to write directly to an XmlWriter instead of an IXmlDocument.

    While probably doable with a couple days effort, the difficulty would seem to exceed that of a single stackoverflow answer.

  2. You could use the reader returned by JsonReaderWriterFactory to translate JSON to XML on the fly, and pass that reader directly to XmlWriter.WriteNode(XmlReader). The readers and writers returned by this factory are used internally by DataContractJsonSerializer but can be used directly as well.

  3. If your JSON has a fixed schema (which is unclear from your question) you have many more straightforward options. Incrementally deserializing to some c# data model as shown in Parsing large json file in .NET and re-serializing that model to XML is likely to use much less memory than loading into some generic DOM such as XDocument.
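For illustration, option #3 might look roughly like the sketch below. It assumes a hypothetical `Entry` model with only `key` and `data1` as string arrays; the names and shapes are guesses based on the JSON sample in the question, not part of the original answer. Each entry of the outer array is deserialized and written to the `XmlWriter` individually, so only one entry is held in memory at a time:

```csharp
using System;
using System.IO;
using System.Xml;
using Newtonsoft.Json;

// Hypothetical model for one entry of the outer JSON array.
public class Entry
{
    public string[] key { get; set; }
    public string[] data1 { get; set; }
}

public static class StreamingConverter
{
    public static void JsonArrayToXml(TextReader input, XmlWriter output)
    {
        var serializer = new JsonSerializer();
        output.WriteStartElement("json");
        using (var reader = new JsonTextReader(input))
        {
            while (reader.Read())
            {
                // Deserialize one array entry at a time and emit its XML
                // immediately; the entry is then eligible for collection.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    var entry = serializer.Deserialize<Entry>(reader);
                    output.WriteStartElement("Root");
                    foreach (var k in entry.key ?? new string[0])
                        output.WriteElementString("key", k);
                    foreach (var d in entry.data1 ?? new string[0])
                        output.WriteElementString("data1", d);
                    output.WriteEndElement();
                }
            }
        }
        output.WriteEndElement();
    }
}
```

With this shape, memory use is proportional to the largest single entry rather than to the whole file, which is the point of option #3.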

Option #2 can be implemented very simply, as follows:

using (var stream = File.OpenRead(pathJson))
using (var jsonReader = JsonReaderWriterFactory.CreateJsonReader(stream, XmlDictionaryReaderQuotas.Max))
{
    using (var xmlWriter = XmlWriter.Create(outputPath))
    {
        xmlWriter.WriteNode(jsonReader, true);
    }
}

However, the XML thereby produced is much less pretty than the XML generated by XmlNodeConverter. For instance, given the simple input JSON

{"Root":[{
    "key":["a"],
    "data": [1, 2]
}]}

XmlNodeConverter will create the following XML:

<json>
  <Root>
    <key>a</key>
    <data>1</data>
    <data>2</data>
  </Root>
</json>

While JsonReaderWriterFactory will create the following (indented for clarity):

<root type="object">
  <Root type="array">
    <item type="object">
      <key type="array">
        <item type="string">a</item>
      </key>
      <data type="array">
        <item type="number">1</item>
        <item type="number">2</item>
      </data>
    </item>
  </Root>
</root>

The exact format of the XML generated can be found in Mapping Between JSON and XML.

Still, once you have valid XML, there are streaming XML-to-XML transformation solutions that will allow you to transform the generated XML to your final, desired format.

Is it possible to do the other way?

Unfortunately

JsonReaderWriterFactory.CreateJsonWriter().WriteNode(xmlReader, true);

isn't really suited for conversion of arbitrary XML to JSON as it only allows for conversion of XML with the precise schema specified by Mapping Between JSON and XML.

Furthermore, when converting from arbitrary XML to JSON the problem of array recognition exists: JSON has arrays, XML doesn't, it only has repeating elements. To recognize repeating elements (or tuples of elements where identically named elements may not be adjacent) and convert them to JSON array(s) requires buffering either the XML input or the JSON output (or a complex two-pass algorithm). Mapping Between JSON and XML avoids the problem by requiring type="object" or type="array" attributes.
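To make the restriction concrete, here is a small sketch (mine, not part of the original discussion) that converts XML already conforming to the mapping back to JSON. Note the mandatory `root` element name and the `type` attributes; without them the writer cannot tell objects from arrays:

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Json;
using System.Text;
using System.Xml;

public static class XmlToJsonDemo
{
    // Converts XML that follows the WCF JSON/XML mapping into JSON text.
    public static string Convert(string mappedXml)
    {
        using (var output = new MemoryStream())
        {
            using (var xmlReader = XmlReader.Create(new StringReader(mappedXml)))
            using (var jsonWriter = JsonReaderWriterFactory.CreateJsonWriter(output))
            {
                // WriteNode streams the XML straight through to JSON.
                jsonWriter.WriteNode(xmlReader, true);
            }
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }

    public static void Main()
    {
        string xml = "<root type=\"object\"><Root type=\"array\">"
                   + "<item type=\"number\">1</item><item type=\"number\">2</item>"
                   + "</Root></root>";
        Console.WriteLine(Convert(xml)); // emits the JSON equivalent, e.g. a "Root" array
    }
}
```

Feeding this method XML that does not follow the mapping (arbitrary element names, no `type` attributes) will not produce the JSON you expect, which is precisely the limitation described above.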

dbc
  • I tryied with this. I've been able to convert some JSON Files bigger than before. But i still having memory error. I can't convert JSON with 2GB. The program throws me `System.OutOfMemoryException` in this line: `return JsonSerializer.CreateDefault(settings).Deserialize(jsonReader);` – Maverick94 May 23 '19 at 11:06
  • @Maverick94 Be sure you are running in 64 bit. See [What is the purpose of the “Prefer 32-bit” setting in Visual Studio and how does it actually work?](https://stackoverflow.com/q/12066638), [How to determine if a .NET assembly was built for x86 or x64?](https://stackoverflow.com/q/270531) and [Force C# app to compile as x64 instead of AnyCpu](https://stackoverflow.com/q/4414567). – dbc May 23 '19 at 17:55
  • @dbc the objective is to not have memory usage dependent on the input and here the JsonSerializer will still create an object that contains the whole data – Sten Petrov May 23 '19 at 18:24
  • @StenPetrov - it could be that OP is simply running out of memory, but it also could be that OP is allocating a single object that is too large. This answer prevents allocation of unnecessary huge strings in memory thereby preventing the second possibility. If OP really needs true streaming with fixed memory overhead then a more complex solution is required -- but let's check to see whether OP really needs this first. – dbc May 23 '19 at 20:01
  • @dbc Thanks for your answer. I've tryied with `JsonReaderWriterFactory.CreateJsonReader(stream, XmlDictionaryReaderQuotas.Max)` it works! I can convert from json file to xml file. So, My option was #2 and it's solved. – Maverick94 May 24 '19 at 08:09
  • @dbc Is it possible to do the other way? I mean from a big XML file from JSON. Exists `JsonReaderWriterFactory.CreateJsonReader(stream, XmlDictionaryReaderQuotas.Max)` that return a `JsonWriter`? – Maverick94 May 29 '19 at 14:58
  • @Maverick94 - As I recall `JsonReaderWriterFactory.CreateJsonWriter().WriteNode(xmlReader, true);` doesn't allow for conversion of arbitrary XML to JSON, it only allows for conversion of XML with the precise schema specified by [Mapping Between JSON and XML](https://learn.microsoft.com/en-us/dotnet/framework/wcf/feature-details/mapping-between-json-and-xml). – dbc May 29 '19 at 18:09
  • @Maverick94 - Also when converting from arbitrary XML to JSON the problem of *array recognition* exists: JSON has arrays, XML doesn't, it only has repeating elements. To recognize repeating elements and convert them to a JSON array requires buffering either the entire XML input or the entire JSON output (or a complex two-pass algorithm). [Mapping Between JSON and XML](https://learn.microsoft.com/en-us/dotnet/framework/wcf/feature-details/mapping-between-json-and-xml) avoids the problem by requiring `type="object"` or `type="array"` attributes. – dbc May 29 '19 at 18:11