I need help converting a huge XML file to JSON.

I've been researching this topic and found these:

How to convert JSON to XML or XML to JSON?

Reading large XML documents in .net

Reading and manpulating large xml of 1 GB

The easy way is something like this:

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
string jsonText = JsonConvert.SerializeXmlNode(doc);

But I can't use it because my file is huge (2 GB), so I get an OutOfMemoryException.

So I need another way to read the large file. I've been using this approach:

    using (XmlReader xr = XmlReader.Create(inputPath))
    {
        while (xr.Read())
        {
            switch (xr.NodeType)
            {
                case XmlNodeType.Element:
                    // Do things
                    break;
                case XmlNodeType.Text:
                    // Do things
                    break;
                case XmlNodeType.EndElement:
                    // Do things
                    break;
            }
        }
    }

I read the XML file and convert it to JSON by concatenating strings tag by tag. But it's convoluted, extremely inefficient, and it doesn't work correctly.

While researching, I found LINQ to XML, but I don't know how to use it. I think it is good for manipulating and filtering a huge XML file, but I need to read the whole file.

My Xml file looks like:

<?xml version="1.0" encoding="utf-8"?>
<root>
   <key>
      <item> value </item>
      <item> value2 </item>
      <item> value3 </item>
   </key>

   <id>1</id>
   <name>Foo</name>

   <hugeArray> <!-- This array has around 12 million entries. Here is my problem. -->
     <item>
        <direction> </direction>
        <companyId> </companyId>
        <nameId> </nameId>
     </item>
     <item>
        <direction> </direction>
        <companyId> </companyId>
        <nameId> </nameId>
     </item>
      ....
   </hugeArray>
</root>

My problem is with the array. I don't know how to split it up and read it.

How should I read the whole file? How should I write the JSON?

So far I have been concatenating characters, but I could use the JsonWriter class.

UPDATE:

The algorithm should be able to convert any XML to JSON.

Maverick94
  • @JohnB But I'm using it with `XmlReader`. – Maverick94 Jul 16 '19 at 07:17
  • Have you tried `Load` instead of `LoadXml`? https://learn.microsoft.com/en-us/dotnet/api/system.xml.xmldocument.load?view=netframework-4.8 I am guessing that should stream it in parts instead of loading it all at once. – Slai Jul 16 '19 at 07:43
  • @Slai Yes. It starts reading, but after 5 minutes I get an `OutOfMemoryException`. – Maverick94 Jul 16 '19 at 07:46

2 Answers

Try the recommended Microsoft technique: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/how-to-perform-streaming-transform-of-large-xml-documents

For example, you have the following piece of code:

    while (reader.Read())
    {
        // Stop at the end of the enclosing element.
        if (reader.NodeType == XmlNodeType.EndElement)
            break;
        if (reader.NodeType == XmlNodeType.Element
            && reader.Name == "Item")
        {
            // Materialize only this one element, not the whole document.
            XElement item = XElement.ReadFrom(reader) as XElement;
            if (item != null)
            {
                // here you can get data from your array object
                // and put it to your JSON stream
            }
        }
    }

If you want to determine the kind of an element, you can check whether it has children: How to check if XElement has any child nodes?

It should work well together with streaming the JSON output. For more info about JSON streaming, look into: Writing JSON to a stream without buffering the string in memory
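A minimal sketch of how the two pieces might fit together, assuming the XML layout from the question (a `hugeArray` element whose `item` children contain only flat leaf elements) and the Newtonsoft.Json package already used in the question. The class name `HugeArrayConverter` and the method `Convert` are hypothetical names chosen for illustration. Only one `item` is held in memory at a time, and each one is written to the JSON output as soon as it is read.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;
using Newtonsoft.Json;

// Hypothetical helper, not part of any library.
static class HugeArrayConverter
{
    // Streams every <item> under <hugeArray> into a JSON array of objects.
    public static void Convert(TextReader xmlInput, TextWriter jsonOutput)
    {
        using (XmlReader reader = XmlReader.Create(xmlInput))
        using (JsonTextWriter writer = new JsonTextWriter(jsonOutput))
        {
            writer.WriteStartArray();
            if (reader.ReadToFollowing("hugeArray"))
            {
                // The subtree reader reports EOF at the end of <hugeArray>.
                using (XmlReader items = reader.ReadSubtree())
                {
                    items.Read(); // position on <hugeArray>
                    items.Read(); // move to its first child node
                    while (!items.EOF)
                    {
                        if (items.NodeType == XmlNodeType.Element && items.Name == "item")
                        {
                            // ReadFrom consumes the element and leaves the reader
                            // on the node that follows </item>, so no extra Read().
                            XElement item = (XElement)XNode.ReadFrom(items);
                            writer.WriteStartObject();
                            foreach (XElement field in item.Elements())
                            {
                                // Assumption: each child is a leaf with text content.
                                writer.WritePropertyName(field.Name.LocalName);
                                writer.WriteValue(field.Value.Trim());
                            }
                            writer.WriteEndObject();
                        }
                        else
                        {
                            items.Read(); // skip whitespace, comments, other nodes
                        }
                    }
                }
            }
            writer.WriteEndArray();
        }
    }
}
```

For a real 2 GB file you would pass a `StreamReader` over the input file and a `StreamWriter` over the output file, so neither side is buffered as a string. Every value is written as a JSON string here; deciding which values are numbers, and which elements are arrays or objects in arbitrary XML, is left open.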

kami4ka
  • I saw this link during my research, but I can't find a way to do it. I think this technique is for filtering data, but I need the whole file. – Maverick94 Jul 16 '19 at 07:24
  • Edited. Please check the second part about streaming JSON. There you can find an example of writing parsed arrays and objects. – kami4ka Jul 16 '19 at 07:26
  • I could use the `JsonWriter` like in the example, but first I need to read the whole file, and I think the streaming transform is for filtering data. I don't know how to read everything. – Maverick94 Jul 16 '19 at 07:40
  • There is probably no way to read THE WHOLE file at once without loading it into memory and only then iterating over the nodes. Instead, read nodes from the XML stream and write the transformed nodes to JSON, then read the next chunk of the stream, transform it, and write it to JSON, and so on until the end of the file. – kami4ka Jul 16 '19 at 07:44
  • Yes, you're right. But how should I read it? I have an array with 12 million entries. How do you use LINQ there? I can't filter the data. Can I use LINQ to read it in parts? – Maverick94 Jul 16 '19 at 07:49
  • But how do you know where an array starts? You can't know that with `XmlReader`. In this case the tag name is hardcoded, but this algorithm should be able to read any XML file. – Maverick94 Jul 16 '19 at 08:08
  • You can check that the node name is `hugeArray`; inside this node you will find your array items. Please read the Microsoft article and try to understand how it works. There, `Customer` plays the role of your `hugeArray`. – kami4ka Jul 16 '19 at 08:10
  • You can check the node name for this XML file, but another file could use a different array name. And with `XmlReader` you can't know whether a node is an array or not. – Maverick94 Jul 16 '19 at 08:16
  • You can check whether an XML element contains other elements to decide how to transform it. Please refer to: https://stackoverflow.com/questions/37811047/how-to-check-if-xelement-has-any-child-nodes Also, when converting to JSON you have to decide which elements become arrays and which become objects. But that is an implementation detail of your own algorithm and not related to the question. – kami4ka Jul 16 '19 at 08:31
  • Still the same problem: you have to convert the XML node from the reader to an `XElement` to know whether it has children. If you have an array with 12 million entries, converting it to an `XElement` blows up. – Maverick94 Jul 16 '19 at 08:39
  • You can check whether the next N sub-elements have the same name to decide that this is an array; otherwise, treat it as an object. – kami4ka Jul 16 '19 at 08:42
  • In the end, your approach helped me. – Maverick94 Jul 16 '19 at 12:56
Huge files require streaming with XmlReader. The code below uses a combination of XmlReader and LINQ to XML:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication120
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            List<Dictionary<string, string>> items = new List<Dictionary<string, string>>();

            // Dispose the reader so the file handle is released.
            using (XmlReader reader = XmlReader.Create(FILENAME))
            {
                // Skip everything before the large array.
                reader.ReadToFollowing("hugeArray");

                while (!reader.EOF)
                {
                    // XElement.ReadFrom leaves the reader on the node after </item>,
                    // so only advance when not already positioned on the next <item>.
                    if (reader.Name != "item")
                    {
                        reader.ReadToFollowing("item");
                    }
                    if (!reader.EOF)
                    {
                        // Load a single <item> into memory, never the whole document.
                        XElement item = (XElement)XElement.ReadFrom(reader);
                        Dictionary<string, string> dict = item.Elements()
                            .GroupBy(x => x.Name.LocalName, y => (string)y)
                            .ToDictionary(x => x.Key, y => y.FirstOrDefault());

                        items.Add(dict);
                    }
                }
            }
        }
    }
}
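With around 12 million entries, collecting every dictionary in a list may itself exhaust memory. One possible variation, sketched here under the assumption that items can be processed one by one, is to expose them through an iterator so the caller decides what to keep. The names `ItemStream` and `ReadItems` are hypothetical, and the array and item names are passed in rather than hardcoded.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
static class ItemStream
{
    // Lazily yields each direct child named itemName of the first arrayName element.
    public static IEnumerable<XElement> ReadItems(TextReader input, string arrayName, string itemName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            if (!reader.ReadToFollowing(arrayName))
                yield break;

            int arrayDepth = reader.Depth;
            reader.Read(); // step past the array's start tag

            // Nodes inside the array are deeper than the array element itself;
            // its end tag (or whatever follows an empty array) is not.
            while (!reader.EOF && reader.Depth > arrayDepth)
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.Depth == arrayDepth + 1
                    && reader.Name == itemName)
                {
                    // ReadFrom advances the reader past the item's end tag.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

A caller could then write `foreach (XElement item in ItemStream.ReadItems(File.OpenText(path), "hugeArray", "item"))` and emit JSON for each item inside the loop, so that only the current item is alive at any time.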
jdweng