
I have a very large JSON file; the cars array below can contain up to 100,000,000 records. The total file size can vary from 500 MB to 10 GB. I am using Newtonsoft Json.NET.

Input

{
"name": "John",
"age": "30",
"cars": [{
    "brand": "ABC",
    "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
    "year": "2019",
    "month": "1",
    "day": "1"
}, {
    "brand": "XYZ",
    "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
    "year": "2019",
    "month": "10",
    "day": "01"
}],
"TestCity": "TestCityValue",
"TestCity1": "TestCityValue1"}

Desired Output File 1 JSON

{
    "name": "John",
    "age": "30",
    "cars": {
        "brand": "ABC",
        "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
        "year": "2019",
        "month": "1",
        "day": "1"
    },
    "TestCity": "TestCityValue",
    "TestCity1": "TestCityValue1"
}

File 2 JSON

{
    "name": "John",
    "age": "30",
    "cars": {
        "brand": "XYZ",
        "models": ["Alhambra", "Altea", "AlteaXL", "Arosa", "Cordoba", "CordobaVario", "Exeo", "Ibiza", "IbizaST", "ExeoST", "Leon", "LeonST", "Inca", "Mii", "Toledo"],
        "year": "2019",
        "month": "10",
        "day": "01"
    },
    "TestCity": "TestCityValue",
    "TestCity1": "TestCityValue1"
}

So I came up with the following code, which kind of works:

    public static void SplitJson(Uri objUri, string splitbyProperty)
    {
        try
        {
            bool readinside = false;
            HttpClient client = new HttpClient();
            using (Stream stream = client.GetStreamAsync(objUri).Result)
            using (StreamReader streamReader = new StreamReader(stream))
            using (JsonTextReader reader = new JsonTextReader(streamReader))
            {
                Node objnode = new Node();
                while (reader.Read())
                {
                    if (reader.TokenType == JsonToken.String && reader.Path.ToString().Contains("name") && !reader.Value.ToString().Equals(reader.Path.ToString()))
                    {
                        objnode.name = reader.Value.ToString();
                    }

                    if (reader.TokenType == JsonToken.Integer && reader.Path.ToString().Contains("age") && !reader.Value.ToString().Equals(reader.Path.ToString()))
                    {
                        objnode.age = reader.Value.ToString();

                    }

                    if (reader.Path.ToString().Contains(splitbyProperty) && reader.TokenType == JsonToken.StartArray)
                    {
                        int counter = 0;
                        while (reader.Read())
                        {
                            if (reader.TokenType == JsonToken.StartObject)
                            {
                                counter = counter + 1;
                                var item = JsonSerializer.Create().Deserialize<Car>(reader);
                                objnode.cars = new List<Car>();
                                objnode.cars.Add(item);
                                insertIntoFileSystem(objnode, counter);
                            }

                            if (reader.TokenType == JsonToken.EndArray)
                                break;
                        }
                    }

                }

            }

        }
        catch (Exception)
        {

            throw;
        }
    }
    public static void insertIntoFileSystem(Node objNode, int counter)
    {

        string fileName = @"C:\Temp\output_" + objNode.name + "_" + objNode.age + "_" + counter + ".json";
        var serialiser = new JsonSerializer();
        using (TextWriter tw = new StreamWriter(fileName))
        {
            using (StringWriter textWriter = new StringWriter())
            {
                serialiser.Serialize(textWriter, objNode);
                tw.WriteLine(textWriter);
            }
        }
    }

ISSUE

  1. Any field after the array is not being captured when the file is large. Is there a way to skip the large array, or to process it in parallel with the reader? In short, I am not able to capture the part below using my code:

    "TestCity": "TestCityValue", "TestCity1": "TestCityValue1"}

  • Your question started well, but you have way too many issues in it, so unfortunately it's too broad. – Zohar Peled Jan 23 '19 at 12:24
  • Will update my Question – Tushar Narang Jan 23 '19 at 12:37
  • @ZoharPeled: I hope it is to the point now. – Tushar Narang Jan 23 '19 at 12:40
  • Can't you gather the whole JSON and split it after? – Felix Arnold Jan 23 '19 at 13:03
  • Retracted my close vote. – Zohar Peled Jan 23 '19 at 13:09
  • @MaxMustermann Gathering the whole JSON at once will give an out of memory exception, as a single object will not be able to hold such large data – Tushar Narang Jan 23 '19 at 13:32
  • Just so I understand what you are trying to do here-- you have JSON containing an array of 100M items and you want to split it into 100M files, one file for each item in the array, such that each file contains a copy of all of the information outside of the array as well? – Brian Rogers Jan 23 '19 at 17:40
  • @BrianRogers That is correct – Tushar Narang Jan 24 '19 at 07:14
  • Well then that sounds like a duplicate of [Strategy for splitting a large JSON file](https://stackoverflow.com/q/31410187/3744182)... or maybe not, *Be aware that there is a limit of 16384 open files at one time, so if you need to create more split files, this won't work.* – dbc Jan 24 '19 at 08:08
  • @dbc I implemented the same code line by line; the code has an issue as it is using a StringBuilder to append data, which I solved with a BlockingCollection. The issue there is the out of memory exception as well. Also, my question has a specific query marked under ISSUE. – Tushar Narang Jan 24 '19 at 08:51
  • *I implemented the same code line by line, the code has an issue as it is using string builder to append data* - You mean you implemented the code from [this answer](https://stackoverflow.com/a/31422566/3744182)? Or had you not seen that answer? The answer there definitely does handle trailing as well as leading properties, you can see in the demo fiddle https://dotnetfiddle.net/Q1Kqdk that the property `"headerNamePost"` is added to each file. The only problem may be that it leaves too many files open at once. – dbc Jan 24 '19 at 09:13
  • @dbc Yes, I initially searched Stack Overflow and came across the same post, so I tried to implement it, but it gave me an out of memory exception. After trying several such answers from Stack Overflow and Google, I lost hope and started writing my own code. My current code processes a 10 GB file using less than 82 MB of memory, so I just need to figure out how to get the elements after the array as well. It takes around 10.5 minutes to do so. I am reading the file from an Azure blob. – Tushar Narang Jan 24 '19 at 09:20
  • Can I recommend this article about JSON streaming: https://en.wikipedia.org/wiki/JSON_streaming. Maybe it helps. Otherwise you need a custom parser, and worse, you need to transfer all car records before reading those field values. I think treating the querying of cars for each person as a completely separate process is a much better practice. Therefore this problem needs to be solved on the JSON-generating side. – Onur Feb 07 '19 at 11:06

1 Answer


You are going to need to process your large JSON file in two passes to achieve the result you want.

In the first pass, split the file into two: create one file containing just the huge array, and a second file containing all the other information, which will be used as a template for the individual JSON files you ultimately want to create.

In the second pass, read the template file into memory (I'm assuming this part of the JSON is relatively small, so this should not be a problem), then use a reader to process the array file one item at a time. For each item, combine it with the template and write it out to a separate file.

At the end, you can delete the temporary array and template files.

Here is what it might look like in code:

using System;
using System.IO;
using System.Text;
using System.Net.Http;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static void SplitJson(Uri objUri, string arrayPropertyName)
{
    string templateFileName = @"C:\Temp\template.json";
    string arrayFileName = @"C:\Temp\array.json";

    // Split the original JSON stream into two temporary files:
    // one that has the huge array and one that has everything else
    HttpClient client = new HttpClient();
    using (Stream stream = client.GetStreamAsync(objUri).Result)
    using (JsonReader reader = new JsonTextReader(new StreamReader(stream)))
    using (JsonWriter templateWriter = new JsonTextWriter(new StreamWriter(templateFileName)))
    using (JsonWriter arrayWriter = new JsonTextWriter(new StreamWriter(arrayFileName)))
    {
        if (reader.Read() && reader.TokenType == JsonToken.StartObject)
        {
            templateWriter.WriteStartObject();
            while (reader.Read() && reader.TokenType != JsonToken.EndObject)
            {
                string propertyName = (string)reader.Value;
                reader.Read();
                templateWriter.WritePropertyName(propertyName);
                if (propertyName == arrayPropertyName)
                {
                    arrayWriter.WriteToken(reader);
                    templateWriter.WriteStartObject();  // empty placeholder object
                    templateWriter.WriteEndObject();
                }
                else if (reader.TokenType == JsonToken.StartObject ||
                         reader.TokenType == JsonToken.StartArray)
                {
                    templateWriter.WriteToken(reader);
                }
                else
                {
                    templateWriter.WriteValue(reader.Value);
                }
            }
            templateWriter.WriteEndObject();
        }
    }

    // Now read the huge array file and combine each item in the array
    // with the template to make new files
    JObject template = JObject.Parse(File.ReadAllText(templateFileName));
    using (JsonReader arrayReader = new JsonTextReader(new StreamReader(arrayFileName)))
    {
        int counter = 0;
        while (arrayReader.Read())
        {
            if (arrayReader.TokenType == JsonToken.StartObject)
            {
                counter++;
                JObject item = JObject.Load(arrayReader);
                template[arrayPropertyName] = item;
                string fileName = string.Format(@"C:\Temp\output_{0}_{1}_{2}.json",
                                                template["name"], template["age"], counter);

                File.WriteAllText(fileName, template.ToString());
            }
        }
    }

    // Clean up temporary files
    File.Delete(templateFileName);
    File.Delete(arrayFileName);
}
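
For example, the method might be invoked like this (the URL is just a placeholder for wherever the large JSON actually lives; "cars" is the array property from the sample above):

SplitJson(new Uri("https://example.com/data.json"), "cars");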

Note that the above approach will require roughly double the disk space of the original JSON during processing, because of the temporary files. If this is a problem, you can modify the code to download the file twice instead (although this will likely increase the processing time). In the first download, create the template JSON and ignore the array; in the second download, advance to the array and process it with the template as before to create the output files. A rough sketch of that variant follows.
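
Here is a rough, untested sketch of what that second download might look like. It reuses client, objUri, arrayPropertyName and the already-built template from the code above, and assumes the first download was changed to call reader.Skip() instead of arrayWriter.WriteToken(reader), so no temporary array file is written:

// Second download: stream the huge array directly into output files,
// without writing a temporary array file to disk.
using (Stream stream = client.GetStreamAsync(objUri).Result)
using (JsonReader reader = new JsonTextReader(new StreamReader(stream)))
{
    int counter = 0;
    while (reader.Read())
    {
        // Fast-forward until we reach the top-level property that holds the huge array
        if (reader.TokenType == JsonToken.PropertyName &&
            (string)reader.Value == arrayPropertyName)
        {
            reader.Read(); // move onto the StartArray token
            while (reader.Read() && reader.TokenType != JsonToken.EndArray)
            {
                if (reader.TokenType == JsonToken.StartObject)
                {
                    counter++;
                    // Load one array item at a time to keep memory usage flat
                    template[arrayPropertyName] = JObject.Load(reader);
                    string fileName = string.Format(@"C:\Temp\output_{0}_{1}_{2}.json",
                                                    template["name"], template["age"], counter);
                    File.WriteAllText(fileName, template.ToString());
                }
            }
            break; // trailing properties are already part of the template
        }
    }
}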

Brian Rogers