
I have asked a similar question before, so please excuse the repetition; the code here is different and it is what I am trying to fix.

Below is the code where I read a JSON file and copy its contents into Azure Table Storage. The JSON files are read from blob storage. Right now I read the whole file into memory and pass the content on to be copied, but large JSON files could cause an out-of-memory exception. I would like to process the blob as a stream instead of holding it all in memory. How should I do that?

Sample JSON to read:

{"PartitionKey": "test","RowKey": "7tttt","IdPit": 653,"Class": "A76","Power": 323,"Time": "04/23/2012 18:25:43","bits": "test"}
{"PartitionKey": "test","RowKey": "itttt","IdPit": 432,"Class": "B65","Power": 23,"Time": "04/22/2012 18:25:43","bits": "Ttest"}

Code for reading the JSON:

List<string> lines = new List<string>();

foreach (var fileName in recFiles)
{
    // The entire blob is buffered into memory here, which is the problem.
    Stream data = await DownloadBlob(containerName, fileName, connectionString);
    StreamReader reader = new StreamReader(data, Encoding.UTF8);
    string dataContents = reader.ReadToEnd();
    lines.Add(dataContents);
}

await PopulateTable(lines);

DownloadBlob:

public async Task<Stream> DownloadBlob(string containerName, string fileName, string connectionString)
{
    Microsoft.Azure.Storage.CloudStorageAccount storageAccount = Microsoft.Azure.Storage.CloudStorageAccount.Parse(connectionString);
    CloudBlobClient serviceClient = storageAccount.CreateCloudBlobClient();
    CloudBlobContainer container = serviceClient.GetContainerReference(containerName);
    CloudBlockBlob blob = container.GetBlockBlobReference(fileName);

    // Use the async existence check rather than blocking inside an async method.
    if (!await blob.ExistsAsync())
    {
        throw new Exception("Blob not found");
    }

    // OpenReadAsync returns a stream over the blob without downloading it all at once.
    return await blob.OpenReadAsync();
}

Reading and uploading the JSON:

public async Task<List<DynamicTableEntity>> PopulateTable(IEnumerable<string> lines)
{
    var validator = new JsonSchemaValidator();
    var tableData = new List<JObject>();           
            
    // Validate all entries
    foreach (var line in lines)
    {
        if (string.IsNullOrWhiteSpace(line))
            continue;

        var data = JsonConvert.DeserializeObject<JObject>(line);
        ...... // adding to table
    }
}
Ankit Kumar
  • Not quite sure what you mean by "read this as a stream and not store it as memory". A stream is generally an in-memory buffer for I/O operations. – phuzi Dec 31 '20 at 14:45
  • If you look at the line `string dataContents = reader.ReadToEnd();`, that is where I read the contents and allocate the memory. I don't want to do that, as it will give me a memory exception when reading large files. – Ankit Kumar Dec 31 '20 at 14:52
  • But you're adding the content to your list, which is still in memory! How big a content are you expecting? Is this an actual problem or something you think will be a problem in the future? – phuzi Dec 31 '20 at 14:57
  • What does your JSON look like? Is it a large array of small items or one giant JSON object? – Dan Csharpster Dec 31 '20 at 14:58
  • The JSON can be around 5GB or more. Exactly what you said is correct, and I would like to avoid reading the data and adding it to a list. Updating the question with sample JSON. – Ankit Kumar Dec 31 '20 at 18:15
  • @AnkitKumar Do not put every line of the download in a list that you will only consume later line by line anyway. Instead, read one line and process it directly by deserializing it. That way you have only the current line in memory. After that, you read the next line from the `Stream`. – Progman Dec 31 '20 at 18:25
  • Won't that impact the performance? That seems like a time-consuming way. – Ankit Kumar Dec 31 '20 at 18:29
  • @AnkitKumar Why do you think it would impact the performance? Or why do you think that approach will be time consuming? – Progman Dec 31 '20 at 18:37
  • That sample "JSON" looks to be [Newline Delimited JSON](http://ndjson.org/). You can deserialize NDJSON directly from a stream using Json.NET by setting the `SupportMultipleContent` setting, see [What is the correct way to use JSON.NET to parse stream of JSON objects?](https://stackoverflow.com/q/26601594/3744182) and [Line delimited json serializing and de-serializing](https://stackoverflow.com/q/29729063/3744182). (A sketch of this approach appears after the comments.) – dbc Dec 31 '20 at 20:48
  • @Progman can you give a sample of the code you meant by consuming it line by line later? – Ankit Kumar Dec 31 '20 at 21:17
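
For illustration, here is a minimal sketch of the streaming approach dbc's comment describes, using Json.NET's SupportMultipleContent setting. The wrapper method name ProcessBlobAsync is hypothetical; PopulateTable is the method from the question, and the code assumes the Newtonsoft.Json and Newtonsoft.Json.Linq namespaces.

// Minimal sketch, assuming Json.NET (Newtonsoft.Json) and the
// PopulateTable(IEnumerable<string>) method from the question.
public async Task ProcessBlobAsync(Stream data)
{
    using (var streamReader = new StreamReader(data, Encoding.UTF8))
    using (var jsonReader = new JsonTextReader(streamReader)
    {
        // Accept multiple root-level JSON objects in one stream (NDJSON).
        SupportMultipleContent = true
    })
    {
        var serializer = new JsonSerializer();
        while (jsonReader.Read())
        {
            if (jsonReader.TokenType == JsonToken.StartObject)
            {
                // Only one object is materialized in memory at a time.
                var jobject = serializer.Deserialize<JObject>(jsonReader);
                await PopulateTable(new[] { jobject.ToString(Formatting.None) });
            }
        }
    }
}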

1 Answer


Something like this might work. I'm assuming this is more of an array of items; I didn't have time to fully test it out.

Updated:


Short Answer:

So I had a chance to look at this closer. Here is the short answer that should give you what you want. However, I don't think you should use this. I explain why in the long answer.

foreach (var fileName in recFiles)
{
    using (Stream data = await DownloadBlob(containerName, fileName, connectionString))
    {
        using (StreamReader streamReader = new StreamReader(data, Encoding.UTF8))
        {
            while (streamReader.Peek() >= 0)
            {
                // Read and process one line at a time, so only the
                // current line is ever held in memory.
                var line = streamReader.ReadLine();
                await PopulateTable(new[] { line });
            }
        }
    }
}
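
If calling PopulateTable once per line turns out to be slow (the performance concern raised in the comments), a middle ground is to buffer a bounded number of lines and flush them in chunks. This is an untested sketch; the batch size of 100 is an arbitrary illustrative choice, and PopulateTable is the method from the question.

const int BatchSize = 100; // illustrative chunk size; tune as needed
var buffer = new List<string>(BatchSize);

using (Stream data = await DownloadBlob(containerName, fileName, connectionString))
using (var streamReader = new StreamReader(data, Encoding.UTF8))
{
    string line;
    while ((line = await streamReader.ReadLineAsync()) != null)
    {
        if (string.IsNullOrWhiteSpace(line))
            continue;

        buffer.Add(line);
        if (buffer.Count == BatchSize)
        {
            await PopulateTable(buffer); // flush a bounded chunk
            buffer.Clear();
        }
    }

    if (buffer.Count > 0)
        await PopulateTable(buffer); // flush the remainder
}

Memory stays bounded by the batch size rather than by the file size.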

Long Answer:

The long answer is that your example JSON input is not a single valid JSON document; it is a sequence of objects, one per line. The bad news is that this makes it harder for JSON APIs to work with. The good news is that, since each line is valid JSON on its own, you can use normal streaming and read each line, one at a time. To make it proper JSON instead, it takes two steps:

  1. Change your JSON to array format:
[
{"PartitionKey": "test","RowKey": "7tttt","IdPit": 653,"Class": "A76","Power": 323,"Time": "04/23/2012 18:25:43","bits": "test"},
{"PartitionKey": "test","RowKey": "itttt","IdPit": 432,"Class": "B65","Power": 23,"Time": "04/22/2012 18:25:43","bits": "Ttest"},
{"PartitionKey": "test","RowKey": "1itttt","IdPit": 432,"Class": "B65","Power": 23,"Time": "04/22/2012 18:25:43","bits": "Ttest"},
{"PartitionKey": "test","RowKey": "2itttt","IdPit": 432,"Class": "B65","Power": 23,"Time": "04/22/2012 18:25:43","bits": "Ttest"},
{"PartitionKey": "test","RowKey": "3itttt","IdPit": 432,"Class": "B65","Power": 23,"Time": "04/22/2012 18:25:43","bits": "Ttest"}
]

and

  2. Read it in as JSON. Update your code to something like this:

     JsonSerializer serializer = new JsonSerializer();

     foreach (var fileName in recFiles)
     {
         using (Stream data = await DownloadBlob(containerName, fileName, connectionString))
         {
             using (StreamReader streamReader = new StreamReader(data, Encoding.UTF8))
             {
                 using (JsonReader reader = new JsonTextReader(streamReader))
                 {
                     // JsonTextReader streams tokens, so only one object
                     // is deserialized into memory at a time.
                     while (reader.Read())
                     {
                         if (reader.TokenType == JsonToken.StartObject)
                         {
                             var jobject = serializer.Deserialize<JObject>(reader);
                             await PopulateTable(new[] { jobject.ToString(Newtonsoft.Json.Formatting.None) });
                         }
                     }
                 }
             }
         }
     }
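
For completeness: the question elides the actual table write ("adding to table"). Below is a hypothetical sketch, not from the question or the answer, of what that step could look like with the Microsoft.Azure.Cosmos.Table SDK; the CloudTable parameter and the string-only property mapping are illustrative assumptions.

// Hypothetical sketch of the elided table write, assuming
// the Microsoft.Azure.Cosmos.Table SDK.
public async Task InsertLineAsync(JObject jobject, CloudTable table)
{
    // PartitionKey and RowKey come from the JSON itself, per the sample data.
    var entity = new DynamicTableEntity(
        jobject["PartitionKey"].ToString(),
        jobject["RowKey"].ToString());

    // Copy the remaining properties; every value is stored as a string
    // here for simplicity.
    foreach (var prop in jobject.Properties())
    {
        if (prop.Name == "PartitionKey" || prop.Name == "RowKey")
            continue;
        entity.Properties[prop.Name] =
            EntityProperty.GeneratePropertyForString(prop.Value.ToString());
    }

    await table.ExecuteAsync(TableOperation.InsertOrReplace(entity));
}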
    
Dan Csharpster