1

I'm trying to read rows from a massive (1.2GB JSON file). I want to read attributes without having to define a class first. I had to redefine my initial question, as I also want to be able to access a nested value by using a single string path like so: facilities.totalSize.value

Because of the change in requirements, I think the best approach seems to be this as suggested by tgolisch, but my code returns a "Object reference not set to an instance of an object." when I try to access the value.

Dim req As HttpWebRequest = CType(WebRequest.Create("https://www.example.com/sample.json"), HttpWebRequest)
Using resp = req.GetResponse()
    Using stream = resp.GetResponseStream()
        Using reader = New JsonTextReader(New StreamReader(stream))
            While reader.Read()
                If reader.TokenType = JsonToken.StartObject Then
                    Dim jt As JToken = CType(reader.Value, JToken)
'here I also tried creating a jt variable like so: Dim jt As JToken = JToken.Load(reader) but getting the same error
                    Log("jt totalSize:" + jt.SelectToken("facilities.totalSize.value").Value(Of Object).ToString())
                End If
                'how can I retrieve the value for "facilities.totalSize.value" attribute here?
            End While
        End Using
    End Using
End Using

sample.json

[
   {
      "countryid":1,
      "price":2997,
      "facilities":{
         "totalSize":{
            "value":80
         }
      }
   },
   {
      "countryid":1,
      "price":250,
      "facilities":{
         "totalSize":{
            "value":30
         }
      }
   }
]

UPDATE 1

I'd like the code to be flexible and work on this format too in listings.json

{
   "generatedAt":"2022-05-02 02:03:25",
   "listings":[
      {
         "countryid":1,
         "publishdate":"2022-04-02 02:03:25",
         "location":{
            "neighborhood":"Finthen",
            "city":"Mainz",
            "country":"Germany"
         },
         "facilities":{
            "bedrooms":{
               "value":2
            },
            "totalSize":{
               "value":"100"
            }
         }
      },
      {
         "countryid":2,
         "publishdate":"2022-02-02 02:03:24",
         "location":{
            "neighborhood":"Ubers",
            "city":"NYC",
            "country":"USA"
         },
         "facilities":{
            "bedrooms":{
               "value":3
            },
            "totalSize":{
               "value":"150"
            }
         }
      }
   ],
   "count":1077
}

I have many differently formatted JSON files because I work with different data partners, some provide feeds that contain tokens called "products", others called "listings", others have no wrapper objects at all. My question now includes samples of both.

Hope that make sense on where I'm coming from.

I got your first code example to work on sample.json, but as you said, that code does not work listings.json

So, I have to go for example 2 and use the regex. I tried on listings.json:

Using stream = resp.GetResponseStream()
    Dim regex As Regex = New Regex("^\[[0-9]+\]\.facilities\.totalSize\.value$", RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)
    For Each value As Decimal In JsonExtensions.DeserializeItemsByPath(Of Decimal)(stream, regex)
        Log("totalSize", String.Format("totalSize: {0}", value.ToString))
    Next
End Using

But it does not return any values, why?

UPDATE 2

Weird, https://codebeautify.org/jsonviewer showed my earlier listings.json as valid. I updated listings.json with more limited fields so it's less cumbersome to read, I validated the format before pasting here.

I incorrectly assumed that having code for selecting a single field would give me enough for me to extend myself to also use that code for multiple field selection.

Anyway, each listing has dozens of fields, but I kept it to a few in my sample json for legibility. What I want to extract from each listing are fields from different types, i.e.:

  • countryid of type integer.
  • country of type string
  • publishdate of type date

And there might be more types in the future, like decimal, boolean etc. I'm not sure what your comment about "bedrooms" means as my regex example was about "totalSize".

Here's what I have now as code, where each For each block returns a value (except for kindLabel because I think the string value conflicts with the Decimal value selector), so my regex works. But I don't want to For each through each occurrence of totalSize I want to For each through each listing and get the values totalSize, bedrooms, kindLabel etc. and THEN go to the next item. So I need to process each individual listing, get its attributes/tokens and then move to the next listing.

Dim regex As Regex = New Regex("^listings\[[0-9]+\]\.facilities\.totalSize\.value$",
               RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)

For Each value As Decimal In JsonExtensions.DeserializeItemsByPath(Of Decimal)(stream, regex)
    ReportError("totalSize", String.Format("totalSize: {0}", value.ToString))
Next

regex = New Regex("^listings\[[0-9]+\]\.facilities\.bedrooms\.value$", RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)
For Each value As Decimal In JsonExtensions.DeserializeItemsByPath(Of Decimal)(stream, regex)
    ReportError("bedrooms", String.Format("bedrooms: {0}", value.ToString))
Next

regex = New Regex("^listings\[[0-9]+\]\.kindLabel$", RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)
For Each value As Decimal In JsonExtensions.DeserializeItemsByPath(Of Decimal)(stream, regex)
    ReportError("kindLabel", String.Format("kindLabel: {0}", value.ToString))
Next

I hope the use case and sample data are up to par now :) Thanks for your effort once again!

UPDATE 3

Thank again @dbc. I think I can summarize the thread so far and summarizing what I've learned to this last question: "how can I read multiple values from a single listing object before moving to the next?"

    Using stream = resp.GetResponseStream()

        'HERE I WANT TO READ VALUE of `facilities.totalSize.value` from first listing object, so "100"
        Dim regex As Regex = New Regex("^listings\[[0-9]+\]\.facilities\.totalSize\.value$",
                       RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)

        'HERE I WANT TO READ VALUE of `facilities.bedrooms.value` from first listing object, so "2"
        regex = New Regex("^listings\[[0-9]+\]\.facilities\.bedrooms\.value$", RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)

        'HERE I WANT TO READ VALUE of `publishdate` from first listing object, so "2022-04-02 02:03:25"
        regex = New Regex("^listings\[[0-9]+\]\.publishdate$", RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)

        'now move to next listing

    End Using
    
Adam
  • 6,041
  • 36
  • 120
  • 208
  • 1
    Use `reader.TokenType` to determine where you are in the json data. When on the "countryid" node, `TokenType` will be `PropertyName` and `reader.Value` will be "countryId" etc. When `TokenType` is of `StartObject` type, then you can call `var obj = JsonSerializer.CreateDefault().Deserialize(reader)` to get the "local" object (and continue reading if you want). – Simon Mourier Oct 11 '22 at 12:32
  • Have you tried jsonpath? https://stackoverflow.com/questions/41643843/jsonpath-with-jsontextreader-token-at-a-time – tgolisch Oct 11 '22 at 14:43
  • @SimonMourier thank you for your suggestion, my requirements changed slightly so I think I needed to alter my ask, if you could have another look? – Adam Oct 15 '22 at 17:49
  • @tgolisch thanks, I think that route would work for me, though I can't seem to access the value, if you could help me out? – Adam Oct 15 '22 at 17:49
  • JSON is not XML, it doesn't have elements with attributes, it has arrays and objects. The objects have nested properties which have nested values, but the properties and values are children of the object, not direct attributes. – dbc Oct 16 '22 at 19:31
  • What you need to do instead is to stream through the array and load each object as a small chunk as shown in [this answer](https://stackoverflow.com/a/43747641) by [nocodename](https://stackoverflow.com/users/7912895/nocodename) to [How to parse huge JSON file as stream in Json.NET?](https://stackoverflow.com/q/43747477). Does that answer your question sufficiently, or do you need help rewriting that answer to vb.net? – dbc Oct 16 '22 at 19:32
  • My comment is still valid. You can read the json stream chunk by chunk you don't specifically need json path. – Simon Mourier Oct 18 '22 at 06:01
  • @SimonMourier is there a difference in performance that you're aware of? – Adam Oct 19 '22 at 10:47
  • @dbc Looking at that post it looks like nocodename prefers the answer by spender, still it's quite complex to me – Adam Oct 19 '22 at 10:47
  • Not specifically, I don't know how json path is implemented, but you can't beat the JsonReader (at best json path uses it). – Simon Mourier Oct 19 '22 at 11:13
  • *I'm not sure what your comment about "bedrooms" means as my regex example was about "totalSize".* -- You asked why your regex did not work. In that [version](https://stackoverflow.com/posts/74027865/revisions) of your question, your regex showed `.facilities\.totalSize\.value` but your JSON had `facilities.bedrooms.value` which explains in part why the regex did not work. – dbc Oct 22 '22 at 17:01
  • @dbc thanks. I added update 3, I think that's the final part of my question that I'm currently struggling with – Adam Oct 23 '22 at 11:01

1 Answers1

1

If you have a huge JSON file consisting of a top-level array of fairly small objects (which you do in this case), you can stream through the file and deserialize each object lazily by adapting this answer by nocodename to How to parse huge JSON file as stream in Json.NET? to VB.NET.

First, define the following two static methods:

Public Module JsonExtensions
    Iterator Function DeserializeObjects(Of T)(ByVal stream As Stream, ByVal Optional settings As JsonSerializerSettings = Nothing) As IEnumerable(Of T)
        Using textReader = New StreamReader(stream, Encoding.UTF8, true, 1024, true)
            For Each item as T In DeserializeObjects(Of T)(textReader, settings)
                Yield item
            Next
        End Using
    End Function

    ' Adapted from this answer https://stackoverflow.com/a/43747641/3744182
    ' By https://stackoverflow.com/users/7912895/nocodename
    ' To https://stackoverflow.com/questions/43747477/how-to-parse-huge-json-file-as-stream-in-json-net
    Iterator Function DeserializeObjects(Of T)(ByVal textReader As TextReader, ByVal Optional settings As JsonSerializerSettings = Nothing) As IEnumerable(Of T)
        Dim serializer = JsonSerializer.CreateDefault(settings)
        Using reader = New JsonTextReader(textReader) With { .CloseInput = False }
            While reader.Read()
                If reader.TokenType = JsonToken.StartObject Then
                    Yield serializer.Deserialize(Of T)(reader)
                End If
            End While
        End Using
    End Function
End Module

And now you will be able to do:

For Each jt As JObject in JsonExtensions.DeserializeObjects(Of JObject)(stream)
    Dim valueToken as JToken = jt.SelectToken("facilities.totalSize.value")
    If (valueToken IsNot Nothing)
        Console.WriteLine("totalSize: {0}", valueToken)
    End If
Next

This method works by scanning through the file's JSON tokens until it encounters a { marking the start of a JSON object, then deserializes that object to the requested type and Yields its value. This works as needed for JSON file consisting of a top-level array of objects like the one shown in the question.

If on the other hand your huge JSON file has a more complex schema, for instance:

  1. A top-level wrapper object such as:

    {
        "data" : [ /* Some huge array of objects */ }
    }
    
  2. Array items that are themselves huge.

  3. Nested large arrays.

You may wish to adapt this answer by spender to deserialize only those tokens that match a specific JSONPath.

First, add the following static methods to your JsonExtensions Module:

Public Module JsonExtensions
    Iterator Function DeserializeItemsByPath(Of T)(ByVal stream As Stream, regex as Regex, ByVal Optional settings As JsonSerializerSettings = Nothing) As IEnumerable(Of T)
        Using textReader = New StreamReader(stream, Encoding.UTF8, true, 1024, true)
            For Each item as T In DeserializeItemsByPath(Of T)(textReader, regex, settings)
                Yield item
            Next
        End Using
    End Function

    ' Adapted from this answer https://stackoverflow.com/a/56411253/3744182
    ' By https://stackoverflow.com/users/14357/spender
    ' To https://stackoverflow.com/questions/43747477/how-to-parse-huge-json-file-as-stream-in-json-net
    Iterator Function DeserializeItemsByPath(Of T)(ByVal textReader As TextReader, ByVal regex as Regex, ByVal Optional settings As JsonSerializerSettings = Nothing) As IEnumerable(Of T)
        Dim serializer = JsonSerializer.CreateDefault(settings)
        Using reader = New JsonTextReader(textReader) With { .CloseInput = False }
            While reader.Read()
                If reader.TokenType <> JsonToken.PropertyName And regex.IsMatch(reader.Path)
                    Yield serializer.Deserialize(Of T)(reader)
                End If
            End While
        End Using
    End Function
End Module

And now you will be able stream through the file and lazily deserialize only the [*].facilities.totalSize.value values as follows:

Dim regex as Regex = new Regex("^\[[0-9]+\]\.facilities\.totalSize\.value$", RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)
Console.WriteLine(regex)
For Each value As Decimal in JsonExtensions.DeserializeItemsByPath(Of Decimal)(stream, regex)
    Console.WriteLine("totalSize: {0}", value)
Next

Notes:

  • The regex method is clearly the more flexible of the two options, however regex matching itself can have performance implications that you might want to avoid in the simpler case of a top-level array of small objects.

  • If the file consists of top-level wrapper JSON object with a large array property you should not use the first method, as the entire file will be read and returned as a c# single object.

  • No matter which option you choose, only one object of type T is deserialized into memory at a time. As long as you do not accumulate all the objects in some list, as you iterate through the enumerable, the garbage collector will free up previous instances of T making total memory requirements constant in the size of the outer array.

  • The regex ^\[[0-9]+\]\.facilities\.totalSize\.value$ corresponds to the JSONPath wildcard expression [*].facilities.totalSize.value selecting the values facilities.totalSize.value inside every item of the outermost array.

  • RegexOptions.CultureInvariant is used to ensure that path matching is language-independent.

  • RegexOptions.Compiled is used to improve performance, as you will be making many calls to regex.IsMatch().

  • RegexOptions.Singleline is used to correctly handle the (unlikely) case of a JSON property name that includes a newline character.

  • You may also want to use RegexOptions.IgnoreCase to support both pascal and camel casing.

Demo fiddle showing both options here.

UPDATE 1

If your various JSON samples differ in whether their outer containers are JSON objects or arrays, then JsonExtensions.DeserializeItemsByPath() will be the more useful choice.

The reason that your updated code does not work with your second JSON sample is that the regex you use,

^\[[0-9]+\]\.facilities\.totalSize\.value$

Does not match the JSONPath of the "value" tokens, which is:

listings[*].facilities.bedrooms.value

If you are ever uncertain about the path to use for a particular JSON token, you can load your sample JSON file into a JToken hierarchy, iterate through the tokens until you find the one you want, then print the value of JToken.Path. For example:

Using textReader = New StreamReader(FileName)
Using jsonReader = new JsonTextREader(textReader)
    Dim root = CType(JToken.Load(jsonReader), JContainer)

    Dim propertyName = "value"
    Dim actualPath = root.DescendantsAndSelf().Where(Function(t) t.Path.EndsWith(propertyName)).Select((Function(t) t.Path)).FirstOrDefault()

    Console.WriteLine("Searching for a path ending in ""{0}"" resulted in actual path ""{1}""", propertyName, actualPath)
End Using
End Using

Prints the value

Searching for a path ending in "value" resulted in actual path "listings[0].facilities.bedrooms.value"

Thus the following Regex will work:

Dim regex as Regex = new Regex("^listings\[[0-9]+\]\.facilities\.totalSize\.value$", _
                               RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)

However, since it seems as though your various JSON files differ in their outer container objects, but you always want to select a value whose path ends in facilities.bedrooms.value, it is probably easiest to use a Regex that matches only the end of the path, like so:

Dim regex as Regex = new Regex("facilities\.totalSize\.value$", _
                               RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)

Demo fiddle #2 here.

UPDATE 3

After this update, it appears you need to scan through a huge JSON file and read multiple predefined values from objects found at some specific path, e.g.:

listings[*].publishdate
listings[*].facilities.totalSize.value
listings[*].facilities.bedrooms.value

Furthermore, you need to capture the groupings of publishdate, facilities.totalSize.value and facilities.bedrooms.value within a single object.

You can accomplish this by scanning through the JSON file for objects that match the lowest common parent of the properties of interest (here listings[*]), deserializing to some intermediate DTO that captures the properties of interest, then finally projecting to a final DTO that contains only those nested properties you need.

First, define the following types:

Friend Class ListingsIntermediateDTO
    Property publishdate As String
    Property facilities as JToken
End Class

Public Class ListingsDTO
    ' kindLabel
    Property publishdate As String
    ' .totalSize.value
    Property totalSizeValue as Decimal?
    ' .bedrooms.value
    Property bedroomsValue as Decimal?
End Class

And now you can write the following method, selecting out the publishdate, facilities.totalSize.value and listings[*].facilities.bedrooms.value values and returning them in a single DTO:

Public Shared Iterator Function GetListings(ByVal fileName as String) as IEnumerable(Of ListingsDTO)
    Using stream = File.OpenRead(FileName)
        Dim regex as Regex = new Regex("^(listings\[[0-9]+\]|\[[0-9]+\])$", _
                                       RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)
        For Each value in _
            JsonExtensions.DeserializeItemsByPath(Of ListingsIntermediateDTO)(stream, regex) _
                .Select(Function(dto) new ListingsDTO With { 
                    .publishdate = dto.publishdate, 
                    .totalSizeValue = dto.facilities.SelectTokens("totalSize.value").Select(Function(t) t.ToObject(Of Decimal?)()).FirstOrDefault(),
                    .bedroomsValue = dto.facilities.SelectTokens("bedrooms.value").Select(Function(t) t.ToObject(Of Decimal?)()).FirstOrDefault()
                })
            Yield value
        Next
    End Using
End Function 

Notes:

  • According to the JSON specification, a JSON object is defined as follows:

    An object is an unordered set of name/value pairs. An object begins with { left brace and ends with } right brace. Each name is followed by :colon and the name/value pairs are separated by ,comma.

    Thus your code should not make any assumptions that, within a specific object, one property such as publishdate appears before another such as facilities. Unless you have complete control over how the JSON is serialized, the JSON provider may change the order in which properties are serialized and claim that the change is non-breaking.

  • The intermediate DTO could be a JObject. The only purpose in defining a typed intermediate DTO ListingsIntermediateDTO is to reduce memory footprint by skipping nonrequired JSON properties.

    If you would prefer to eliminate ListingsIntermediateDTO, write GetListings() as follows:

    Public Shared Iterator Function GetListings(ByVal fileName as String) as IEnumerable(Of ListingsDTO)
        Using stream = File.OpenRead(FileName)
            Dim regex as Regex = new Regex("^(listings\[[0-9]+\]|\[[0-9]+\])$", _
                                           RegexOptions.CultureInvariant Or RegexOptions.Compiled Or RegexOptions.Singleline)
            For Each value in _
                JsonExtensions.DeserializeItemsByPath(Of JObject)(stream, regex) _
                    .Select(Function(dto) new ListingsDTO With { 
                        .publishdate = dto.SelectTokens("publishdate").Select(Function(t) t.ToObject(Of String)()).FirstOrDefault(),
                        .totalSizeValue = dto.SelectTokens("facilities.totalSize.value").Select(Function(t) t.ToObject(Of Decimal?)()).FirstOrDefault(),
                        .bedroomsValue = dto.SelectTokens("facilities.bedrooms.value").Select(Function(t) t.ToObject(Of Decimal?)()).FirstOrDefault()
                    })
                Yield value
            Next
        End Using
    End Function 
    
  • Some of your JSON files have wrapper objects listings[*], while others do not, and have root level arrays [*]. You can support both schemas by using the regex alternation construct (A|B) to match either. Specifically, the regex

    ^(listings\[[0-9]+\]|\[[0-9]+\])$
    

    Supports scanning through either a top-level array, or a top-level object with a listings property with an array value.

Demo fiddle #3 using ListingsIntermediateDTO here and #4 without ListingsIntermediateDTO here.

dbc
  • 104,963
  • 20
  • 228
  • 340
  • Wow, thank you for the extended answer, I'll have to dig into this a bit more :) A few questions come to mind: 1. I want to not only access `facilities.totalSize.value` from each array item, but each key of that array item, so `facilities.totalSize.value`, `price`, `countryid`, because I need to add those keys as a single record into my database. Would that impact which of the two methods you propose is the best to use? 2. in the second method you mention "Array items that are themselves huge." What is considered huge? I have items in an array that are 5-20Kb each. – Adam Oct 20 '22 at 19:19
  • I'd like to adopt a method that is flexible and would allow me to parse JSON files that consist of only 1 very large array with items, as well as ones where there may be a top-level wrapper object. I actually fail to see how a top-level wrapper object influences the recommended methods, because in both cases I'm parsing a JSON array? Also for completeness I added one other JSON structure where I'd like to use the code as well. – Adam Oct 20 '22 at 19:19
  • 1
    @Adam - *I actually fail to see how a top-level wrapper object influences the recommended methods, because in both cases I'm parsing a JSON array?* - the first method reads until it encounters the beginning of an object, then loads that. If the root container is a wrapper object, it would load the entire thing, which you don't want. Better to use the second method with a regex like `^wrapper_property_name\[[0-9]+\]$`. – dbc Oct 20 '22 at 19:34
  • Seeing this `For Each value As Decimal in JsonExtensions.DeserializeItemsByPath(Of Decimal)(stream, regex)` it looks like traversing the entire stream to check a specific path for each item in the total JSON file. What I want is extract each array item and then process. Pseudo: 1. get result 1 from array in sample.json/get result1 from `listings` array into a variable `product` 2. for `product` variable get "price" value 3. for `product` variable get "countryid" value 4. back to top and load result 2 in `product` Your current code seems to process only 1 specific key for each result? – Adam Oct 21 '22 at 10:42
  • @Adam - *it looks like traversing the entire stream to check a specific path for each item in the total JSON file.* - it's not traversing, it's **streaming**. Loading each token individually rather than the entire file. But otherwise you are correct, I wrote my answer that way because that is what was shown in your question... – dbc Oct 21 '22 at 19:22
  • @Adam ... If you need something else, please [edit](https://stackoverflow.com/q/74027865/3744182) your question to include a [mcve], specifically sample JSON that includes all required properties. Your 2nd JSON sample includes `listings[*].countryid` and `listings[*].costs.price` but there is no `product` property shown in either of your JSON samples so I'm not really sure what you need. – dbc Oct 21 '22 at 19:22
  • my bad on the limited explanation, I updated the JSON file with what I use and my attempt at extracting – Adam Oct 21 '22 at 21:51
  • @Adam - your newest `listings.json` JSON sample is malformed, upload to https://jsonformatter.curiousconcept.com/ and you will see several errors. Might I request as well-formed sample please? – dbc Oct 21 '22 at 22:28
  • @Adam - maybe this is the correct JSON? https://dotnetfiddle.net/ismNoo – dbc Oct 21 '22 at 23:39
  • @Adam - I made a guess as to what your well-formed JSON should look like and I updated the answer to **UPDATE 1**. However, in your comments you mention needing to select multiple values. If you still need help with this, please [edit](https://stackoverflow.com/posts/74027865/edit) your question to add sample JSON + requirements in the question (i.e. an **UPDATE 2** I reckon). – dbc Oct 22 '22 at 00:37
  • Traveling so won't be able to test this, I'm going to award the bounty regardless now for your massive effort which I highly appreciate. Will test in a few days. Thanks! – Adam Oct 25 '22 at 07:58