0

We are working on a program to fetch slide image data from a group of servers that doesn't have a consistent schema setup (I'm worried it's invalid, but I'm not proficient enough to make that call). We have no influence over the servers as independent-unrelated researchers.

The data was entered manually (for the most part), through a large series of forms (n>50), with inconsistent fields (data goes back to the 90's). Here is an example of a response:

{
"form12873": [

    {
        "id": "9202075838",
        "timestamp": "2015-06-25 10:24:51",
        "user_agent": "Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit\/600.6.3 (KHTML, like Gecko) Version\/8.0.6 Safari\/600.6.3",
        "remote_addr": "[Re.dact.ed]",
        "processed": "1",
        "data": {
            "33885124": {
                "field": "33885124",
                "value": "CDat Lab",
                "flat_value": "CDat Lab",
                "label": "Completed by:",
                "type": "select"
            },

            ''**Several more fields as above**''...

            "33884660": {
                "field": "33884660",
                "value": {
                    "slideX": "2456123",
                    "slideY": "456632",
                    "label": "K-20150322148",
                    "approved": "1",
                    "score": "30144"
                },
                "flat_value": "slideX = 2456123\nslideY = 456632\nlabel = K-20150322148\napproved = 1\nscore = 30144",
                "label": "Slide Stats:",
                "type": "slidestats"
            },

            ''**Some of the fields are as above...

            "31970564": {
                "field": "31970564",
                "value": [
                    "System",
                    "Crated",
                    "Mirax",
                    "NanoZoomer",
                    "ThinPrep",
                    "Aperio",
                    "Intellisite"

                ],
                "flat_value": "System\nCrated\nMirax\nNanoZoomer\nThinPrep\nAperio\nIntellisite",
                "label": "System Information",
                "type": "checkbox"
            },

            ''**Some of the values are Arrays...

            "33883781": {
                "field": "33883781",
                "selection": "Retain",
                "label": "4. Retain\/Remove\/Review",
                "type": "selectdrop"
            },

            ''**Some of the fields don't have the same children

            "52792890": {
                "field": "52792890",
                "image": "'A really large byte[], removed for ease of reading'",
                "type": "image"
            }

            ''**Somewhere near the end of each response is the actual image...
        }
    },

    {
        "id": "33884681",
            ''**Then it continues on as above until the end:
    }
], "total": 170, "pages": 5, "pretty_id": "478125624983" }

In the past when I've been able to model/class for the structure of the JSON, I've known how to handle it (make a data class with field, value, etc defined).

Trying solutions like:

var result = JsonConvert.DeserializeObject<List<Dictionary<string, 
                            Dictionary<string, string>>>>(content);

Always led to array errors or cast issues (even if direct casts were added). I am able to get the actual first array using:

    Public Shared Function Tabulate(json As String) As DataTable
    Dim jsonLinq = Newtonsoft.Json.Linq.JObject.Parse(json)

    ' Find the first array using Linq

    Dim srcArray = jsonLinq.Descendants().Where(Function(d) TypeOf d Is JArray).First()
    Dim trgArray = New Newtonsoft.Json.Linq.JArray()
    For Each row As JObject In srcArray.Children(Of JObject)()
        Dim cleanRow = New JObject()
        For Each column As JProperty In row.Properties()
            ' Only include JValue types
            If TypeOf column.Value Is JValue Then
                cleanRow.Add(column.Name, column.Value)
            End If
        Next

        trgArray.Add(cleanRow)
    Next


    Return JsonConvert.DeserializeObject(Of DataTable)(trgArray.ToString())
End Function

My end goal is also to get to a data table, and looping/the image byte have me concerned about trying to regressively go down to further children. My attempts to then deserialize using the first array have then come up nill.

If there is a quick way to handle this, I'd love the solution. If the problem is that I'm trying to handle crap JSON, I'd love a reference to where the current standard is being broken (so I can at least try to get the other institution to change their servers). That said, I'll probably have to deal with it anyway, even if it is loops.

*Note: The project was started in VB.net, so we've kept it that way, but I may decide to port to C# anyway. Code in either would be great.

Below is an unmarked example of the Json that should be usable for testing. My end goal is to flatten this into a datatable:

{
"form12873": [
    {
        "id": "9202075838",
        "timestamp": "2015-06-25 10:24:51",
        "user_agent": "Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit\/600.6.3 (KHTML, like Gecko) Version\/8.0.6 Safari\/600.6.3",
        "remote_addr": "[Re.dact.ed]",
        "processed": "1",
        "data": {
            "33885124": {
                "field": "33885124",
                "value": "CDat Lab",
                "flat_value": "CDat Lab",
                "label": "Completed by:",
                "type": "select"
            },
            "33884660": {
                "field": "33884660",
                "value": {
                    "slideX": "2456123",
                    "slideY": "456632",
                    "label": "K-20150322148",
                    "approved": "1",
                    "score": "30144"
                },
                "flat_value": "slideX = 2456123\nslideY = 456632\nlabel = K-20150322148\napproved = 1\nscore = 30144",
                "label": "Slide Stats:",
                "type": "slidestats"
            },
            "31970564": {
                "field": "31970564",
                "value": [
                    "System",
                    "Crated",
                    "Mirax",
                    "NanoZoomer",
                    "ThinPrep",
                    "Aperio",
                    "Intellisite"
                ],
                "flat_value": "System\nCrated\nMirax\nNanoZoomer\nThinPrep\nAperio\nIntellisite",
                "label": "System Information",
                "type": "checkbox"
            },



            "33883781": {
                "field": "33883781",
                "selection": "Retain",
                "label": "4. Retain\/Remove\/Review",
                "type": "select"
            }
        }
    }
], "total": 170, "pages": 5, "pretty_id": "478125624983" }
Community
  • 1
  • 1
Atl LED
  • 656
  • 2
  • 8
  • 31
  • maybe the accepted answer here will help? http://stackoverflow.com/questions/947241/how-do-i-create-dynamic-properties-in-c – Muckeypuck Mar 22 '17 at 19:11
  • @Muckeypuck Wouldn't that only work if all the children under each node where uniform? IE the items under "data" that didn't have the same number/types of properties? I have been trying to implement dynamic properties, but so far when I tried that linked solution, I've still failed at the deserialization. That may be due to my lack of understanding, so I will keep trying at it. – Atl LED Mar 22 '17 at 19:17
  • 3
    After several approaches, the best solution for me to read from poor/unpredictable JSON data was to parse it into a `JToken` object and use `.SelectTokens` and JSONPath to retrieve what I needed or discover it wasn't there, without my code crashing in the middle of that. Is that an option for you? – VBobCat Mar 22 '17 at 20:06
  • @VBobCat I had never heard of JSONPath before...how wonderful...playing with it now. – Atl LED Mar 22 '17 at 20:10
  • @VBobCat this would then require I know the field names for all the data I needed to actually retrieve, right? I could try to make that list, but it's a bit daunting given the size. If I'm misunderstanding, could you post up a solution for me to try? – Atl LED Mar 22 '17 at 20:19
  • I'd like to help you further, but it's not easy without knowing the details and having a valid chunk of data to play with. As far as I can tell, you should iterate through the children of your "data" token, and create a table with as many columns as the fields you can foresee, plus a string one for you to dump the "raw" Json of that branch. So you can give your data a first manageable structure, an then perform even more sophisticated treatment to those items you couldn't retrieve at the first time. This is, I guess, the way I'd try to deal with the problem. I hope it helps. – VBobCat Mar 22 '17 at 20:35
  • 1
    By the way, you can debug your JsonPath expressions [here](http://jsonpath.com/) – VBobCat Mar 22 '17 at 20:39
  • @VBobCat added a "raw" JSON that should be usable for testing. I'm still running into issues using JSONpath with nodes that end as arrays or strings depending (not a string array with 1 string, just returns string as part of 'value'). I don't know if this is enough to post an answer, but I'd hope that it is. – Atl LED Mar 28 '17 at 16:02
  • I can't do it right now, but will take a look on it ASAP. – VBobCat Mar 28 '17 at 16:03
  • @VBobCat Thanks. In essence I was playing around with `$.form12873[0].data[*].value`, but running into problems with the values that were further lists. – Atl LED Mar 28 '17 at 16:07

2 Answers2

1

It is possible to add DataColumns to a DataTable even when it already contains DataRows.

I don't do JSON much, but my general approach with dodgy XML is to decompose into a stream of key-value pairs where the key is the XPATH "address" and the value is the content of the node (excluding child nodes), then loop through the stream to build the DataTable. Maybe a similar approach can be taken here with JSONPath.

SSS
  • 4,807
  • 1
  • 23
  • 44
  • Could you put up an example of this, even if using XML? I'm familiarizing myself with JSONPath now. Right now I'm running into an issue going down past the 3rd node (ie from above, `form12873`->`data`->`value`->{values}) when the values are themselves an array. – Atl LED Mar 23 '17 at 13:13
  • I added a working JSON example to the end without my added markup, this should be usable for testing. I keep running into cannot deserialize errors, regardless of whether I deal with the node as an Jobject or Jarray. If you could make an example with the above string, I'd accept. – Atl LED Mar 28 '17 at 15:58
1

The ugly contraption below is able to do (roughly) what you want. Feed you json source string as parameter to DeserializeToDataTable and collect the result datatable. It worked on your sample. I can't guarantee it will work in the rest of your data. The purpose here is to provide a working start kit which you can study, understand, debug and adapt to suit your needs.

Private Function DeserializeToDataTable(ByVal jsource As String)
    Dim JRootObject = JObject.Parse(jsource)
    Dim Children = JRootObject.SelectTokens("$..data.*").ToArray
    Dim Records = Children.OfType(Of JObject).ToArray
    Dim dicList As New List(Of Dictionary(Of String, Object))
    For Each rec In Records
        dicList.Add(DeserializeToDictionary(rec))
    Next
    Dim fieldnames = dicList.SelectMany(Function(d) d.Keys).Distinct.ToArray
    Dim dt As New DataTable
    For Each fieldname In fieldnames
        dt.Columns.Add(fieldname, GetType(Object))
    Next
    Dim row As DataRow
    For Each dic In dicList
        row = dt.NewRow
        For Each kvp In dic
            row.SetField(kvp.Key, kvp.Value)
        Next
        dt.Rows.Add(row)
    Next
    Return dt
End Function

Private Function DeserializeToDictionary(ByVal json_object As JObject) As Dictionary(Of String, Object)
    Dim dic = New Dictionary(Of String, Object)
    For Each field In json_object.Properties
        Select Case field.Value.Type
            Case JTokenType.Array
                Dim subobject = New JObject
                Dim item = 0
                For Each token In field.Value
                    subobject("item" & item) = token
                    item += 1
                Next
                Dim subdic = DeserializeToDictionary(subobject)
                For Each kvp In subdic
                    dic(kvp.Key) = kvp.Value
                Next
            Case JTokenType.Boolean
                dic(field.Name) = field.Value.ToObject(Of Boolean)
            Case JTokenType.Bytes
                dic(field.Name) = field.Value.ToObject(Of Byte())
            Case JTokenType.Date
                dic(field.Name) = field.Value.ToObject(Of Date)
            Case JTokenType.Float
                dic(field.Name) = field.Value.ToObject(Of Double)
            Case JTokenType.Guid
                dic(field.Name) = field.Value.ToObject(Of Guid)
            Case JTokenType.Integer
                dic(field.Name) = field.Value.ToObject(Of Integer)
            Case JTokenType.Object
                Dim subdic = DeserializeToDictionary(field.Value)
                For Each kvp In subdic
                    dic(kvp.Key) = kvp.Value
                Next
            Case JTokenType.String
                Try
                    dic(field.Name) = field.Value.ToObject(Of String)
                Catch ex As Exception
                    dic(field.Name) = field.Value.ToObject(Of Object)
                End Try
            Case JTokenType.TimeSpan
                dic(field.Name) = field.Value.ToObject(Of TimeSpan)
            Case Else
                dic(field.Name) = field.Value.ToString
        End Select
    Next
    Return dic
End Function

You must be aware of this when using code above:

  1. It uses recursion to flatten a multi-branch structure. So,

    {
        "A":"aaaa",
        "B":"bbbb",
        "C":{
                "D":"dddd",
                "E":"eeee",
                "F":"ffff"
            }
        }
    }
    

    will become

    A   |B   |D   |E   |F
    ----+----+----+----+----
    aaaa|bbbb|dddd|eeee|ffff
    
  2. The way I did assumes there won't be duplications when flattening; should there be those, it will preserve just the last. So,

    {
        "A":"aaaa",
        "B":"bbbb",
        "C":{
                "D":"d1d1",
                "E":"e1e1",
                "F":"f1f1"
            },
        "G":{
                "D":"d2d2",
                "E":"e2e2",
                "F":"f2f2"
            }
        }
    }
    

    will become

    A   |B   |D   |E   |F
    ----+----+----+----+----
    aaaa|bbbb|d2d2|e2e2|f2f2
    

    which is an obvious flawed, buggy behavior, which will require a more sophisticated approach that I leave for you to build on my scratch.

VBobCat
  • 2,527
  • 4
  • 29
  • 56
  • I get where you're going, and was able to see out on this. Just to be clear though, this doesn't actually use `JsonPath` right? – Atl LED Mar 28 '17 at 23:50
  • It uses JsonPath once at the beginning. Your sample, however, made me think simple iterations could make it from the point we have a collection of record-like objects. Please keep in mind the limitations of my code, though. There's plenty of room for improvement. – VBobCat Mar 29 '17 at 00:01
  • Right, but it got me on the right track. The subsequent deserialize to dictionary was the real trick. Because I hate myself, I'm also playing around with trying to post to these servers, leading to this [question](http://stackoverflow.com/questions/43097829/posting-an-array-parameter-using-restsharp) if you have a moment. – Atl LED Mar 29 '17 at 15:29