
For example: I have a program that generates usage logs in a JSON file. The log file repeats the same key, "activity", many times, like the following:

  "probe": "PROCESS_PROBE",
  "status": "ProcessCreated",
  "processName": "backgroundTaskHost.exe",
  "path": "C:\\WINDOWS\\system32\\backgroundTaskHost.exe",
  "creationClassName": "Win32_Process",
  "handle": "21632",
  "priority": "Normal",
  "commandLine": "\"C:\\WINDOWS\\system32\\backgroundTaskHost.exe\" -ServerName:CortanaUI.AppXy7vb4pc2dr3kc93kfc509b1d0arkfb2x.mca",
  "handleCount": 236,
  "processId": 21632,
  "parentProcessId": 112,
  "pageFileUsage": 4244,
  "creationDate": "20200410172922.614702+120",
  "annotations": {
    "userName": "datta",
    "timeSinceStartup": 259878750,
    "ticksOfEvent": 637221365629757593
  }
},
"activity":{
  "probe": "PROCESS_PROBE",
  "status": "ProcessDeleted",
  "processName": "RuntimeBroker.exe",
  "path": "C:\\Windows\\System32\\RuntimeBroker.exe",
  "creationClassName": "Win32_Process",
  "handle": "8504",
  "priority": "Normal",
  "handleCount": 285,
  "processId": 8504,
  "parentProcessId": 112,
  "pageFileUsage": 3180,
  "creationDate": "20200410172757.934567+120",
  "terminationDate": null,
  "annotations": {
    "userName": "datta",
    "timeSinceStartup": 259883953,
    "ticksOfEvent": 637221365681937472
  }
},
"activity":{
  "probe": "FILERESOURCE_PROBE",
  "status": "Changed",
  "path": "C:\\Users\\datta\\eclipse\\jee-2019-12",
  "entityName": "eclipse",
  "extension": "",
  "attributes": "Directory",
  "owner": "null",
  "length": 0,
  "isReadOnly": false,
  "creationTime": "2020-01-17T09:42:08.5092897+01:00",
  "lastWriteTime": "2020-03-25T10:56:10.7382329+01:00",
  "lastAccessTime": "2020-04-10T17:29:29.9811767+02:00",
  "annotations": {
    "userName": "datta",
    "timeSinceStartup": 259885750,
    "ticksOfEvent": 637221365699837331
  }
},
"activity":{
  "probe": "FILERESOURCE_PROBE",
  "status": "Changed",
  "path": "C:\\Users\\datta\\eclipse",
  "entityName": "jee-2019-12",
  "extension": "",
  "attributes": "Directory",
  "owner": "null",
  "length": 0,
  "isReadOnly": false,
  "creationTime": "2020-01-17T09:42:08.5083+01:00",
  "lastWriteTime": "2020-01-17T09:42:08.5092897+01:00",
  "lastAccessTime": "2020-04-10T17:29:29.9801436+02:00",
  "annotations": {
    "userName": "datta",
    "timeSinceStartup": 259885750,
    "ticksOfEvent": 637221365699906960
  }
},
"activity":{
  "probe": "FILERESOURCE_PROBE",
  "status": "Changed",
  "path": "C:\\Users\\datta",
  "entityName": "eclipse",
  "extension": "",
  "attributes": "Directory",
  "owner": "null",
  "length": 0,
  "isReadOnly": false,
  "creationTime": "2020-01-17T09:42:08.5083+01:00",
  "lastWriteTime": "2020-01-17T09:42:08.5083+01:00",
  "lastAccessTime": "2020-04-10T17:29:29.9922013+02:00",
  "annotations": {
    "userName": "datta",
    "timeSinceStartup": 259885765,
    "ticksOfEvent": 637221365699922013
  }
}
}

I would like to load the data inside the activity keys as columns of a dataframe. For example, each activity would be a row in the dataframe and the columns would be "probe", "status", "processName", etc.

The problem is that when I load the data using `logData = json.load(logfile)`, only the last activity key is loaded, since the earlier ones are overwritten because of the duplication. I tried loading the data using `logData = json.load(logfile, object_pairs_hook=tuple)`, but that loads the data as one huge tuple. I am not sure how to get the dataframe that I am trying to build. Thanks in advance.

1 Answer


See "Does JSON syntax allow duplicate keys in an object?"

The problem here is not in JSON, but in the destination structure you are using. Python's json module decodes JSON objects into dictionaries by default, and a dictionary cannot hold duplicate properties (keys), so only the last value of each repeated key survives.
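As a minimal sketch of how the dictionary limitation could be worked around, the `object_pairs_hook` parameter of `json.loads()` receives every object as a list of (key, value) pairs before anything is overwritten. The hook name `keep_duplicates` and the shortened `sample` text below are illustrative assumptions, not part of the original log:

```python
import json

def keep_duplicates(pairs):
    """Hypothetical hook: group the values of repeated keys into lists."""
    keys = [key for key, _ in pairs]
    if len(keys) == len(set(keys)):
        # No repeated keys: behave like the default decoder.
        return dict(pairs)
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped

# Shortened stand-in for the log file; assumes the file is one JSON object.
sample = """
{
  "activity": {"probe": "PROCESS_PROBE", "status": "ProcessCreated",
               "annotations": {"userName": "datta"}},
  "activity": {"probe": "FILERESOURCE_PROBE", "status": "Changed",
               "annotations": {"userName": "datta"}}
}
"""

log_data = json.loads(sample, object_pairs_hook=keep_duplicates)
records = log_data["activity"]
if isinstance(records, dict):
    # A file with a single "activity" has no duplicates, so wrap it.
    records = [records]
```

Each entry in `records` is then one activity as a plain dictionary, so something like `pandas.DataFrame(records)` or `pandas.json_normalize(records)` (the latter to flatten the nested "annotations") should give one row per activity. This only applies if the file parses as a single JSON object in the first place.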

The real problem here lies with the producer of this JSON. It would have been extremely easy to make a list of records, or even a list of dictionaries ("activity" being the only key in each). For their own reasons, the producer chose to create this structure, which is (barely) legal, but cannot be processed by most JSON consumers (I know that Python, PHP, and, most importantly, JavaScript would all stumble on this).

It is also quite safe to assume that the program generating the file you are trying to read did not generate it via a JSON package (at least not the entire file). It probably generates chunks of text and appends them to a stream.

Amitai Irron
  • Thank you for your feedback. I understand what you are saying, but I have no control over the program that generates these JSON files; I am only working with the output data and trying to process these JSON files. I would highly appreciate it, though, if you could tell me how to solve my situation. –  May 05 '20 at 11:37
  • @Indrajeet: A regular JSON package would not solve this problem. You would have to write your own code that reads the separate "activity" keys and their data, wraps them in curly braces, and then uses `json.loads()` to process each one into a dictionary. – Amitai Irron May 05 '20 at 11:47
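
The hand-written reader described in the last comment might look roughly like the sketch below. It is a variation on that idea rather than the exact steps: instead of wrapping each chunk in curly braces, it uses `json.JSONDecoder.raw_decode()` to parse the object that follows each "activity" key. The function name `iter_activities` and the shortened `sample` string are assumptions made for illustration:

```python
import json
import re

# Matches the key and the colon, so the next character starts the value.
ACTIVITY_KEY = re.compile(r'"activity"\s*:\s*')

def iter_activities(text):
    """Hypothetical helper: yield the value of every "activity" key in text."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        match = ACTIVITY_KEY.search(text, pos)
        if match is None:
            break
        # raw_decode parses one JSON value starting at the given index and
        # returns it together with the index where that value ended.
        record, pos = decoder.raw_decode(text, match.end())
        yield record

# Shortened stand-in for the log file contents.
sample = (
    '{"activity":{"probe": "PROCESS_PROBE", "processId": 21632},\n'
    '"activity":{"probe": "FILERESOURCE_PROBE", "length": 0}}'
)

records = list(iter_activities(sample))
```

Because the search resumes after each decoded object, the word "activity" appearing inside an activity's own data is skipped. The sketch does not require the surrounding file to be valid JSON as a whole, which may help if the producer really does append chunks of text to a stream.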