
Background

I'm importing a JSON file with a schema like the following:

{
    "000001": {
        "USA": {
            "requests": 1234,
            "RTT": 40
        },
        "Japan": {
            "requests": 10,
            "RTT": 200
        }
    },
    "000002": {
        "USA": {
            "requests": 4321,
            "RTT": 50
        },
        "Japan": {
            "requests": 10,
            "RTT": 150
        }
    }
}
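
For reference, `json_string` below is just the raw file contents, read in with the standard json module (stats.json is a placeholder name):

>>> import json
>>> with open('stats.json') as f:  # placeholder file name
...     json_string = f.read()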

After importing, I have two things I wish to do with this data:

  1. Perform a lot of analysis and math, which NumPy would be great at
  2. Dump the data to a database

Upon dumping it to the database, I expect a schema like the following:

+--------+---------+----------+-------+
|  time* | country*| requests |  RTT  |
+--------+---------+----------+-------+
| 000001 | USA     |   1234   | 40ms  |
| 000001 | Japan   |    10    | 200ms |
| 000002 | USA     |   4321   | 50ms  |
| 000002 | Japan   |    10    | 150ms |
|  ...   |   ...   |    ...   |  ...  |
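
Expressed in, say, SQLite, the target table might look like this (a sketch; the file and table names are placeholders, and the starred columns form the composite key):

import sqlite3

conn = sqlite3.connect('stats.db')  # placeholder database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS stats (
        time     TEXT,
        country  TEXT,
        requests INTEGER,
        rtt      INTEGER,            -- milliseconds
        PRIMARY KEY (time, country)  -- the starred columns above
    )
""")
conn.commit()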

Problem

Currently to import the data, I'm doing the following:

>>> import json
>>> import numpy as np
>>> import flatdict as fd
>>> flat = fd.FlatDict(json.loads(json_string), delimiter=':')
>>> np.array([key.split(':') + [value] for key, value in flat.items()])
array([[u'000001', u'USA', u'requests', u'1234'],
       [u'000001', u'USA', u'RTT', u'40'],
       [u'000001', u'Japan', u'requests', u'10'],
       ...,
       [u'000002', u'USA', u'RTT', u'50'],
       [u'000002', u'Japan', u'requests', u'10'],
       [u'000002', u'Japan', u'RTT', u'150']],
      dtype='<U31')

The problem here is that this creates a unique row for each leaf value, rather than one row per (time, country) pair. I'm willing to write my own function to replace flatdict if necessary, but the question is: can NumPy have tuples / arrays / objects for values? Ideally I would want an output like the following:

>>> np.array(data)
array([[u'000001', u'USA', (1234, 40)],
       [u'000001', u'Japan', (10, 200)],
       [u'000002', u'USA', (4321, 50)],
       [u'000002', u'Japan', (10, 150)]],
      dtype='...')

However, I'd be willing to accept:

>>> np.array(data)
array([[u'000001', u'USA', 1234, 40],
       [u'000001', u'Japan', 10, 200],
       [u'000002', u'USA', 4321, 50],
       [u'000002', u'Japan', 10, 150]],
      dtype='...')

Alternate Acceptable Solution

For step one (performing all of the analysis) I can use the NumPy array I have now. Something like array[:,"USA","requests"] would, conceptually, return the number of requests in USA for all time.
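
In practice that lookup can be approximated with boolean masks; a sketch, assuming `flat` is the FlatDict from the snippet above:

>>> arr = np.array([key.split(':') + [value] for key, value in flat.items()])
>>> mask = (arr[:, 1] == u'USA') & (arr[:, 2] == u'requests')
>>> arr[mask][:, 3]  # the requests column for USA, still as strings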

It's for step two that I need to get it into a different format. If there's a way to take:

array([[u'000001', u'USA', u'requests', u'1234'],
       [u'000001', u'USA', u'RTT', u'40'],
       [u'000001', u'Japan', u'requests', u'10'],
       ...,
       [u'000002', u'USA', u'RTT', u'50'],
       [u'000002', u'Japan', u'requests', u'10'],
       [u'000002', u'Japan', u'RTT', u'150']],
      dtype='<U31')

and get it into the form:

[[u'000001', u'USA', 1234, 40],
 [u'000001', u'Japan', 10, 200],
 [u'000002', u'USA', 4321, 50],
 [u'000002', u'Japan', 10, 150]]

for the sake of dumping it into a database, that would be fine, too.
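
To sketch what I mean by "get it into the form" (assuming arr is the flat (N, 4) string array above), the reshaping would be something like:

from collections import OrderedDict

# group the flat (time, country, key, value) rows by (time, country)
grouped = OrderedDict()
for time, country, key, value in arr:
    grouped.setdefault((time, country), {})[key] = int(value)

# one row per (time, country), with requests and RTT as columns
wide = [[time, country, kv['requests'], kv['RTT']]
        for (time, country), kv in grouped.items()]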

stevendesu
  • Technically, they can, but something like Pandas dataframes is likely to be a better option. – user2357112 Aug 29 '17 at 17:23
  • @user2357112 could you post an answer explaining how to use Pandas to solve this issue? I'm not familiar with Pandas. Up until recently I have had minimal experience with Python. – stevendesu Aug 29 '17 at 17:26
  • You can have Python objects as values, but that forces you to use `object` `dtype` arrays, which are pretty much worthless when it comes to efficiency. Rather, if you want to maintain efficiency, you *could* look into using [structured arrays](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). However, this requires a fixed size for your unicode types. Care would have to be taken to choose a size that allows you to accept any possible string, without making it so large that your struct becomes memory-inefficient. – juanpa.arrivillaga Aug 29 '17 at 17:41
  • in the JSON string at the top, there should not be a comma here: `"RTT": 200,` – tony_tiger Aug 29 '17 at 17:47
  • @tony_tiger Good catch. Comma removed. – stevendesu Aug 29 '17 at 17:59

1 Answer


The problem is in how you flatten the dictionary. I don't know much about flatdict, but since you have a dictionary of three levels, you can just use a list comprehension to flatten it (here d is the parsed JSON, i.e. d = json.loads(json_string)):

[[k, kk, d[k][kk].get('requests'), d[k][kk].get('RTT')] for k in d for kk in d[k]]

#[['000002', 'USA', 4321, 50],
# ['000002', 'Japan', 10, 150],
# ['000001', 'USA', 1234, 40],
# ['000001', 'Japan', 10, 200]]

And for non-homogeneous data, if you need to do analysis in memory, pandas is usually easier to use than numpy; you can convert this to a data frame like this:

lst = [[k, kk, d[k][kk].get('requests'), d[k][kk].get('RTT')] for k in d for kk in d[k]]

import pandas as pd
pd.DataFrame(lst, columns=['time', 'country', 'requests', 'RTT'])

     time country  requests  RTT
0  000002     USA      4321   50
1  000002   Japan        10  150
2  000001     USA      1234   40
3  000001   Japan        10  200
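
If the end goal is loading into a database, the data frame can also be written out directly; a sketch assuming SQLite, with placeholder file and table names:

import sqlite3

with sqlite3.connect('stats.db') as conn:  # placeholder database file
    df = pd.DataFrame(lst, columns=['time', 'country', 'requests', 'RTT'])
    df.to_sql('stats', conn, if_exists='append', index=False)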

Psidom
  • Thank you. This perfectly solves my original problem, and so I'm marking it as correct. Although someone mentioned Pandas and I'm now running off to look into that, because it might be a more appropriate solution to my original business need. – stevendesu Aug 29 '17 at 17:33
  • Thank you for updating with the pandas solution, as well. Even with a DataFrame I still have no idea how to use it for aggregation / analysis, so I've got some reading up to do :D – stevendesu Aug 29 '17 at 17:35
  • 3
    `pandas` is good if you want to do analysis after imported; If you just need to load it into a data base, you wouldn't have to use it. – Psidom Aug 29 '17 at 17:35
  • 1
    Does it matter that `time` entries for "000001" appear after entries for "000002"? If so, then load the data as an OrderedDict, then use @Psidom flattening code. https://stackoverflow.com/questions/6921699/can-i-get-json-to-load-into-an-ordereddict-in-python – tony_tiger Aug 29 '17 at 17:54