
I'm working with the Adobe Analytics reporting/data warehouse API and need to process its data into rows and columns (I'm using Python and pandas for this). That works fine as long as only one dimension is queried, but when multiple dimensions are queried the idea of breakdowns comes into play.

For example, when I query 2 dimensions, the data (in JSON format) looks something like this:

{
    "report":{
        "data":[
            {
                "name":"June 13,2017",
                "breakdown":[
                    {
                        "name":"nl",
                        "breakdown":[
                            {
                                "name":"not logged in",
                                "counts":[
                                    "10",
                                    "12"
                                ]
                            },
                            {
                                "name":"logged in",
                                "counts":[
                                    "30",
                                    "2"
                                ]
                            }
                        ]
                    }, ... etc.

To elaborate: counts is an array of metric values, and every breakdown level represents a dimension (the numbers and names are fictional).

What I want to do here is build up rows of data that I can ultimately link together, so the first row of the example would be ("June 13,2017", "nl", "not logged in", 10, 12).

In the example situation I can already do that with the following code:

for period in dataObj['report']['data']: 
    for firstbreakdown in period['breakdown']:
        for secondbreakdown in firstbreakdown['breakdown']:
            print(period['name'], firstbreakdown['name'], secondbreakdown['name'], secondbreakdown['counts'])

Processing the counts is not a problem, as I've found I can easily do that with pandas. But here comes the tricky part, and my actual question: the number of nested breakdown levels in the JSON is determined by the number of dimensions that are queried.
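For reference, the pandas step I mean might look like the sketch below; the rows are the ones the nested loops produce, and the column names ("date", "country", "login", "visits", "orders") are just placeholders I made up:

```python
import pandas as pd

# rows as produced by the nested loops above (values are fictional)
rows = [
    ["June 13,2017", "nl", "not logged in", "10", "12"],
    ["June 13,2017", "nl", "logged in", "30", "2"],
]
df = pd.DataFrame(rows, columns=["date", "country", "login", "visits", "orders"])
# the counts arrive as strings in the JSON, so convert the metric columns
df[["visits", "orders"]] = df[["visits", "orders"]].apply(pd.to_numeric)
```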

How can I automatically, based on the number of dimensions queried, make enough loops to keep getting all the dimension values and counts in a row?

I would prefer an elegant solution that always works, rather than a chain of if/else statements.

dvdvoorn
  • Is recursion something you want to avoid? I.e. is the number of dimensions ever going to be significant, like in the hundreds and more? – Aurel Bílý Jul 09 '17 at 15:18
  • Yep recursion is the first idea that comes to mind -- this answer could provide inspiration: https://stackoverflow.com/questions/21028979/recursive-iteration-through-nested-json-for-specific-key-in-python – cmaher Jul 09 '17 at 15:22
  • I do not expect that to be a problem, as there are actually not more than 100-150 dimensions in the account and you'll never use all of them in one report – dvdvoorn Jul 09 '17 at 15:23

1 Answer


Recursive method

If you don't mind recursion (see below), one way to solve this is a recursive function that descends into deeper and deeper dimensions. For example:

def iterBD(dim, node):
    # node: a dict with a "name" and either a "breakdown" list or "counts"
    if dim > 0:
        return [[node["name"]] + row
                for sub in node["breakdown"]
                for row in iterBD(dim - 1, sub)]
    return [[node["name"]] + node["counts"]]

Then call it on each period of your example data, e.g. [row for period in dataObj['report']['data'] for row in iterBD(2, period)]. The result is a list of lists in the format:

["June 13,2017", "nl", "not logged in", "10", "12"]
["June 13,2017", "nl", "logged in", "30", "2"]
...

If you are dealing with heterogeneous data, i.e. rows with different numbers of dimensions, you could instead detect whether there are further dimensions to descend into. For example:

def iterBD2(node):
    # node is a dict parsed from JSON, so test for the key, not an attribute
    if "breakdown" in node:
        return [[node["name"]] + row
                for sub in node["breakdown"]
                for row in iterBD2(sub)]
    return [[node["name"]] + node["counts"]]

(Note that hasattr checks object attributes, not dictionary keys, so for parsed JSON the "breakdown" in node membership test is the right check; see How to know if an object has an attribute in Python if you are dealing with actual objects.)

This method is called per period, like iterBD2(period), without having to specify the number of dimensions.
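To make the behaviour concrete, here is a self-contained run on a sample that mirrors the JSON from the question (the function is repeated so the snippet runs on its own):

```python
def iterBD2(node):
    # descend while there is a deeper "breakdown" level, else emit the counts
    if "breakdown" in node:
        return [[node["name"]] + row
                for sub in node["breakdown"]
                for row in iterBD2(sub)]
    return [[node["name"]] + node["counts"]]

# one period of the example data from the question
period = {
    "name": "June 13,2017",
    "breakdown": [
        {"name": "nl",
         "breakdown": [
             {"name": "not logged in", "counts": ["10", "12"]},
             {"name": "logged in", "counts": ["30", "2"]},
         ]},
    ],
}

rows = iterBD2(period)
# rows == [["June 13,2017", "nl", "not logged in", "10", "12"],
#          ["June 13,2017", "nl", "logged in", "30", "2"]]
```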


When recursion could be bad

The above is a recursive method, meaning it relies on the data having relatively few dimensions. If a row had, say, thousands of nested dimensions, Python could hit its recursion limit. As with any recursive function, it can be rewritten as a non-recursive one. As you pointed out in your comment, the number of dimensions should never be that high in your specific case; and if it ever were, it might be time to reconsider the queries themselves to obtain better data.
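For completeness, such a non-recursive rewrite could use an explicit stack, as sketched below (same row format as iterBD2; reversed() keeps siblings in their original order despite the stack popping last-in-first-out):

```python
def iter_rows(root):
    """Flatten arbitrarily nested breakdowns without recursion."""
    rows = []
    stack = [(root, [])]  # (node, names accumulated so far)
    while stack:
        node, prefix = stack.pop()
        path = prefix + [node["name"]]
        if "breakdown" in node:
            # reversed() so siblings come out in original order from a LIFO stack
            stack.extend((sub, path) for sub in reversed(node["breakdown"]))
        else:
            rows.append(path + node["counts"])
    return rows
```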

Aurel Bílý