1

So this is a problem that I have no idea where to even start so even just a pointer in the right direction would be great.

So I have data that looks like so:

data = {
   "agg": {
      "agg1": [
         {
            "keyWeWant": "*-20.0",
            "asdf": 0,
            "asdf": 20,
            "asdf": 14,
            "some_nested_agg": [
               {
                  "keyWeWant2": 20,
                  "to": 25,
                  "doc_count": 4,
                  "some_nested_agg2": {
                     "count": 7,
                     "min": 2,
                     "max": 5,
                     "keyWeWant3": 2.857142857142857,
                     "sum": 20
                  }
               },
               {
                  "keyWeWant2": 25,
                  "to": 30,
                  "doc_count": 10,
                  "some_nested_agg2": {
                     "count": 16,
                     "min": 2,
                     "max": 10,
                     "keyWeWant3": 6.375,
                     "sum": 102
                  }
               }
            ]
         },
         {
         ...
         },
         {
         ...
         },
         ...
      ]
   }
}

Now from the example, within 'agg' there are N 'agg1' results, within each 'agg1' result there is a 'keyWeWant'. Each 'agg1' result also has a list of 'some_nested_agg' results which each contain a 'keyWeWant2'. Each 'keyWeWant2' value is associated a single 'keyWeWant' value somewhere up in the hierarchy. Similarly each 'keyWeWant2' also contains a set of results for 'some_nested_agg2' (not a list but rather a map this time). Each of the set of results contains a 'keyWeWant3'.

Now I want to flatten this structure while still preserving the association between 'keyWeWant', 'keyWeWant2', and 'keyWeWant3' (I'm essentially de-normalizing) to get something like so:

What I want the function to look like:

[
   {
      "keyWeWant" : "*-20",
      "keyWeWant2" : 20,
      "keyWeWant3" : 2.857142857142857
   },
   {
      "keyWeWant" : "*-20",
      "keyWeWant2" : 25,
      "keyWeWant3" : 6.375
   },
   {
   ...
   },
   {
   ...
   }
]

This is an example where there is only depth 3 but there could be arbitrary depth with some nested values being lists and some being arrays/list.

What I would like to do is write a function to take in the keys I want and where to find them, and then go get the keys and denormalize.

Something that looks like:

function_name(data_map, {
   "keyWeWant" : ['agg', 'agg1'],
   "keyWeWant2" : ['agg', 'agg1', 'some_nested_agg'],
   "keyWeWant" : ['agg', 'agg1', 'some_nested_agg', 'some_nested_agg2']
})

Any ideas? I'm familiar with Java, Clojure, Java-script, and Python and am just looking for a way to solve this that's relatively simple.

Abhinav Ramakrishnan
  • 1,070
  • 12
  • 22
  • FYI - that's a javascript object, not a `JSON structure` - JSON is a string notation of a javascript object - you're not dealing with that in this code – Jaromanda X Sep 24 '16 at 05:40
  • `What I want the function to look like` - that's not a function, that's another javascript object – Jaromanda X Sep 24 '16 at 05:42
  • I see your point, but this is technically dealing with what's returned from elasticsearch which is a JSON. Now after parsing the datait becomes an Object in JS, or a dict in Python, a Map in Java or a hash-map in clojure. Regarding the 'what I want a function to look like' I was talking about the functions inputs. I want it to take in the data (as what ever representation in what ever language) and a map of the keys I want and where to find them. I actually wrote this up in Python and was just providing an example of how I'd like to call the function. – Abhinav Ramakrishnan Sep 24 '16 at 06:04

2 Answers2

2

Here is a JavaScript (ES6) function you could use:

function flatten(data, keys) {
    var key = keys[0];
    if (key in data)
        keys = keys.slice(1);
    var res = keys.length && Object.keys(data)
        .map( key => data[key] )
        .filter( val => Object(val) === val )
        .reduce( (res, val) => res.concat(flatten(val, keys)), []);
    return !(key in data) ? res
        : (res || [{}]).map ( obj => Object.assign(obj, { [key]: data[key] }) );
}

// Sample data
var data = {
   "agg": {
      "agg1": [
         {
            "keyWeWant": "*-20.0",
            "asdf": 0,
            "asdf": 20,
            "asdf": 14,
            "some_nested_agg": [
               {
                  "keyWeWant2": 20,
                  "to": 25,
                  "doc_count": 4,
                  "some_nested_agg2": {
                     "count": 7,
                     "min": 2,
                     "max": 5,
                     "keyWeWant3": 2.857142857142857,
                     "sum": 20
                  }
               },
               {
                  "keyWeWant2": 25,
                  "to": 30,
                  "doc_count": 10,
                  "some_nested_agg2": {
                     "count": 16,
                     "min": 2,
                     "max": 10,
                     "keyWeWant3": 6.375,
                     "sum": 102
                  }
               }
            ]
         },
      ]
   }
};

// Flatten it by array of keys
var res = flatten(data, ['keyWeWant', 'keyWeWant2', 'keyWeWant3']);

// Output result
console.log(res);

Alternative using paths

As noted in comments, the above code does not use path information; it just looks in all arrays. This could be an issue if the keys being looked for also occur in paths that should be ignored.

The following alternative will use path information, which should be passed as an array of sub-arrays, where each sub-array first lists the path keys, and as last element the value key to be retained:

function flatten(data, [path, ...paths]) {
    return path && (
        Array.isArray(data)
            ? data.reduce( (res, item) => res.concat(flatten(item, arguments[1])), [] )
            : path[0] in data && (
                path.length > 1 
                    ? flatten(data[path[0]], [path.slice(1), ...paths])
                    : (flatten(data, paths) || [{}]).map ( 
                        item => Object.assign(item, { [path[0]]: data[path[0]] }) 
                    )
            )
    );
}

// Sample data
var data = {
   "agg": {
      "agg1": [
         {
            "keyWeWant": "*-20.0",
            "asdf": 0,
            "asdf": 20,
            "asdf": 14,
            "some_nested_agg": [
               {
                  "keyWeWant2": 20,
                  "to": 25,
                  "doc_count": 4,
                  "some_nested_agg2": {
                     "count": 7,
                     "min": 2,
                     "max": 5,
                     "keyWeWant3": 2.857142857142857,
                     "sum": 20
                  }
               },
               {
                  "keyWeWant2": 25,
                  "to": 30,
                  "doc_count": 10,
                  "some_nested_agg2": {
                     "count": 16,
                     "min": 2,
                     "max": 10,
                     "keyWeWant3": 6.375,
                     "sum": 102
                  }
               }
            ]
         },
      ]
   }
};

// Flatten it by array of keys
var res = flatten(data, [
    ['agg', 'agg1', 'keyWeWant'], 
    ['some_nested_agg', 'keyWeWant2'], 
    ['some_nested_agg2', 'keyWeWant3']]);

// Output result
console.log(res);
trincot
  • 317,000
  • 35
  • 244
  • 286
  • While this is a really cool succinct implementation (I love the syntactic sugar that ECMAScript 6 provides) it ignores the path altogether. The problem there is that there could be multiple 'keysWeWant' under different paths. We only want those in one path and not those in others. – Abhinav Ramakrishnan Sep 24 '16 at 23:37
  • Yes, if the keys are also in other parts of the data, but should be ignored, then this solution is too simplistic. I have added an alternative function to my answer which does take as argument the paths, and only gathers the information from there. – trincot Sep 25 '16 at 09:21
  • @trincot, this is a wonderful solution. I have a question, though, about your second approach with paths: In the scenario that I want to name a chosen key *differently* in the **output**, does it make sense that `flatten()` would accept an object instead of array of arrays? This way, such an object passed to `flatten()` 2nd parameter will have the format of `{newNameofMyChoice: ['agg', 'agg1', 'keyWeWant'], someOtherNameIdecided: ['some_nested_agg', 'keyWeWant2']}`. Is that tweak possible? – Emman Apr 28 '22 at 10:42
  • @Emman, sure that is possible. – trincot Apr 28 '22 at 11:28
1

There is probably a better way to solve this particular problem (using some ElasticSearch library or something), but here's a solution in Clojure using your requested input and output data formats.

I placed this test data in a file called data.json:

{
    "agg": {
        "agg1": [
            {
                "keyWeWant": "*-20.0",
                "asdf": 0,
                "asdf": 20,
                "asdf": 14,
                "some_nested_agg": [
                    {
                        "keyWeWant2": 20,
                        "to": 25,
                        "doc_count": 4,
                        "some_nested_agg2": {
                            "count": 7,
                            "min": 2,
                            "max": 5,
                            "keyWeWant3": 2.857142857142857,
                            "sum": 20
                        }
                    },
                    {
                        "keyWeWant2": 25,
                        "to": 30,
                        "doc_count": 10,
                        "some_nested_agg2": {
                            "count": 16,
                            "min": 2,
                            "max": 10,
                            "keyWeWant3": 6.375,
                            "sum": 102
                        }
                    }]
            }]}
}

Then Cheshire JSON library parses the data to a Clojure data structure:

(use '[cheshire.core :as cheshire])

(def my-data (-> "data.json" slurp cheshire/parse-string))

Next the paths to get are defined as follows:

(def my-data-map
  {"keyWeWant"  ["agg", "agg1"],
   "keyWeWant2" ["agg", "agg1", "some_nested_agg"],
   "keyWeWant3" ["agg", "agg1", "some_nested_agg", "some_nested_agg2"]})

It is your data_map above without ":", single quotes changed to double quotes and the last "keyWeWant" changed to "keyWeWant3".

find-nested below has the semantics of Clojure's get-in, only then it works on maps with vectors, and returns all values instead of one. When find-nested is given a search vector it finds all values in a nested map where some values can consist of a vector with a list of maps. Every map in the vector is checked.

(defn find-nested
  "Finds all values in a coll consisting of maps and vectors.

  All values are returned in a tree structure:
  i.e, in your problem it returns (20 25) if you call it with
  (find-nested ['agg', 'agg1', 'some_nested_agg', 'keyWeWant2'] 
  my-data).

  Returns nil if not found."
  [ks c]
  (let [k (first ks)]
    (cond (nil? k)    c
          (map? c)    (find-nested (rest ks) (get c k))
          (vector? c) (if-let [e (-> c first (get k))]
                        (if (string? e) e ; do not map over chars in str
                            (map (partial find-nested (rest ks)) e))
                        (find-nested ks (into [] (rest c)))) ; create vec again
          :else       nil)))

find-nested finds the values for a search path:

(find-nested ["agg", "agg1", "some_nested_agg", "keyWeWant2"] my-data) 
; => (20 25)

If all the paths towards the "keyWeWant's are mapped over my-data these are the slices of a tree:

(*-20.0
(20 25)
(2.857142857142857 6.375))

The structure you ask for (all end results with paths getting there) can be obtained from this tree in function-name like this:

(defn function-name
  "Transforms data d by finding (nested keys) via data-map m in d and 
  flattening the structure."
  [d m]
  (let [tree               (map #(find-nested (conj (second %) (first %)) d) m)
        leaves             (last tree)
        leaf-indices       (range (count leaves))
        results            (for [index leaf-indices]
                             (map (fn [slice]
                                    (if (string? slice)
                                      slice
                                      (loop [node (nth slice index)]
                                        (if node
                                          node
                                          (recur (nth slice (dec index)))))))
                                  tree))
        results-with-paths (mapv #(zipmap (keys m) %) results)
        json               (cheshire/encode results-with-paths)]
    json))

results uses a loop to step back if a leaf-index is larger than that particular slice. I think it will work out for deeper nested structures as well -if a next slice is always double the size of a previous slice or the same size it should work out -, but I have not tested it.

Calling (function-name my-data my-data-map) leads to a JSON string in your requested format:

[{
     "keyWeWant": "-20.0",
     "keyWeWant2": 20,
     "keyWeWant3": 2.857142857142857 }
 {
     "keyWeWant": "
-20.0",
     "keyWeWant2" 25,
     "keyWeWant3" 6.375 }]

/edit I see you were looking for a relatively simple solution, that this is not. :-) maybe there is one without having it available in a library. I would be glad to find out how it can be simplified.

user2609980
  • 10,264
  • 15
  • 74
  • 143
  • Your clojure skills are insane! This is actually pretty cool and I'd definitely have to try it out. Your code actually gave me an idea for an implementation leveraging multimethods and passing context along. Thanks :) – Abhinav Ramakrishnan Sep 24 '16 at 23:43
  • 1
    @AbhinavRamakrishnan another approach for finding multiple nested values in nested structures is to use [specter](https://github.com/nathanmarz/specter). The method method `specter/select` can be used with the directives `specter/ALL` to navigate into a vector, `specter/VAL` to capture the path and `selected?` to find nodes by a predicate. – user2609980 Oct 11 '16 at 16:53