1

I have some JSON files, all with identical structure (same keys everywhere, corresponding values might differ for some keys). I would like to collect the values associated with certain keys into lists and store those lists as the values associated with those keys in a new JSON file.

As example, consider these three files, where I'm interested in the key number_items and the corresponding values. First file —

[
  {
    "box_id": 1,
    "number_items": 4
  },
  {
    "box_id": 3,
    "number_items": 15
  },
  {
    "box_id": 6,
    "number_items": 2
  }
]

Second file —

[
  {
    "box_id": 1,
    "number_items": 7
  },
  {
    "box_id": 3,
    "number_items": 15
  },
  {
    "box_id": 6,
    "number_items": 4
  }
]

Third file —

[
  {
    "box_id": 1,
    "number_items": 5
  },
  {
    "box_id": 3,
    "number_items": 9
  },
  {
    "box_id": 6,
    "number_items": 0
  }
]

These should be merged into something that looks like this —

[
  {
    "box_id": 1,
    "number_items": [
      4,
      7,
      5
    ]
  },
  {
    "box_id": 3,
    "number_items": [
      15,
      15,
      9
    ]
  },
  {
    "box_id": 6,
    "number_items": [
      2,
      4,
      0
    ]
  }
]

Can this be done using jq? If not, what would be a good way to do this? Note that the actual scenario consists of 150+ files with 3 keys whose values I would like to merge into lists.

Ailurus
  • 731
  • 1
  • 10
  • 23

3 Answers3

2

You can merge files with similar structures by simply passing them all in as input. Their contents will be streamed in in the order they are in.

Then you could just read them in to a single array, group the objects by the box_id then map out the results.

$ jq -n '
    [inputs[]] | group_by(.box_id)
        | map({box_id:.[0].box_id, number_items:map(.number_items)})
' input{1,2,3}.json

produces:

[
  {
    "box_id": 1,
    "number_items": [
      4,
      7,
      5
    ]
  },
  {
    "box_id": 3,
    "number_items": [
      15,
      15,
      9
    ]
  },
  {
    "box_id": 6,
    "number_items": [
      4,
      2,
      0
    ]
  }
]

It seems the order isn't preserved when items are grouped on some platforms. In my case, running on the Windows 64-bit version produces this. So be aware of that if you want to use group_by. There are of course other approaches you could take if you want to avoid using this filter, but this is much more convenient to use.

Jeff Mercado
  • 129,526
  • 32
  • 251
  • 272
  • 1
    Jeff - When I run your program, I get [2,4,0] for the last number_items array, as expected. Did you really get [4,2,0]? – peak Jun 25 '17 at 03:25
  • Hmm, I copied the data as-is and the output I received. I would have expected `[2,4,0]` as well. It looks like it got reordered from the grouping, not what I would have expected. – Jeff Mercado Jun 25 '17 at 03:48
  • @JeffMercado Great. Quick follow-up question, how can I show the sum of number_items per file? Using `jq -s 'map(.[].number_items) | add' input{1,2,3}.json` returns `61`, the sum of all items in in all files. Of course I could run the command manually for each file individually and collect everything afterwards, but I wondered whether `jq` could do this for me? – Ailurus Jun 26 '17 at 10:08
  • (On Linux, I would use `for k in *.json; do jq -s 'map(.[].number_items) | add' $k; done` to do this) – Ailurus Jun 26 '17 at 11:16
  • 1
    @Ailurus: For the sums of each individual file, just don't combine them in the beginning. You can even get the name of the file currently being processed. I'd do: `jq '{file:input_filename, sum:map(.number_items)|add}' *.json`. – Jeff Mercado Jun 26 '17 at 14:36
1

I would like to collect the values associated with certain keys

Here is a solution which treats all keys, except for the grouping key, in the same way. It also handles missing keys gracefully and does NOT depend on the stability of jq's sort. The solution is based on a generic filter, merge/0, defined as follows:

# Combine an array of objects into a single object, ans, with array-valued keys,
# such that for every key, k, in the i-th object of the input array, a,
# ans[k][i] = a[i][k]
# null is used as padding if a value is missing.
# Example:
# [{a:1, b:2}, {b:3, c:4}] | merge
# produces:
# {"a":[1,null],"b":[2,3],"c":[null,4]}
def merge:
  def allkeys: map(keys) | add | unique;
  allkeys as $allkeys
  | reduce .[] as $in ({};
     reduce $allkeys[] as $k (.;
      . + {($k): (.[$k] + [$in[$k]]) } ));

The solution to the given problem can then be formulated as:

transpose | map(merge) | map( .box_id |= .[0] )

Invocation:

  jq -s -f merge.jq input{1,2,3}.json

Output: as shown in the question.

More robust solution

The above solution assumes uniformity of the ordering by box_id within each file. This assumption seems warranted by the OP requirements, but for safety and robustness, the objects would first be sorted:

map(sort_by(.box_id)) | transpose | map( merge | (.box_id |= .[0]) )

Note that this still assumes that there are no missing values of box_id in any of the input files.

Still more robust solution

If there is a possibility that some of the box_id values might be missing in any of the input files, then it would be appropriate to add the missing values. This can be done with the help of the following filter:

# Input: a matrix of objects (that is, an array of rows of objects),
#   each of which is assumed to have a distinguished field, f,
#   with distinct values on each row;
# Output: a rectangular matrix such that every row, r, of the output
#   matrix includes the elements of the corresponding row of the input
#   matrix, with additional elements as necessary so that (r |
#   map(.id) | sort) is the same for all rows r.
#
def rectanglize(f):
  def ids: [.[][] | f] | unique;
  def it: . as $in | {} | (f = $in);
  ids as $ids
  | map( . + ( $ids - [.[]|f] | map(it) ) )
;  

Putting everything together, the main pipeline becomes:

rectanglize(.id)
| map(sort_by(.box_id))
| transpose 
| map( merge | .box_id |= .[0] )
peak
  • 105,803
  • 17
  • 152
  • 177
  • Nice approach which is generally applicable and stable on top of that. Thanks! Nevertheless, Jeff's answer is somewhat easier to understand, straightforward to run and suffices for my simple data files. Therefore I'll mark his answer as the accepted one. – Ailurus Jun 26 '17 at 09:24
0

Depending on where you are trying to save this new file (local vs server), there are several different approaches. As far as I know, there is no possible way to save a file locally without using one of the available plugins (How to write data to a JSON file using Javascript). If you want to save it to a server, this is impossible with JavaScript, and best be done with a background language.

Here is a way to combine the content of several JSON files into your desired format.

// send json files you want combined, and a new file path and name (path/to/filename.json)
  function combineJsonFiles(files, newFileName) {
    var combinedJson = [];
    // iterate through each file 
    $.each(files, function(key, fileName) {
      // load json file
      // wait to combine until loaded. without this 'when().done()', boxes would return 'undefined'
      $.when(loadJsonFile(fileName)).done(function(boxes) {
        // combine json from file with combinedJson array
        combinedJson = combineJson(boxes, combinedJson);
        // check if this is the last file
        if (key == files.length-1) {
          // puts into json format
          combinedJson = JSON.stringify(combinedJson);
          // your json is now ready to be saved to a file
        }
      });
    });
  }

  function loadJsonFile(fileName) {
    return $.getJSON(fileName);
  }



function combineJson(boxes, combinedJson) {
  // iterate through each box 
  $.each(boxes, function(key, box) {
    // use grep to search if this box's id is already included
    var matches = $.grep(combinedJson, function(e) { return e.box_id == box.box_id; });

    // if there are no matches, add box to the combined file
    if (matches.length == 0) {

      var newBox = { box_id: box.box_id };

      // iterate through properties of box
      for (var property in box) {
        // check to ensure that properties are not inherited from base class
        if (box.hasOwnProperty(property)) {
          // will ignore if property is box_id
          if (property !== 'box_id') {
            // box is reformatted to make the property type into array
            newBox[property] = [box[property]];
          }
        }
      }
      combinedJson.push(newBox);
    } else {
      // select first match (there should never be more than one)
      var match = matches[0];

      // iterate through properties of box
      for (var property in box) {
        // check to ensure that properties are not inherited from base class
        if (box.hasOwnProperty(property)) {
          // will ignore if property is box_id
          if (property !== 'box_id') {
            // add property to the already existing box in the combined file
            match[property].push(box[property]);
          }
        }
      }
    }
  });
  return combinedJson;
}

  var jsonFiles = ['path/to/data.json', 'path/to/data2.json', 'path/to/data3.json'];

  combineJsonFiles(jsonFiles, 'combined_json.json');

The JSON output of this will look like:

[{"box_id":1,"number_items":[4,7,5]},{"box_id":3,"number_items":[15,15,9]},{"box_id":6,"number_items":[2,4,0]}]

Hope this helps!