
How do I use jq to convert an arbitrary JSON array of objects to CSV, when the objects in the array are nested?

StackOverflow has a sea of questions/answers where specific input or output fields are referenced, but I'd like to have a generic solution that

  1. includes a header row,
  2. works for any JSON input including nested arrays + objects,
  3. allows records that have missing values for keys that are present in other records,
  4. does not hard-code any field names,
  5. allows converting the CSV back into the nested JSON structure if needed, and
  6. uses key paths as header names (see the following description).

Dot notation

Many JSON-using products (like CouchDB, MongoDB, …) and libraries (like Lodash, …) use variations of a syntax that allows access to nested property values / subfields by joining key fragments with a separator character, often a dot (‘dot notation’).

An example of a key path like this would be "a.b.0.c" to refer to the deeply nested property in this JSON snippet:

{
  "a": {
    "b": [
      {
        "c": 123,
      }
    ]
  }
}
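
For reference, jq itself represents such key paths as arrays. As a minimal sketch (my own, not part of the question), a dot-notation string can be translated into a jq path and read with getpath:

    # "a.b.0.c" -> ["a","b",0,"c"]; numeric fragments become array indices
    jq '("a.b.0.c" | split(".") | map(tonumber? // .)) as $path | getpath($path)'
    # applied to the snippet above, this prints 123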

Caveat: Using this method is a pragmatic solution for most cases, but it means that either dot characters have to be banned in property names, or a more complex escaping scheme (one guaranteed never to appear in real property names) has to be invented for escaping dots in property names / accessing nested fields. MongoDB simply banned the use of "." in documents until v5.0; some libraries have workarounds for field access (Lodash example).

Despite this, for simplicity, a solution should use the described dot syntax in the CSV output’s header for nested properties. Bonus if there is a solution variant that solves this problem, e.g. with JSONPath.

Example JSON array as input

[
    {
        "a": {
            "b": [
                {
                    "c": 123
                }
            ]
        }
    },
    {
        "a": {
            "b": [
                {
                    "c": "foo \" bar",
                    "d": "qux"
                }
            ]
        }
    },
    {
        "a": {
            "b": [
                {
                    "d": 456
                }
            ]
        }
    }
]

Example CSV output

The output should have a header that includes all fields (even if the first object in the array does not have defined values for all existing key paths).

To make the output intuitively editable by humans, each row should represent one object in the input array.

The expected output should look like this:

"a.b.0.c","a.b.0.d"
123,
"foo "" bar","qux"
,456

Command line

This is what I need:

cat example.json | jq <MISSING CODE HERE>
    What is the point of converting JSON to CSV if you're going to convert it back to JSON? – oguz ismail Sep 18 '21 at 05:04
  • CSV is a horrible format for many reasons, especially its non-standardized use. Still, there is a use for CSVs: People who are less familiar with programming (but know Excel or Google Sheets) know how to import/export them. You can, for example, import such a CSV in Google Sheets, use it as a low-tech admin UI, and use its API to sync the data back. While there are better processes, sometimes "give somebody a CSV of my data and convert it back when they are done with editing it" is the process with the least friction to get a job done quickly. – opyh Sep 18 '21 at 07:40
  • I see. How do you distinguish nulls and booleans from JSON strings when converting back from CSV, though? – oguz ismail Sep 18 '21 at 09:22
  • For the record, some closely related answers are at https://stackoverflow.com/questions/57242240/jq-object-cannot-be-csv-formatted-only-array – peak Sep 18 '21 at 20:36
  • That's a good objection. In my case, I know the JSON's schema, and have a library that can cast values back correctly. For other use cases, keeping type info in the output might be very important. – opyh Sep 20 '21 at 13:05

2 Answers


Solution 1, using dot notation

Here is the jq call to convert your array of nested JSON objects to CSV:

jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]'

The fastest way to try this solution out is to use JQPlay.

The CSV output will have a header row. It will contain all properties that exist anywhere in the input objects, including nested ones, in dot notation. Each input array element will be represented as a single row; properties that are missing will be represented as empty CSV fields.

Using solution 1 in bash or a similar shell

  • Create the JSON input file…

    echo '[{"a": {"b": [{"c": 123}]}},{"a": {"b": [{"c": "foo \" bar","d": "qux"}]}},{"a": {"b": [{"d": 456}]}}]' > example.json
    
  • Then use this jq command to output the CSV on the standard output:

    cat example.json | jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]'
    
  • …or write the output to example.csv:

    cat example.json | jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]' > example.csv
    

Converting the data from solution 1 back to JSON

Here is a Node.js example that you can try on RunKit. It converts a CSV generated with the method in solution 1 back to an array of nested JSON objects.
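
If you prefer to stay in jq for the inverse direction, here is a rough sketch of my own. It assumes the CSV has already been parsed into an array of JSON arrays, header row first (e.g. by a proper CSV parser; parsing quoted CSV inside jq itself is error-prone):

    # Hypothetical input: [["a.b.0.c","a.b.0.d"], [123,null], ...]
    .[0] as $hdr
    | ($hdr | map(split(".") | map(tonumber? // .))) as $cols   # header cells -> path arrays
    | .[1:]                                                     # data rows
    | map(. as $row
        | reduce range(0; $cols | length) as $i (null;
            if $row[$i] == null then .                          # skip missing fields
            else setpath($cols[$i]; $row[$i]) end))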

Explanation for solution 1

Here is a longer, commented version of the jq filter.

# 1) Find all unique leaf property names of all objects in the input array. Each nested property name is an array with the components of its key path, for example ["a", 0, "b"].
(. | map(leaf_paths) | unique) as $cols |

# 2) Use the found key paths to determine all (nested) property values in the given input records.
map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows |

  # 3) Create the raw output array of rows. Each row is represented as an array of values, one element per existing column.
  (

    # 3.1) This represents the header row. Key paths are generated here.
    [($cols | map(. | map(tostring) | join(".")))]

    + # 3.2) concatenate the header row with all other rows
    $rows

  )

  # 4) Convert each row to an escaped CSV string.
  | map(@csv)

  # 5) Output each array element directly. Without this, the result would be a JSON array of CSV strings.
  | .[]
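
To see what step 1 yields on the example input, you can run it in isolation (my own illustration):

    $ jq -c 'map(leaf_paths) | unique' example.json
    [["a","b",0,"c"],["a","b",0,"d"]]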

Solution 2: for input that does have dots in property names

If you do need to support dot characters in property names, you can either use a different separator string for the key path syntax (replace the "." with something else), or replace the map(tostring) | join(".") part with tostring. The latter renders each key path as a JSON array string that you can use directly as a key path, so no dot notation is needed. Here is a JQPlay with this solution variant.

Full jq command:

jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | tostring))] + $rows) | map(@csv) | .[]'

The output CSV for this variant would then look like this; it's less readable and not useful for cases where you want humans to intuitively understand the CSV's header:

"[""a"",""b"",0,""c""]","[""a"",""b"",0,""d""]"
123,
"foo "" bar","qux"
,456

See below for an idea of how to convert this format back to a representation in your programming language.

Bonus: Converting the generated CSV back to JSON

If the input's property names contain no ".", it’s simple to convert the CSV back to JSON, for example with a library that supports dot notation, or with JSONPath.

  • JavaScript: Use Lodash's _.set()
  • Other languages: Find a package/library that implements JSONPath and use selectors like $.a.b.0.c or $['a']['b'][0]['c'] to set each nested property of each record.

Solution 2 (with JSON arrays as headers) allows you to interpret the headers as JSON array strings. Then you can generate a JSONPath from each header, and re-create all records/objects:

"[""a"",""b"",0,""c""]" (CSV)

["a","b",0,"c"] (array of key-path components after unescaping and parsing as JSON)

$.["a"]["b"][0]["c"] (JSONPath)

{ a: { b: [{c: … }] } } (Nested regenerated object)

I've written an example Node.js script to convert a CSV like this back to JSON. You can try solution 2 in RunKit.
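
In jq terms, the only change compared to the solution-1 reconstruction sketch above is how the header cells are turned into paths: each cell is already valid JSON, so fromjson replaces the dot-splitting (again a sketch of mine, with $hdr as the parsed header row):

    ($hdr | map(fromjson)) as $cols   # "[\"a\",\"b\",0,\"c\"]" -> ["a","b",0,"c"]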

  • Thanks for this idea, you seem to have seen edge cases from hell! :D I thought that escaping with tocsv / unescaping with a CSV parser would handle any character in a JSON property name - Could you elaborate where in the chain nul/newline characters would break the process? – opyh Sep 20 '21 at 12:52
  • The test cases in my post can be used to help ascertain whether any particular approach you have in mind is flawed. – peak Sep 20 '21 at 14:22
  • I've run a test and have been able to successfully convert the CSV from solution 2 back to JSON, including a `\u0000` and a `\n` character in a property key in the input: [jq input](https://jqplay.org/s/IRtS8SG3wC) / [Node.js script that parses the file](https://runkit.com/opyh/614899a33b003f00084c670b) – opyh Sep 20 '21 at 14:29
  • Looks like you've found a good inverse function! – peak Sep 20 '21 at 15:07
  • You should take a look at using the jq streaming functions, should simplify the implementation drastically. – Jeff Mercado Sep 20 '21 at 17:51
  • Your tocsv function seems to fail when a null occurs, e.g. `[{"a:": {"x": null}, "b": "null"}]` (see https://jqplay.org/s/sspX56-OKz) – peak Sep 20 '21 at 21:05
  • p.s. Assuming your jq is sufficiently up to date, you could use this: `def all_leaf_paths: paths as $p | select( getpath($p) | type | IN("array", "object") | not ) | $p;` – peak Sep 20 '21 at 21:18

The following tocsv and fromcsv functions provide a solution to the stated problem except for one complication regarding requirement (6) concerning the headers. Essentially, this requirement can be met using the functions given here by adding a matrix transposition step.

Whether or not a transposition step is added, the advantage of the approach taken here is that there are no restrictions on the JSON keys or values. In particular, they may contain periods (dots), newlines and/or NUL characters.

In the example, an array of objects is given, but in fact any stream of valid JSON documents could be used as input to tocsv; thanks to the magic of jq, the original stream will be recreated by fromcsv (in the sense of entity-by-entity equality).
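
For instance (my own illustration, using the tocsv and fromcsv functions defined below), a two-document stream round-trips the same way:

    printf '{"a":1}\n[true,"x"]\n' | tocsv | fromcsv
    # prints the two original documents back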

Of course, since there is no CSV standard, the CSV produced by the tocsv function might not be understood by all CSV processors. In particular, please note that the tocsv function defined here maps embedded newlines in JSON strings or key names to the two-character string "\n" (i.e., a literal backslash followed by the letter "n"); the inverse operation performs the inverse translation to meet the "round-trip" requirement.

(The use of tail is just to simplify the presentation; it would be trivial to modify the solution to make it a jq-only one.)

The CSV is generated on the assumption that any value can be included in a field so long as (a) the field is quoted, and (b) double-quotes within the field are doubled.

Any generic solution that supports "round-trips" is bound to be somewhat complicated. The main reason why the solution presented here is more complex than one might expect is because a third column is added, partly to make it easy to distinguish between integers and integer-valued strings, but mainly because it makes it easy to distinguish between the size-1 and size-2 arrays produced by jq's --stream option. Needless to say, there are other ways these issues could be addressed; the number of calls to jq could also be reduced.
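
For context, here is what jq's --stream option emits (my own illustration): two-element [path, value] arrays for leaves, and one-element [path] arrays that close objects and arrays. In the CSV below, the stringp column is null exactly for those one-element close events, which is how fromcsv tells the two apart:

    $ echo '{"a":[1,"1"]}' | jq -c --stream .
    [["a",0],1]
    [["a",1],"1"]
    [["a",1]]
    [["a"]]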

The solution is presented as a test script that checks the round-trip requirement on a telling test case:

#!/bin/bash

function json {
    cat<<EOF
[
  {
    "a": 1,
    "b": [
      1,
      2,
      "1"
    ],
    "c": "d\",ef",
    "embed\"ed": "quote",
    "null": null,
    "string": "null",
    "control characters": "a\u0000c",
    "newline": "a\nb"
  },
  {
    "x": 1
  }
]
EOF
}

function tocsv {
 jq -ncr --stream '
   (["path", "value", "stringp"],
    (inputs | . + [.[1]|type=="string"]))
   | map( tostring|gsub("\"";"\"\"") | gsub("\n"; "\\n"))
   | "\"\(.[0])\",\"\(.[1])\",\(.[2])" 
'
}

function fromcsv { 
    tail -n +2 | # first duplicate backslashes and deduplicate double-quotes
    jq -rR '"[\(gsub("\\\\";"\\\\") | gsub("\"\"";"\\\"") ) ]"' |
    jq -c '.[2] as $s 
           | .[0] |= fromjson 
           | .[1] |= if $s then . else fromjson end 
           | if $s == null then [.[0]] else .[:-1] end
             # handle newlines
           | map(if type == "string" then gsub("\\\\n";"\n") else . end)' |
    jq -n 'fromstream(inputs)'
}    

# Check the roundtrip:
json | tocsv | fromcsv | jq -s '.[0] == .[1]' - <(json)

Here is the CSV that would be produced by json | tocsv, except that SO seems to disallow literal NULs, so I have replaced that by \0:

"path","value",stringp
"[0,""a""]","1",false
"[0,""b"",0]","1",false
"[0,""b"",1]","2",false
"[0,""b"",2]","1",true
"[0,""b"",2]","false",null
"[0,""c""]","d"",ef",true
"[0,""embed\""ed""]","quote",true
"[0,""null""]","null",false
"[0,""string""]","null",true
"[0,""control characters""]","a\0c",true
"[0,""newline""]","a\nb",true
"[0,""newline""]","false",null
"[1,""x""]","1",false
"[1,""x""]","false",null
"[1]","false",null
  • Nice! I'd accept the answer, but the output CSV represents only one object, not an array of objects, as stated in the question. For my use case, it's about handing database content to somebody else, who can edit it manually. IMO, the most intuitive way to represent a list of database records as a table is row-based. – opyh Sep 20 '21 at 12:57
  • Actually, `tocsv` can be used with a stream of JSON entities as input, and `fromcsv` will reconstruct the stream. Thus if the JSON to be processed is an array, you could operate on `.[]` instead of `.`, and then for the inverse process, reconstitute the array (e.g. using the -s option). Since this really only makes sense if the top-level entities of the array are highly conformal, whereas the emphasis in the question seemed to be on arbitrary JSON, I thought I'd keep things simple. – peak Sep 20 '21 at 14:17
  • p.s. The Q talks about a single array of objects as input, whereas `tocsv` will work with any (valid) JSON stream, including a single array of objects. – peak Sep 20 '21 at 14:24
  • Thanks for your input - apparently my question was not worded well enough. I've added a sentence to clarify *why* the requested CSV format (see last part of the question) uses one row per input document (it's meant to be consumed/edited by humans, likely in a spreadsheet editor). Really appreciate your thoughtful solution though, might be handy for similar use cases! – opyh Sep 20 '21 at 14:41
  • I've modified the example so that it now involves an array of objects, and have added a comment that the `tocsv` and `fromcsv` functions can be used on an arbitrary stream of JSON documents, with invertibility built-in. – peak Sep 20 '21 at 14:58