56

I'm trying to convert a CSV file into JSON.

Here are two sample lines:

-21.3214077;55.4851413;Ruizia cordata
-21.3213078;55.4849803;Cossinia pinnata

I would like to get something like:

"occurrences": [
                 {
                "position": [-21.3214077, 55.4851413],
                "taxo": {
                    "espece": "Ruizia cordata"
                 },
                 ...
             }]

Here is my script:

    echo '"occurences": [ '

cat se.csv | while read -r line
  do
      IFS=';' read -r -a array <<< $line;
      echo -n -e '{ "position": [' ${array[0]}
      echo -n -e ',' ${array[1]} ']'
      echo -e ', "taxo": {"espece":"' ${array[2]} '"'
done
echo "]";

I get really strange results:

   "occurences": [ 
 ""position": [ -21.3214077, 55.4851413 ], "taxo": {"espece":" Ruizia cordata
 ""position": [ -21.3213078, 55.4849803 ], "taxo": {"espece":" Cossinia pinnata

What is wrong with my code?

– HydrUra
  • Is there a reason you aren't using `jq`, or otherwise JSON-aware tooling? Generating JSON (or other structured-data serializations) with string concatenation is... error-prone at best. – Charles Duffy Jun 27 '17 at 12:52
  • Other than that, there are a *lot* of bad practices here. You're leaving out a bunch of essential quoting, such that if one of your values contained a whitespace-surrounded `*` it would be replaced with a list of filenames. You're using `-e` unnecessarily, so that if your JSON data contained a `\n` sequence (which, in JSON, should be represented as exactly those characters) it would be replaced with a literal newline; **and** by virtue of `-e` your code won't work correctly on shells with a [POSIX-standard `echo`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/echo.html). – Charles Duffy Jun 27 '17 at 12:54
  • re: the missing quoting, by the way -- consider making a habit of running your code through http://shellcheck.net/. – Charles Duffy Jun 27 '17 at 13:14

13 Answers

77

Here's a Python one-liner/script that'll do the trick:

cat my.csv | python -c 'import csv, json, sys; print(json.dumps([dict(r) for r in csv.DictReader(sys.stdin)]))'
– jstaab
  • This should be the accepted solution -- it does the trick for any and all payloads. The jq version is a one-off and requires painstakingly matching the schema. – btk Mar 23 '21 at 20:58
  • I love this idea. I modified slightly to write to file... `cat in.csv | python -c 'import csv, json, sys; f = open("out.json", "x"); f.write(json.dumps([dict(r) for r in csv.DictReader(sys.stdin)])); f.close()'` – Abraham Labkovsky Jun 10 '21 at 19:16
  • to write it to a file just add `| > filename.json` at the end. Like this: `cat my.csv | python -c 'import csv, json, sys; print(json.dumps([dict(r) for r in csv.DictReader(sys.stdin)]))' | > my.json` – K14 Nov 24 '21 at 18:52
  • @btk, no, it shouldn't ... it doesn't address the need to name entities in the json output at all. It *may* do the right thing if the CSV has named headers (I don't have the data or time to create it to verify that it would). – tink Nov 30 '21 at 18:42
  • @tink it works when all headers are present **only**. – Fravadona Nov 30 '21 at 23:48
  • thanks for proving my point that this **shouldn't** be the accepted answer then, @Fravadona :) – tink Dec 01 '21 at 00:11
  • @tink imo it's way easier to add headers to the .csv file than futz around with a complicated jq query – btk Dec 29 '21 at 20:08
  • I don't think the `dict` is needed and you could just do `list(csv.DictReader(sys.stdin))` instead. – Lauren Yim Mar 04 '22 at 22:55
  • @K14 somehow adding the output part gave me a "Broken pipe" error. – Chloe Sun Apr 06 '22 at 19:16
  • Doesn't support CSVs with `"` in them. – Mint Nov 24 '22 at 01:52
  • TBH there are times when I got pretty fed up with futzing around with `jq` @btk, not lying – NeilG May 12 '23 at 05:56
  • With bigger files you want a "lazy" version: `python -c 'import csv, json, sys; any(map(print, (json.dumps(dict(r)) for r in csv.DictReader(sys.stdin))))'`. It uses generators so it does not build the whole list, but it prints JSONL output instead of a single JSON array. – gdforj May 31 '23 at 14:13
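
As the comments note, the one-liner assumes the CSV has a header row. For the OP's header-less, semicolon-separated data, a hedged adaptation (a sketch only, assuming exactly three well-formed columns per line) might be:

python -c '
import csv, json, sys
rows = csv.reader(sys.stdin, delimiter=";")
print(json.dumps({"occurrences": [
    {"position": [float(lat), float(lon)], "taxo": {"espece": sp}}
    for lat, lon, sp in rows]}))
' < se.csv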
38

The right tool for this job is jq.

jq -Rsn '
  {"occurrences":
    [inputs
     | . / "\n"
     | (.[] | select(length > 0) | . / ";") as $input
     | {"position": [$input[0], $input[1]], "taxo": {"espece": $input[2]}}]}
' <se.csv

emits, given your input:

{
  "occurences": [
    {
      "position": [
        "-21.3214077",
        "55.4851413"
      ],
      "taxo": {
        "espece": "Ruizia cordata"
      }
    },
    {
      "position": [
        "-21.3213078",
        "55.4849803"
      ],
      "taxo": {
        "espece": "Cossinia pinnata"
      }
    }
  ]
}

By the way, a less-buggy version of your original script might look like:

#!/usr/bin/env bash

items=( )
while IFS=';' read -r lat long pos _; do
  printf -v item '{ "position": [%s, %s], "taxo": {"espece": "%s"}}' "$lat" "$long" "$pos"
  items+=( "$item" )
done <se.csv

IFS=','
printf '{"occurrences": [%s]}\n' "${items[*]}"

Note:

  • There's absolutely no point using cat to pipe into a loop (and good reasons not to); thus, we're using a redirection (<) to open the file directly as the loop's stdin.
  • read can be passed a list of destination variables; there's thus no need to read into an array (or first to read into a string, and then to generate a here-string and read from that into an array). The _ at the end ensures that extra columns are discarded (by putting them into the dummy variable named _) rather than appended to pos.
  • "${array[*]}" generates a string by concatenating elements of array with the character in IFS; we can thus use this to ensure that commas are present in the output only when they're needed.
  • printf is used in preference to echo, as advised in the APPLICATION USAGE section of the specification for echo itself.
  • This is still inherently buggy since it's generating JSON via string concatenation. Don't use it.
– Charles Duffy
21

The accepted answer uses jq to parse the input. This works, but jq doesn't handle CSV escaping; i.e., input from a CSV produced by Excel or similar tools may be quoted like this:

foo,"bar,baz",gaz

will result in incorrect output, as jq will see 4 fields, not 3.

One option is to use tab-separated values instead of comma (as long as your input data doesn't contain tabs!), along with the accepted answer.

Another option is to combine your tools, and use the best tool for each part: a CSV parser for reading the input and turning it into JSON, and jq for transforming the JSON into the target format.

The Python-based csvkit will intelligently parse the CSV, and comes with a tool, csvjson, which does a much better job of turning the CSV into JSON. This can then be piped through jq to convert csvkit's flat JSON output into the target form.

With the data provided by the OP, producing the desired output is as simple as:

csvjson --no-header-row se.csv |
  jq '.[] | {occurrences: [{ position: [.a, .b], taxo: {espece: .c}}]}'

Note that csvjson automatically detects ; as the delimiter, and without a header row in the input, assigns the json keys as a, b, and c.
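
For illustration, the intermediate JSON that csvjson emits for the OP's two rows would look roughly like this (a sketch based on the key names described above; whether values come out typed as numbers or strings may vary with the csvkit version):

[{"a": -21.3214077, "b": 55.4851413, "c": "Ruizia cordata"},
 {"a": -21.3213078, "b": 55.4849803, "c": "Cossinia pinnata"}]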

The same approach also works in the other direction -- csvkit can read a JSON array or newline-delimited JSON and intelligently output a CSV, via in2csv.

– Raman
10

John Kerl's Miller tool has this built-in:

mlr --c2j --jlistwrap cat INPUT.csv > OUTPUT.json
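
Note that the command above assumes a comma-separated file with a header row. For the OP's header-less, semicolon-separated data, a hedged sketch (flags per Miller's documentation, untested here) might be:

mlr --icsv --ifs ';' --implicit-csv-header --ojson cat se.csv > OUTPUT.json

With --implicit-csv-header, Miller labels the fields 1, 2, and 3; the label verb (shown in another answer below) can rename them.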
– richardkmiller
  • I love 'jq' but this is really nice, at least for converting a column-header-carrying CSV to JSON. @richardkmiller – Nirmalya Sep 02 '21 at 05:41
6

Here is an article on the subject: https://infiniteundo.com/post/99336704013/convert-csv-to-json-with-jq

It also uses jq, but takes a slightly different approach, using split() and map().

jq --slurp --raw-input \
   'split("\n") | .[1:] | map(split(";")) |
      map({
         "position": [.[0], .[1]],
         "taxo": {
             "espece": .[2]
          }
      })' \
  input.csv > output.json

It doesn't handle separator escaping, though.

– Ondra Žižka
  • This is a good general approach! Perhaps you might edit to work with the OP's data structure? (Or would a 3rd-party edit doing so be welcome?) – Charles Duffy Aug 30 '18 at 15:06
  • @CharlesDuffy, I gave it a shot, but not tested - feel free to fix/improve. – Ondra Žižka Aug 30 '18 at 23:12
  • Needed some minor tweaks -- changing from ',' to ';' as the separator, changing `".[3]"` to `.[2]`; and `--raw-output` wasn't serving any purpose (it's ignored when output isn't a string). – Charles Duffy Aug 30 '18 at 23:31
  • Also, the `.[1:]` (skipping the first line) is only appropriate if input has a header; that was true in the blog post, but I'm not sure it's true here. – Charles Duffy Sep 10 '18 at 12:33
  • @CharlesDuffy, how would you parse the headers and then build the map automatically? Imagine you have different CSV files with different columns and want the JSON object keys derived from the header. Does `jq` have some kind of variables? Or perhaps an extra call to `jq ... .[:1]` to fill a Bash array, somehow? – Ondra Žižka Sep 27 '18 at 00:11
  • Yes, jq does have variables. – Charles Duffy Sep 27 '18 at 00:17
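
As a hedged sketch of that header-driven idea (assuming a single-character delimiter, no quoting, and that the first line of input.csv holds the column names):

jq -R -s '
  split("\n") | map(select(length > 0) | split(";"))
  | .[0] as $headers
  | .[1:]
  | map([$headers, .] | transpose | map({(.[0]): .[1]}) | add)
' input.csv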
4

Here is a Ruby one-liner solution:

ruby -r json -r csv -e 'puts CSV.parse(STDIN, headers:true).map(&:to_h).to_json' < INPUT.csv 
– Tsuneo Yoshioka
2

In general, if your jq has the inputs built-in filter (available since jq 1.5), then it is better to use it rather than the -s command-line option.

Here in any case is a solution using inputs. This solution is also variable-free.

{"occurrences":
  [inputs
   | select(length > 0)
   | . / ";"
   | {"position": [.[0], .[1]], 
      "taxo": {"espece": .[2]}} ]}

SSV, CSV, and all that

The above of course assumes that the file has semicolon-separated fields in each line, and that there are none of the complications associated with CSV files.

If the input has fields that are strictly delimited by a single character, then jq should have no problems handling it. Otherwise, it might be best to use a tool that can reliably convert to the TSV (tab-separated value) format, which jq can handle directly.
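
For example, once the data is in TSV form, the same splitting pattern applies with a tab instead (a minimal sketch; data.tsv is a hypothetical file):

jq -n -R '[inputs / "\t"]' data.tsv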

– peak
2

Here is another way using Miller (mlr):

mlr --implicit-csv-header --icsv --fs ';' --ojson label position,taxo,espece example.csv

which will produce the following:

[
{
  "position": -21.3214077,
  "taxo": 55.4851413,
  "espece": "Ruizia cordata"
},
{
  "position": -21.3213078,
  "taxo": 55.4849803,
  "espece": "Cossinia pinnata"
}
]

or, if you use --ojsonl (JSON Lines) instead of --ojson, it will generate the following:

{"position": -21.3214077, "taxo": 55.4851413, "espece": "Ruizia cordata"}
{"position": -21.3213078, "taxo": 55.4849803, "espece": "Cossinia pinnata"}
1

Because the jq solution does not handle CSV escaping, column names in the first line, commented-out lines, and other common CSV "features", I have extended the CSV Cruncher tool to allow reading CSV and writing it as JSON. It's not exactly "Bash", but neither is jq :)

It's primarily a CSV-as-SQL processing app, so it's not completely trivial, but here is the trick:

./crunch -in myfile.csv -out output.csv --json -sql 'SELECT * FROM myfile'

It also allows output as JSON object per line or proper JSON array. See the documentation.

It's in beta quality, so all feedback or pull requests are welcome.

– Ondra Žižka
1

For completeness' sake, Xidel together with some XQuery magic can do this too:

xidel -s input.csv --xquery '
  {
    "occurrences":for $x in tokenize($raw,"\n") let $a:=tokenize($x,";") return {
      "position":[
        $a[1],
        $a[2]
      ],
      "taxo":{
        "espece":$a[3]
      }
    }
  }
'
{
  "occurrences": [
    {
      "position": ["-21.3214077", "55.4851413"],
      "taxo": {
        "espece": "Ruizia cordata"
      }
    },
    {
      "position": ["-21.3213078", "55.4849803"],
      "taxo": {
        "espece": "Cossinia pinnata"
      }
    }
  ]
}
– Reino
1

If you want to go crazy, you can write a parser using jq. Here's my implementation, which can be thought of as the inverse of the @csv filter. Throw this into your .jq file.

# Apply `update` only when `pred` holds; otherwise pass the input through unchanged.
def do_if(pred; update):
    if pred then update else . end;
# Character-by-character state machine. The reduce accumulator is a 4-tuple:
# [current state, codepoints of the value being read, values of the current row, completed rows].
def _parse_delimited($_delim; $_quot; $_nl; $_skip):
    [($_delim, $_quot, $_nl, $_skip)|explode[]] as [$delim, $quot, $nl, $skip] |
    [0,1,2,3,4,5] as [$s_start,$s_next_value,$s_read_value,$s_read_quoted,$s_escape,$s_final] |
    def _append($arr; $value):
        $arr + [$value];
    def _do_start($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; [""])]
        elif $c == $delim then
            [$s_next_value, null, [""], .[3]]
        elif $c == $quot then
            [$s_read_quoted, [], [], .[3]]
        else
            [$s_read_value, [$c], [], .[3]]
        end;
    def _do_next_value($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; _append(.[2]; ""))]
        elif $c == $delim then
            [$s_next_value, null, _append(.[2]; ""), .[3]]
        elif $c == $quot then
            [$s_read_quoted, [], .[2], .[3]]
        else
            [$s_read_value, [$c], .[2], .[3]]
        end;
    def _do_read_value($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif $c == $delim then
            [$s_next_value, null, _append(.[2]; .[1]|implode), .[3]]
        else
            [$s_read_value, _append(.[1]; $c), .[2], .[3]]
        end;
    def _do_read_quoted($c):
        if $c == $quot then
            [$s_escape, .[1], .[2], .[3]]
        else
            [$s_read_quoted, _append(.[1]; $c), .[2], .[3]]
        end;
    def _do_escape($c):
        if $c == $nl then
            [$s_start, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif $c == $delim then
            [$s_next_value, null, _append(.[2]; .[1]|implode), .[3]]
        else
            [$s_read_quoted, _append(.[1]; $c), .[2], .[3]]
        end;
    def _do_final($c):
        .;
    def _do_finalize:
        if .[0] == $s_start then
            [$s_final, null, null, .[3]]
        elif .[0] == $s_next_value then
            [$s_final, null, null, _append(.[3]; [""])]
        elif .[0] == $s_read_value then
            [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif .[0] == $s_read_quoted then
            [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        elif .[0] == $s_escape then
            [$s_final, null, null, _append(.[3]; _append(.[2]; .[1]|implode))]
        else # .[0] == $s_final
            .
        end;
    reduce explode[] as $c (
        [$s_start,null,null,[]];
        do_if($c != $skip;
            if .[0] == $s_start then
                _do_start($c)
            elif .[0] == $s_next_value then
                _do_next_value($c)
            elif .[0] == $s_read_value then
                _do_read_value($c)
            elif .[0] == $s_read_quoted then
                _do_read_quoted($c)
            elif .[0] == $s_escape then
                _do_escape($c)
            else # .[0] == $s_final
                _do_final($c)
            end
        )
    )
    | _do_finalize[3][];
def parse_delimited($delim; $quot; $nl; $skip):
    _parse_delimited($delim; $quot; $nl; $skip);
def parse_delimited($delim; $quot; $nl):
    parse_delimited($delim; $quot; $nl; "\r");
def parse_delimited($delim; $quot):
    parse_delimited($delim; $quot; "\n");
def parse_delimited($delim):
    parse_delimited($delim; "\"");
def parse_csv:
    parse_delimited(",");

For your data, you would want to change the delimiter to semicolons.

$ cat se.csv
-21.3214077;55.4851413;Ruizia cordata
-21.3213078;55.4849803;Cossinia pinnata
$ jq -R 'parse_delimited(";")' se.csv
[
  "-21.3214077",
  "55.4851413",
  "Ruizia cordata"
]
[
  "-21.3213078",
  "55.4849803",
  "Cossinia pinnata"
]

This will work fine for most inputs, parsing a line at a time; but if your data has literal newlines, you will want to read the entire file as a string:

$ cat input.csv
Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
$ jq -Rs 'parse_csv' input.csv
[
  "Year",
  "Make",
  "Model",
  "Description",
  "Price"
]
[
  "1997",
  "Ford",
  "E350",
  "ac, abs, moon",
  "3000.00"
]
[
  "1999",
  "Chevy",
  "Venture \"Extended Edition\"",
  "",
  "4900.00"
]
[
  "1999",
  "Chevy",
  "Venture \"Extended Edition, Very Large\"",
  "",
  "5000.00"
]
[
  "1996",
  "Jeep",
  "Grand Cherokee",
  "MUST SELL!\nair, moon roof, loaded",
  "4799.00"
]
– Jeff Mercado
1

A version of jstaab's answer that avoids storing the whole data structure in memory before printing, in case the CSV file is large:

import csv, json, sys

for r in csv.DictReader(sys.stdin):
    print(json.dumps(dict(r)))

Note this outputs JSON lines format and not exactly JSON.

Here's a version which outputs a proper JSON array, at the expense of being longer.

import csv, json, sys

sys.stdout.write('[')

first = True
for r in csv.DictReader(sys.stdin):
    if not first:
        sys.stdout.write(',')
    first = False
    json.dump(dict(r), sys.stdout)

sys.stdout.write(']')
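
Usage might look like this (assuming the script is saved under a hypothetical name such as csv2json.py):

python3 csv2json.py < in.csv > out.json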
0

For anyone looking for a PHP one-liner solution:

php -r 'echo json_encode(array_map("str_getcsv", file("file.csv")));'

If you need to specify a different delimiter:

php -r 'echo json_encode(array_map(fn($line) => str_getcsv($line, ";"), file("file.csv")));'

You might also use some flags with json_encode:

https://www.php.net/manual/pt_BR/function.json-encode.php

– Richard