38

If you have a csv dataset like this:

name, age, gender
john, 20, male
jane, 30, female
bob, 25, male

Can you get to this:

[ {"name": "john", "age": 20, "gender": "male"},
  {"name": "jane", "age": 30, "gender": "female"},
  {"name": "bob", "age": 25, "gender": "male"} ]

using only jq?

I found this article which shows what I'm trying to do, but it uses a 'manual' mapping of the header fields to the values. I don't need/want to rename the header fields and have quite a few of them. I would also not want to have to change a script/command every time the layout changes.

Is it possible to dynamically extract the headers and then combine them with the values with a jq one-liner?

dvlsg
  • 5,378
  • 2
  • 29
  • 34
jpl1079
  • 481
  • 1
  • 5
  • 6
  • 2
    What you're asking to do makes no sense. `jq` is a tool that takes json as input and generates an output. csv is not json. You cannot expect this tool to process that, it's not what it was made for. You need to use a tool or other scripting language that can process csv. – Jeff Mercado Apr 18 '15 at 21:52
  • 1
    rather than trying to force `jq` in to this, it could very well be done with an almost trivial shell/sed script, e.g. based on http://stackoverflow.com/questions/4286469/how-to-have-bash-parse-a-csv-file – Hans Z. Apr 19 '15 at 21:22
  • 5
    I am the author of the article referenced by the OP. While the example I provided can be useful for quick-n-dirty jq projects, it isn't very robust. There are some nice tools for working with CSV and I'd recommend one of those instead, for instance: http://johnkerl.org/miller/doc/ (like jq for CSV) or this NPM package https://www.npmjs.com/package/csv2json or this gem https://rubygems.org/gems/csv2json/versions/0.3.0 – Noah Sussman Oct 14 '15 at 15:57

9 Answers9

41

In short - yes, except maybe for the one-liner bit.

jq is often well-suited to text wrangling, and this is especially true of versions with regex support. With regex support, for example, the trimming required by the given problem statement is trivial.

Since jq 1.5rc1 includes regex support and has been available since Jan 1, 2015, the following program assumes a version of jq 1.5; if you wish to make it work with jq 1.4, then see the two "For jq 1.4" comments.

Please also note that this program does not handle CSV in all its generality and complexity. (For a similar approach that does handle CSV more generally, see https://github.com/stedolan/jq/wiki/Cookbook#convert-a-csv-file-with-headers-to-json)

# objectify/1 takes an array of string values as inputs, converts
# numeric values to numbers, and packages the results into an object
# with keys specified by the "headers" array
def objectify(headers):
  # For jq 1.4, replace the following line by: def tonumberq: .;
  def tonumberq: tonumber? // .;
  . as $in
  | reduce range(0; headers|length) as $i ({}; .[headers[$i]] = ($in[$i] | tonumberq) );

def csv2table:
  # For jq 1.4, replace the following line by:  def trim: .;
  def trim: sub("^ +";"") |  sub(" +$";"");
  split("\n") | map( split(",") | map(trim) );

def csv2json:
  csv2table
  | .[0] as $headers
  | reduce (.[1:][] | select(length > 0) ) as $row
      ( []; . + [ $row|objectify($headers) ]);

csv2json

Example (assuming csv.csv is the given CSV text file):

$ jq -R -s -f csv2json.jq csv.csv
[
  {
    "name": "john",
    "age": 20,
    "gender": "male"
  },
  {
    "name": "jane",
    "age": 30,
    "gender": "female"
  },
  {
    "name": "bob",
    "age": 25,
    "gender": "male"
  }
]
peak
  • 105,803
  • 17
  • 152
  • 177
  • I confirm this really works (tested with jq version 1.5). Nice presentation of modularized solution. Anyway, it will take me some time to understand all the constructions. Nice. – Jan Vlcinsky Nov 14 '15 at 19:40
  • 1
    @peak How can that be extended to not wrap "" into `"\"\""` and convert FALSE to `false` and TRUE to `true`? – philk Mar 17 '17 at 12:41
  • @peak I managed to convert the boolean values myself, but the double quotes in strings still bother me. Also where would I have to extend `csv2json` def to only convert rows whos `Brand' value is "MYBRAND"? – philk Mar 17 '17 at 13:14
  • @philk - Perhaps you could tweak trim to trim the outermost double quotes. I don't understand the rest of your question very well. Maybe it's worthy of its own top-level SO question? – peak Mar 17 '17 at 13:32
  • 1
    This solution is pretty good, but I've found that it doesn't do well handling quoted values containing commas in the CSV. That is, if the header row contains `name,age` and a data row contains `"smith, bill",42`, then it will appear in the JSON as `{"name": "\"smith", "age": "bill\""}`. – Mr. Lance E Sloan Jan 28 '20 at 21:14
33

with Miller (http://johnkerl.org/miller/doc/) is very simple. Using this input.csv file

name,age,gender
john,20,male
jane,30,female
bob,25,male

and running

mlr --c2j --jlistwrap cat input.csv

You will have

[
{ "name": "john", "age": 20, "gender": "male" }
,{ "name": "jane", "age": 30, "gender": "female" }
,{ "name": "bob", "age": 25, "gender": "male" }
]

EDIT

This is an old question: the new doc page is https://miller.readthedocs.io/en/latest/

aborruso
  • 4,938
  • 3
  • 23
  • 40
  • @user3041539 I do not completely agree. It's true, this is a jq question, but jq is not the tool to read a csv, then it's necessary to give to an "x" question, an "y" reply. Then I agree with you, but it was in some mandatory – aborruso Nov 20 '19 at 09:01
  • 4
    maybe if it didn't wasn't clear - I support this answer, and think it should be ranked higher. – user3041539 Nov 20 '19 at 09:09
  • https://miller.readthedocs.io/en/latest/ – viktorkho Nov 23 '21 at 17:31
  • 2
    I am so pleased to have stumbled across _miller_ (via this answer). This answer is certainly deserving of its place. These nuggets really enhance Stack. – bobbogo Nov 15 '22 at 10:31
17

As of 2018, a modern no code solution would be to use Python tool csvkit has csvjson data.csv > data.json.

See their documentation https://csvkit.readthedocs.io/en/1.0.2/

The toolkit is also very handy and complementary to jq if your script has to debug both csv and json formats.

You might also want to check a powerful tool called visidata. Here is a screencast case study that is similar to the original poster's. You can also generate script from visidata

Michel Hua
  • 1,614
  • 2
  • 23
  • 44
  • 1
    the question is a classic X/Y problem - where the solution to X is much easier than what is needed to get Y done - xyproblem.info . Thanks for answering X ! – user3041539 Nov 20 '19 at 08:58
  • 2
    Thanks for the `csvkit` tip. Great tool! – while Jun 03 '20 at 15:13
10

I had a little play and came up with this. But it may not be the best way, and I'd be interested to see what your attempts were like, because after all if we both came at a solution I'm sure it'd be twice as good!

But I would start from something like:

true as $doHeaders
| . / "\n"
| map(. / ", ")
| (if $doHeaders then .[0] else [range(0; (.[0] | length)) | tostring] end) as $headers
| .[if $doHeaders then 1 else 0 end:][]
| . as $values
| keys
| map({($headers[.]): $values[.]})

Working Example

The variable $doHeaders controls whether to read the top line as a header line. In your case you want it as true, but I added it for future SO users and because, well, I had an excellent breakfast today and the weather is lovely, so why not?

Little explanation:

1) . / "\n" Split by line...

2) map(. / ", ") ... and comma (Big gotcha: In your version, you'll want to use a regex based split because like this you'll split on commas inside quotation marks too. I just used this because it's terse, and that makes my solution look cool right?)

3) if $doHeaders then... Here we create an array of strings keys or numbers depending on the number of elements in the first row and whether the first row is a header row

4) .[if $doHeaders then 1 else 0 end:] Ok, so trim off the top line if it's a header

5) map({($headers[.]): $values[.]}) Above we go over each row in the former csv, and put the $values into a variable and the keys into a pipe. Then we construct your desired object.

Of course you'll want to use a few regexes to fill in the gotchas, but I hope that starts you on the way.

Mmmh mmh
  • 5,334
  • 3
  • 21
  • 29
Tom
  • 1,773
  • 15
  • 23
  • 1
    I would simply wink back, but one is required to write more in a comment these days. – Tom Apr 20 '15 at 08:10
  • Thanks for the effort on this! Admittedly, this is more of a theory question than practical. I ended up doing this in bash, but kept thinking about whether it could be done in only jq, so I asked. The above is close. It outputs `[ {"name": "john"}, {"age": 20}, {"gender": "male"}...` – jpl1079 Apr 21 '15 at 20:53
  • and I'd upvote you if I had the reputation needed :) Thanks again. – jpl1079 Apr 21 '15 at 20:56
  • Pleasure. Oh- and whoops as it happens I missed out the final `| add` at the end of the query, but there we go. Best. – Tom Apr 22 '15 at 10:17
10

yq (disclaimer I wrote it) supports this out of the box:

yq file.csv -p=csv -o=json

yields:

[
  {
    "name": "john",
    " age": 20,
    " gender": "male"
  },
  {
    "name": "jane",
    " age": 30,
    " gender": "female"
  },
  {
    "name": "bob",
    " age": 25,
    " gender": "male"
  }
]

The original CSV has leading spaces on columns 2 and 3 - not sure if that's a mistake or not. You can trim them by add an expression:

yq '(... | select(tag == "!!str")) |= trim'  file.csv -p=csv -o=json

This will match all strings and trim leading spaces, yielding:

[
  {
    "name": "john",
    "age": 20,
    "gender": "male"
  },
  {
    "name": "jane",
    "age": 30,
    "gender": "female"
  },
  {
    "name": "bob",
    "age": 25,
    "gender": "male"
  }
]
mike.f
  • 1,586
  • 13
  • 14
4

Here is a solution that assumes you run jq with -s and -R options.

[
  [                                               
    split("\n")[]                  # transform csv input into array
  | split(", ")                    # where first element has key names
  | select(length==3)              # and other elements have values
  ]                                
  | {h:.[0], v:.[1:][]}            # {h:[keys], v:[values]}
  | [.h, (.v|map(tonumber?//.))]   # [ [keys], [values] ]
  | [ transpose[]                  # [ [key,value], [key,value], ... ]
      | {key:.[0], value:.[1]}     # [ {"key":key, "value":value}, ... ]
    ]
  | from_entries                   # { key:value, key:value, ... }
]

Sample run:

jq -s -R -f filter.jq data.csv

Sample output

[
  {
    "name": "john",
    "age": 20,
    "gender": "male"
  },
  {
    "name": "jane",
    "age": 30,
    "gender": "female"
  },
  {
    "name": "bob",
    "age": 25,
    "gender": "male"
  }
]
jq170727
  • 13,159
  • 3
  • 46
  • 56
  • 2
    jq is getting error when parsing filter file `jq: error: syntax error, unexpected ?//, expecting ';' or ')' (Unix shell quoting issues?) at , line 8: | [.h, (.v|map(tonumber?//.))] # [ [keys], [values] ] jq: 1 compile error` – QkiZ Aug 25 '20 at 12:56
2

Here is a fairly simple "one-liner" version for jq that will work for "reasonably" sized files, for very large files you would need a version that doesn't use slurp. I'm fairly new to jq and I'm sure there are even better ways to do this (maybe just incr an index value instead of storing in the data). You can replace "split" with ./"\n" and ./"," if you want to make it even shorter and harder to read. NOTE: if you really need the space after the comma can split on ", " or add |map(gsub("^\s+|\s+$";"")) after the split on comma to trim leading & trailing white space.

jq -Rs 'split("\n")|map(split(",")|to_entries)|.[0] as $header|.[1:]|map(reduce .[] as $item ({};.[$header[$item.key].value]=$item.value))'

Here is a commented version:

# jq -Rs
split("\n") | map( split(",") | to_entries ) # split lines, split comma & number
  | .[0] as $header # save [0]
  | .[1:] # and then drop it
  | map( reduce .[] as $item ( {}; .[$header[$item.key].value] = $item.value ) )

The top portion is pretty straight-forward: split the data on newline, then for each of those elements split on comma and then to_entries will turn each of those into key/value entries with a numbering of the keys (0..N): {key:#, value:string}

Then it uses map/reduce to take each element and replace it with an object of key/value pairs using the numbered key to index back into the headers to get the label. For those new to reduce (like me) the first element up to the semi-colon is to initialize the 'accumulator' (the thing you modify each pass over the elements) so .[...] is modifying the accumulator and $item is the object we're operating on.

Update: I got a better version working now that doesn't use slurp, and we do NOT use -n option because it will treat first line specially:

jq -R 'split(",") as $h|reduce inputs as $in ([]; . += [$in|split(",")|. as $a|reduce range(0,length) as $i ({};.[$h[$i]]=$a[$i])])'
DarkStar
  • 555
  • 4
  • 5
  • Your answer is close, but keys and values have extra spaces, and the numeric type is a string. See this [jqplay.org](https://jqplay.org/s/D6iyIsblwYd) of your answer. – rickhg12hs Jan 03 '23 at 07:29
2

Having done something similar recently, here's another jq one-liner to transform the CSV into a JSON array.

jq --null-input --raw-input '[input|scan("\\w+")] as $header |[inputs as $data |[$header,[$data|scan("\\w+")|tonumber? // .]] |transpose |map({(.[0]):.[1]}) |add]' input.csv

Output, given the example input:

[
  {
    "name": "john",
    "age": 20,
    "gender": "male"
  },
  {
    "name": "jane",
    "age": 30,
    "gender": "female"
  },
  {
    "name": "bob",
    "age": 25,
    "gender": "male"
  }
]

Try it on jqplay.org.

rickhg12hs
  • 10,638
  • 6
  • 24
  • 42
1

It can also be done without reduce syntax:

#! /bin/jq -fRs

split("\n")|map(select(.!="")|split(","))
|.[0] as $headers
|.[1:][]
|with_entries(.key=$headers[.key])
JanB
  • 29
  • 4
  • 2
    Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Sep 23 '21 at 04:52
  • 1
    This answer was reviewed in the [Low Quality Queue](https://stackoverflow.com/help/review-low-quality). Here are some guidelines for [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). Code only answers are **not considered good answers**, and are likely to be downvoted and/or deleted because they are **less useful** to a community of learners. It's only obvious to you. Explain what it does, and how it's different / **better** than existing answers. – Trenton McKinney Jul 25 '22 at 21:47
  • Your answer is close, but keys and values have extra spaces, and the numeric type is a string. See this [jqplay.org](https://jqplay.org/s/o29YlD6852b) of your answer. – rickhg12hs Jan 03 '23 at 07:27