
I have a JSON file, 'OpenEnded_mscoco_val2014.json'. The file contains 121,512 questions.
Here is a sample:

"questions": [
{
  "question": "What is the table made of?",
  "image_id": 350623,
  "question_id": 3506232
},
{
  "question": "Is the food napping on the table?",
  "image_id": 350623,
  "question_id": 3506230
},
{
  "question": "What has been upcycled to make lights?",
  "image_id": 350623,
  "question_id": 3506231
},
{
  "question": "Is this an Spanish town?",
  "image_id": 8647,
  "question_id": 86472
}

]

I used jq -r '.questions | [map(.question), map(.image_id), map(.question_id)] | @csv' OpenEnded_mscoco_val2014_questions.json >> temp.csv to convert the JSON into CSV.
But the resulting CSV lists the questions followed by the image_ids, which is what the command above does.
The expected output is:

"What is the table made of?",350623,3506232
"Is the food napping on the table?",350623,3506230

Also, is it possible to keep only the results having image_id <= 10000, and to group questions having the same image_id? E.g. results 1, 2 and 3 of the JSON could be combined into one row with 3 questions, 1 image_id and 3 question_ids.

EDIT: The first problem is solved by the possible duplicate question. I would like to know whether it is possible to apply a comparison operator on the command line in jq while converting the JSON file. In this case, get all fields from the JSON only if image_id <= 10000.

peak
SupposeXYZ
  • Not quite sure what your first question is here? – JosephGarrone Sep 15 '16 at 05:33
  • Possible duplicate of [How to convert arbirtrary simple JSON to CSV using jq?](http://stackoverflow.com/questions/32960857/how-to-convert-arbirtrary-simple-json-to-csv-using-jq) – Nehal J Wani Sep 15 '16 at 05:44
  • I would like to keep only the output with image_id <= 10000 using jq, as the file is too large, so using json_load() and comparing would take a lot of memory. – SupposeXYZ Sep 15 '16 at 05:53
  • 1. Please fix the question so that the example input is valid JSON. 2. If the point of the question is that (.questions | length) is so big that you don't want to have to read the whole file into memory, then please say so. (In that case, jq has a streaming parser which might be able to help.) – peak Sep 15 '16 at 06:18

2 Answers

1

1) Given your input (suitably elaborated to make it valid JSON), the following query generates the CSV output as shown:

$ jq -r '.questions[] | [.question, .image_id, .question_id] | @csv'

"What is the table made of?",350623,3506232
"Is the food napping on the table?",350623,3506230
"What has been upcycled to make lights?",350623,3506231
"Is this an Spanish town?",8647,86472

The key thing to remember here is that @csv requires a flat array, but as with all jq filters, you can feed it a stream.
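
As a minimal sketch of that point, the snippet below runs the query end to end. The `input` variable is an assumption for illustration: it holds the question's four sample records wrapped in an outer object so that the input is valid JSON.

```shell
# Assumed sample input: the question's records wrapped in {"questions": [...]}
# so that jq can parse them.
input='{"questions":[
  {"question":"What is the table made of?","image_id":350623,"question_id":3506232},
  {"question":"Is the food napping on the table?","image_id":350623,"question_id":3506230},
  {"question":"What has been upcycled to make lights?","image_id":350623,"question_id":3506231},
  {"question":"Is this an Spanish town?","image_id":8647,"question_id":86472}]}'

# .questions[] emits a stream of objects; each object is turned into one flat
# array, and @csv renders each flat array as one CSV row.
csv=$(printf '%s' "$input" \
  | jq -r '.questions[] | [.question, .image_id, .question_id] | @csv')

printf '%s\n' "$csv"
```

With a real file, the same filter would be given the file name as the last argument instead of reading from the pipe.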

2) To filter using the criterion .image_id <= 10000, just interpose the appropriate select/1 filter:

.questions[]
| select(.image_id <= 10000)
| [.question, .image_id, .question_id]
| @csv
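
The EDIT in the question asks whether the comparison can be driven from the command line. One possible approach, sketched here under the assumption that jq 1.5 or later is available, is to pass the threshold with `--argjson`. The variable name `$max` and the two-record `input` are illustrative choices, not part of the original question.

```shell
# Assumed sample input: two of the question's records, wrapped to be valid JSON.
input='{"questions":[
  {"question":"What is the table made of?","image_id":350623,"question_id":3506232},
  {"question":"Is this an Spanish town?","image_id":8647,"question_id":86472}]}'

# --argjson binds $max to a JSON number, so <= compares numerically.
# (--arg would bind a string instead, and jq orders every number before
# every string, so the test would then pass for all records.)
filtered=$(printf '%s' "$input" | jq -r --argjson max 10000 '
  .questions[]
  | select(.image_id <= $max)
  | [.question, .image_id, .question_id]
  | @csv')

printf '%s\n' "$filtered"
```

Only the record with image_id 8647 passes the filter here.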

3) To sort by image_id, use sort_by(.image_id):

.questions
| sort_by(.image_id)
|.[]
| [.question, .image_id, .question_id]
| @csv

4) To group by .image_id you would pipe the output of the following pipeline into your own pipeline:

.questions | group_by(.image_id)

You will, however, have to decide exactly how you want to combine the grouped objects.
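
One way to combine the grouped objects, offered only as a sketch, is to emit a single CSV row per image_id, with the questions and the question_ids each joined into one field. The separators `" | "` and `";"` and the three-record `input` are arbitrary assumptions for illustration; pick whatever layout suits the consumer of the CSV.

```shell
# Assumed sample input: three of the question's records, wrapped to be valid JSON.
input='{"questions":[
  {"question":"What is the table made of?","image_id":350623,"question_id":3506232},
  {"question":"Is the food napping on the table?","image_id":350623,"question_id":3506230},
  {"question":"Is this an Spanish town?","image_id":8647,"question_id":86472}]}'

# group_by(.image_id) yields an array of groups, sorted by image_id.
# For each group: take the shared image_id from the first member, then join
# the questions and the (stringified) question_ids into single fields.
grouped=$(printf '%s' "$input" | jq -r '
  .questions
  | group_by(.image_id)[]
  | [ .[0].image_id,
      (map(.question) | join(" | ")),
      (map(.question_id | tostring) | join(";")) ]
  | @csv')

printf '%s\n' "$grouped"
```

Because the joined question_ids are strings, @csv quotes that field, while image_id stays an unquoted number.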

peak
  • For second answer is it possible to write .question |select(.image_id<=10000)|[.question, .image_id, .question_id]|@csv so that it will return the constrained output? – SupposeXYZ Sep 15 '16 at 06:10
  • In (2), the given filter DOES emit the constrained output! Have you tried it? – peak Sep 15 '16 at 06:13
  • Hey @peak, thanks, it all worked!! Is it possible to extract specific question types from the JSON data? For example, I want only questions starting with "How", "What is", etc., using json.load(). – SupposeXYZ Sep 17 '16 at 12:07
  • 5. Consider select(startswith(_)). If you have access to a version of jq with regex support, consider also test/1. – peak Sep 17 '16 at 15:58
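
The suggestion in the last comment can be sketched in jq as follows. The prefixes and the three-record `input` are assumptions chosen for illustration, and the `test` variant assumes a jq build with regex support (1.5 or later, as far as I know).

```shell
# Assumed sample input: three of the question's records, wrapped to be valid JSON.
input='{"questions":[
  {"question":"What is the table made of?","image_id":350623,"question_id":3506232},
  {"question":"What has been upcycled to make lights?","image_id":350623,"question_id":3506231},
  {"question":"Is this an Spanish town?","image_id":8647,"question_id":86472}]}'

# startswith/1: keep questions that begin with one fixed prefix.
by_prefix=$(printf '%s' "$input" | jq -r '
  .questions[]
  | select(.question | startswith("What is"))
  | [.question, .image_id, .question_id]
  | @csv')

# test/1: a regex allows several alternative prefixes in one pass.
by_regex=$(printf '%s' "$input" | jq -r '
  .questions[]
  | select(.question | test("^(How|What)"))
  | [.question, .image_id, .question_id]
  | @csv')

printf '%s\n' "$by_prefix"
printf '%s\n' "$by_regex"
```

The prefix filter keeps only the "What is ..." question, while the regex also keeps "What has been upcycled to make lights?".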
0

With the -r option, the following filter

  .questions[] | [ .[] ] | @csv

produces

"What is the table made of?",350623,3506232
"Is the food napping on the table?",350623,3506230
"What has been upcycled to make lights?",350623,3506231
"Is this an Spanish town?",8647,86472

To filter the data, use select. E.g. with the -r option the following filter

  .questions[] | select(.image_id <= 10000) | [ .[] ] | @csv

produces the subset

"Is this an Spanish town?",8647,86472

To group the data use group_by. The following filter

    .questions
  | group_by(.image_id)[]
  | [ .[] | [ .[] ] | @csv ]

produces grouped data

[
  "\"Is this an Spanish town?\",8647,86472"
]
[
  "\"What is the table made of?\",350623,3506232",
  "\"Is the food napping on the table?\",350623,3506230",
  "\"What has been upcycled to make lights?\",350623,3506231"
]

This isn't very useful in this form and is probably not exactly what you want but it demonstrates the basic approach.

jq170727