2

My script processes ~30 lines per second and uses just one CPU core.

while read -r line; do echo "$line" | jq -c '{some-transformation-logic}'; done < input.json >> output.json

The input.json file is ~6 GB and ~17M lines. It's newline-delimited JSON, not a JSON array.
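To be concrete, newline-delimited JSON means one complete JSON object per line, roughly like this (the field names here are made up purely for illustration):

{"id": 1, "value": "foo"}
{"id": 2, "value": "bar"}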

I have 16 cores (vCPUs on GCP), or more if that makes sense, and want to run this process in parallel. I know Hadoop is one way to go, but this is a one-time thing. How do I speed the process up to ~600 lines per second simply?

Lines ordering is not important.

peak
stkvtflw
  • Does the order of the lines in the output have to match the order of the lines in the input? – dash-o Jul 15 '22 at 07:35
  • no, the order is not important – stkvtflw Jul 15 '22 at 07:38
  • 3
    Won't getting rid of the loop with a single `jq` call be already monstrously faster? – Fravadona Jul 15 '22 at 07:43
  • 1
    you can do this??? Why on earth was I absolutely sure that jq can't handle new-line delimited JSON.... thanks! – stkvtflw Jul 15 '22 at 07:55
  • should I delete this question? Since the premise is wrong – stkvtflw Jul 15 '22 at 07:58
  • You can use `xargs` or GNU `parallel` to run multiple copies of a program with input divided among them, too. – Shawn Jul 15 '22 at 08:04
  • You may want to use the parallel version if your data set is big. If your input file is 1M lines, running single-threaded on a decent machine will take 20+ seconds (depending on the complexity of the input). With -j4 you go down to 14, probably faster if you have more cores. See details in my answer. – dash-o Jul 15 '22 at 08:17
  • Rather than delete your question, you might consider adding an answer showing that the parallel approach is *"sub-optimal"* and how you can do it faster/better with `jq`. – Mark Setchell Jul 15 '22 at 08:36
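As Fravadona's comment above points out, jq handles newline-delimited JSON natively, so the per-line loop can be replaced by a single invocation over the whole file. A minimal sketch, with `.` standing in for the actual transformation:

# one jq process reads the whole stream; -c emits one compact JSON object per line
jq -c '.' input.json > output.json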

1 Answer

2

Given that the input order does not have to match the output order, try parallel:

parallel -j16 --spreadstdin '(transformation)' < input.json > output.json

Note that parallel can also choose the number of jobs based on the number of available cores, so the command adapts to the actual machine configuration; check the man page for the exact options and syntax.

parallel -j0 --spreadstdin '(transformation)' < input.json > output.json

This solution also "batches" multiple input lines per jq invocation, avoiding the overhead of starting one jq process per line as in the original loop, which is what was pointed out in the comments.
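A concrete version of the above, assuming the transformation is a jq filter (again with `.` as a stand-in) and using --block to control how much input each jq invocation receives, might look like:

# 16 jobs; stdin is split into ~10 MB chunks on line boundaries, each chunk fed to one jq process
parallel -j16 --spreadstdin --block 10M 'jq -c .' < input.json > output.json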

The --keep-order option can be used to preserve the input order in the output, at the cost of some extra processing time; per the OP, that is not needed here.
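For completeness, if ordering did matter, the same command with --keep-order (or -k) would emit the results in input order:

# same as above, but output chunks are written back in the order the input chunks were read
parallel -j16 --keep-order --spreadstdin 'jq -c .' < input.json > output.json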

tripleee
dash-o