5

Requirement: I have a JSON file in .gz format. Compressed, it is around ~500 MB in size; when I extract it, the JSON file grows to roughly ~10 GB. The extracted file contains individual JSON objects, one per line. What I want is to sort the file based on a field ps, using either a bash script or a Python program.

Because the file is too large, it's not advisable to load it into memory, so I used the gzcat and cat bash commands to stream the JSON data and pipe it to jq for sorting. But either the system stops responding during the process or I get an empty output.json file.

>cat  sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"
>gzcat  sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'"  > "output.json"

Hardware: 16GB RAM, core i5 processor

Sample JSON data:

{
    "ps":"abc"
    ....
}
{   
    "ps":"def"
    ......
}
{
    "ps":"abc"
    ....
}

Expected output:

{
    "ps":"abc"
    ....
}
{   
    "ps":"abc"
    ....
}
{
    "ps":"def"
    ....
}

I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file? Link I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism

Also, is there any way I can do this via MapReduce without Hadoop?

Approach 1: Streaming the data into a local SQLite DB.

import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor obj

record_count = 0
for line in fileinput.input():
    cur.execute(insert_query, (line,))
    record_count += 1
    if record_count % 100000 == 0:
        con.commit()  # commit in batches so a single huge transaction doesn't build up

con.commit()  # commit the remaining rows
con.close()

Command line:

>gzcat sth.json.gz | python insert.py
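
For reference, a minimal sketch of how the rows could then be read back out of SQLite in sorted order. It assumes the feeds table the script inserts into has a single data TEXT column, that the SQLite build includes the JSON1 extension (json_extract), and that testDB.db stands in for the real database path:

# Create the table once before running insert.py (assumed schema: one TEXT column).
sqlite3 testDB.db "CREATE TABLE IF NOT EXISTS feeds (data TEXT);"

# Optional: an expression index (SQLite 3.9+) so the ORDER BY doesn't re-parse every row.
sqlite3 testDB.db "CREATE INDEX IF NOT EXISTS idx_ps ON feeds (json_extract(data, '$.ps'));"

# After the insert has finished, stream the objects back out ordered by ps.
sqlite3 testDB.db "SELECT data FROM feeds ORDER BY json_extract(data, '$.ps');" > output.json
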
dks551
  • Then how can I sort the JSON file? Is there any utility available, or anything else? – dks551 Apr 16 '19 at 16:29
  • Yes, correct. I tested with some 20 rows and I could see them. – dks551 Apr 16 '19 at 16:30
  • If you can e.g. prefix the lines with the sort key so that they can be sorted as text rather than JSON, then GNU sort can easily sort 10GB+ files without loading them into memory. – that other guy Apr 16 '19 at 16:49
  • I interpreted "individual JSON objects line by line" as JSONL – that other guy Apr 16 '19 at 17:26
  • @Tomalak In the above comment you mentioned streaming the .gz file content to sqlite using Python. Can you please provide more details on how to do that? – dks551 Apr 16 '19 at 17:40
  • @Tomalak I tried to stream the data into the sqlite db, but it looks like it's running forever. Maybe I am missing something here; I've updated my code with this approach. – dks551 Apr 17 '19 at 04:23

2 Answers

6

Here is one solution based on the suggestion in one of the comments:

If you can e.g. prefix the lines with the sort key so that they can be sorted as text rather than JSON, then GNU sort can easily sort 10GB+ files without loading them into memory. – that other guy

You can use jq to do this along the following lines:

jq -cr '"\(.ps)\t\(.)"' 

This will produce lines with tab-separated values like so:

abc {"ps":"abc","x":0}
abc {"ps":"abc","x":1}

Using the -c option ensures that each pair (i.e. the sorting key and object) is written to a single line.

Now you can easily sort the lines, e.g. using sort; and then use e.g. cut to strip the .ps field.

Finally, if you really want the output to be formatted, you can again use jq (e.g. jq .), the point being that jq is by default stream-oriented.
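
Putting those steps together, a minimal sketch of the whole pipeline might look like this, assuming the compressed input is sth2.json.gz as in the question and that the .ps values are tab-free (see the caveat below):

gzcat sth2.json.gz |
  # Prefix each object with its sort key, separated by a tab.
  jq -cr '"\(.ps)\t\(.)"' |
  # Sort on the key column only; sort(1) spills to temporary files,
  # so the whole 10 GB never has to fit in memory at once.
  sort -t $'\t' -k1,1 |
  # Drop the key column and keep the JSON.
  cut -f 2- > output.json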

Caveat

The above assumes that the .ps values are tab-free. If that is not the case, then you could either use a different field-separator, or:

jq -cr '([.ps] | @tsv) + "\t" + tostring'
peak
  • @peak Correct me if I am wrong, but when you sort, the whole JSON data will be loaded into memory. – dks551 Apr 17 '19 at 04:15
  • `gzcat sth.json.gz | jq -cr '"\(.ps)\t\(.)"' | sort | cut -f 2 >> abc` This is the command I used, and it's running forever. – dks551 Apr 17 '19 at 04:16
  • Use the -k option of `sort` so you're only sorting on one key. Fine-tuning with the -S option might also help. You might also want to check whether your `sort` is sufficiently up to date. – peak Apr 17 '19 at 05:44
  • Also you might want to read up on the -T option, and the role of TMPDIR. – peak Apr 17 '19 at 13:11
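
For illustration, the -k, -S and -T options mentioned in the comments above could be combined along these lines; this is the same pipeline as in the answer, and the 4G buffer size and temporary directory are placeholders to adapt to your machine:

gzcat sth.json.gz |
  jq -cr '"\(.ps)\t\(.)"' |
  # -k1,1 sorts on the key column only, -S caps sort's in-memory buffer,
  # and -T points the temporary spill files at a disk with enough free space.
  sort -t $'\t' -k1,1 -S 4G -T /path/with/space |
  cut -f 2- > output.json
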
0

Instead of being concerned about different kinds of edge cases, you can use base64 encoding and decoding.

This approach creates two tab-separated columns. The first column contains the value used for sorting; the second is the complete JSON object, base64-encoded.

After sorting, we take the second column and base64-decode it.

gzcat sth.json.gz |
  # Create a TSV where the first column contains the value that should be sorted on.
  jq -cr '[(. | .ps | @text), (. | @base64)] | @tsv' |
  # Sort only on the first column.
  sort -t$'\t' -k1,1 |
  # Get the base64-encoded JSON from the second column.
  cut -f 2 |
  # Decode the base64-encoded JSON. (Ignores newlines.)
  base64 -d |
  # Required to place each JSON object on a separate line.
  jq -c .
bas080