Requirement: I have a JSON file in .gz format. Compressed, it is around ~500 MB in size; when I extract it, the JSON file grows to nearly ~10 GB. The extracted JSON file contains individual JSON objects line by line. I want to sort the file based on a field ps, using either a bash script or a Python program.
Because the file is too large, it is not advisable to load it into memory. So I used the gzcat and cat bash commands to stream the JSON data and then piped it to jq for sorting. But either the system does not respond during the process, or I get an empty file in output.json:
>cat sth2.json | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
>gzcat sth2.json.gz | parallel --pipe --group --block 1000M --recend '\n}\n' "jq -s -c 'sort_by(.ps) | .[]'" > "output.json"
Hardware: 16 GB RAM, Core i5 processor
Sample JSON data:
{
"ps":"abc"
....
}
{
"ps":"def"
....
}
{
"ps":"abc"
....
}
Expected output:
{
"ps":"abc"
....
}
{
"ps":"abc"
....
}
{
"ps":"def"
....
}
I don't understand what I am doing wrong. Can anyone suggest how to sort such a huge JSON file?
Links I followed: https://github.com/joelpurra/jq-hopkok/tree/master/src/parallelism
Also, is there any way I can do this via MapReduce without Hadoop?
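For reference, the pattern I have in mind is a plain external merge sort (split the input into sorted chunks, then do a k-way merge), which is essentially the sort/shuffle phase of MapReduce without Hadoop. Below is a minimal Python sketch of that idea, assuming one JSON object per line as described above; CHUNK_LINES and the temp-file handling are illustrative placeholders, not tested code:

import gzip
import heapq
import json
import os
import tempfile

CHUNK_LINES = 1_000_000  # lines per chunk; tune to available RAM (illustrative value)

def sort_key(line):
    # extract the "ps" field that the output should be ordered by
    return json.loads(line)["ps"]

def write_sorted_chunk(lines, chunk_paths):
    # sort one chunk in memory and spill it to a temporary file
    lines.sort(key=sort_key)
    fd, path = tempfile.mkstemp(suffix=".chunk")
    with os.fdopen(fd, "w") as f:
        f.writelines(lines)
    chunk_paths.append(path)

chunk_paths = []
with gzip.open("sth2.json.gz", "rt") as src:
    lines = []
    for line in src:
        lines.append(line)
        if len(lines) >= CHUNK_LINES:
            write_sorted_chunk(lines, chunk_paths)
            lines = []
    if lines:
        write_sorted_chunk(lines, chunk_paths)

# k-way merge of the pre-sorted chunks; heapq.merge is lazy,
# so only one line per chunk is held in memory at a time
chunks = [open(p) for p in chunk_paths]
with open("output.json", "w") as out:
    out.writelines(heapq.merge(*chunks, key=sort_key))
for f in chunks:
    f.close()
for p in chunk_paths:
    os.remove(p)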
Approach 1: Streaming the data into a local SQLite DB.
import sqlite3
import fileinput

PATH = ".../sqlite-snapshot-201904101324/testDB.db"
insert_query = "INSERT INTO feeds (data) VALUES (?)"

def db_connect(db_path=PATH):
    con = sqlite3.connect(db_path)
    return con

con = db_connect()  # connect to the database
cur = con.cursor()  # instantiate a cursor object

record_count = 0
for line in fileinput.input():
    cur.execute(insert_query, (line,))
    record_count += 1

con.commit()  # without an explicit commit the inserted rows are never persisted
con.close()
Command line:
>gzcat sth.json.gz | python insert.py
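The plan after loading would be to let SQLite do the sorting on disk and stream the rows back out in order. Below is a minimal sketch of that read-back step, assuming the SQLite build includes the JSON1 extension (json_extract) and the same feeds table; the query and the output file name are illustrative:

import sqlite3

PATH = ".../sqlite-snapshot-201904101324/testDB.db"

con = sqlite3.connect(PATH)
cur = con.cursor()

# ORDER BY json_extract(...) makes SQLite sort on disk instead of in Python
query = "SELECT data FROM feeds ORDER BY json_extract(data, '$.ps')"
with open("output.json", "w") as out:
    for (data,) in cur.execute(query):
        out.write(data)  # each stored line already ends with a newline

con.close()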