I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hour.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' "$1"
echo "First Date"
head -n1 "$1" | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 "$1" | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
# Read line by line, strip the JSON around captureTime, convert the epoch
# timestamp to YYYYMMDDHH and append the line to the matching split file.
while IFS= read -r p; do
  date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
  echo "$p" >> "split.$date"
done < "$1"
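One direction I have started sketching (not yet tested at scale) is doing the whole split in a single pass with GNU awk, so that date is not forked once per line. This is only a rough sketch, assuming gawk is available (for strftime) and that captureTime is always the first key, as in the sample line above:
gawk -F'"' '{
    # With " as the field separator, $4 is the captureTime value,
    # e.g. 1534303617.738; int() drops the fractional seconds.
    hour = strftime("%Y%m%d%H", int($4))
    print > ("split." hour)
}' "$1"
If the file happens to be ordered by captureTime, closing the previous output file whenever the hour changes would also keep the number of open file descriptors low.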
Some facts:
- 80 000 000 lines to process
- jq doesn't work well since some JSON lines are invalid (see the quick check below).
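To get an idea of how many lines jq chokes on, I count the lines that do not start with the expected prefix; this is just a rough check, assuming an invalid line is one that does not begin with {"captureTime": " followed by a digit:
# Count lines that do not start with the expected captureTime prefix.
grep -cv '^{"captureTime": "[0-9]' "$1"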
Could you help me optimize this script?
Thank you