
I would like to process a 200 GB file with lines like the following:

...
{"captureTime": "1534303617.738","ua": "..."}
...

The objective is to split this file into multiple files grouped by hours.

Here is my basic script:

#!/bin/sh

echo "Splitting files"

echo "Total lines"
sed -n '$=' "$1"

echo "First Date"
head -n1 "$1" | jq -r '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'

echo "Last Date"
tail -n1 "$1" | jq -r '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'

while read -r p; do
  date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
  echo "$p" >> "split.$date"
done < "$1"

Some facts:

  • 80 000 000 lines to process
  • jq doesn't work well since some JSON lines are invalid.

Could you help me to optimize this bash script?

Thank you

Michel Hua

2 Answers


This awk solution might come to your rescue:

awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1

It essentially replaces your while-loop.
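With `"` as the field separator, the timestamp is the fourth field of each line (see the comments below). A quick check against the sample line from the question:

echo '{"captureTime": "1534303617.738","ua": "..."}' | awk -F'"' '{ print $4 }'
# prints: 1534303617.738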

Furthermore, you can replace the complete script with:

# Start AWK file
BEGIN { FS = "\"" }                 # fields are delimited by double quotes
(NR == 1) { tmin = tmax = $4 }      # seed min/max with the first timestamp
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file = "split." strftime("%Y%m%d%H", $4); print >> file; close(file) }
END {
  print "Total lines processed: ", NR
  print "First date: " strftime("%Y%m%d%H", tmin)
  print "Last date:  " strftime("%Y%m%d%H", tmax)
}

Which you can then run as:

awk -f <awk_file.awk> <jq-file>

Note: the use of strftime means that you need GNU awk (gawk).
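If GNU awk is not available, one possible (slower) fallback is to shell out to date(1) instead, calling it only once per distinct hour and caching the formatted result in an array. A minimal sketch, assuming GNU date's -d @epoch syntax:

# POSIX awk fallback: no strftime available, so format each distinct
# epoch-hour via date(1) and cache the result to avoid one fork per line.
BEGIN { FS = "\"" }
{
  hour = int($4 / 3600)                        # epoch-hour bucket key
  if (!(hour in fmt)) {
    cmd = "date -d @" (hour * 3600) " +%Y%m%d%H"
    cmd | getline fmt[hour]
    close(cmd)
  }
  file = "split." fmt[hour]
  print >> file
  close(file)
}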

kvantour
  • Thanks for giving me this awk snippet. In this related question they don't use the `>>` operator: https://stackoverflow.com/questions/11687054/split-access-log-file-by-dates-using-command-line-tools I think this is where the computation cost comes from. – Michel Hua Aug 22 '18 at 17:54
  • I'm not following what you mean by _computation cost_. The reason for using `>>` instead of `>` is that I am not aware of how many files will be created. If it is only a few files, one can replace the line with `{file="split."strftime("%Y%m%d%H",$4); print > file;}`. However, if it is many files, then you need to close each file and reopen it, hence the usage of `>>`, as this appends to an existing file. – kvantour Aug 22 '18 at 17:59
  • OK, in my initial script there is one write per iteration of the read loop. I was wondering whether awk does the writes in batches or line by line. In my case I have about 200 files. – Michel Hua Aug 22 '18 at 18:03
  • In your initial script you read your file `$1` four times, while the complete `awk` script reads it only once. I believe that is where most of the speedup comes from. On top of that, in the loop, per iteration, you call `sed` twice and `echo`, `xargs` and `date` once each. These are all binaries which require processing time to load and execute. With `awk` you never do that. That is why the `awk` line presented above will be faster than the while loop: it mimics exactly what you wrote (including `echo $p >> split.$date`, which in awk is `print >> file; close(file)`). – kvantour Aug 22 '18 at 18:10
  • For 200 files, you can just use the line presented in the previous comment. This will be even faster. – kvantour Aug 22 '18 at 18:12
  • Last question: how does `$4` parse the captureTime pattern? Found the answer: https://askubuntu.com/questions/342842/what-does-this-command-mean-awk-f-print-4 Thank you very much for your help. – Michel Hua Aug 22 '18 at 18:14
  • I told `awk` to use `"` as the field delimiter (in the first example with the option `-F'"'`, and in the awk script with `BEGIN{ FS = "\"" }`). In this case, `captureTime` is the fourth field, i.e. `$4`. – kvantour Aug 22 '18 at 18:17
  • A great example of replacing many calls to external binaries (echo, date, etc.) with one call to awk. Good luck to all. – shellter Aug 22 '18 at 19:28
  • How can I gzip the output files? I tried to follow https://stackoverflow.com/questions/21698296/awk-gzip-output-to-multiple-files but the `awk` syntax is hard. – Michel Hua Aug 23 '18 at 10:27
  • @MichelHua: Post a new Q with your best attempt to solve the problem. Good luck. – shellter Aug 23 '18 at 11:50
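
For the gzip question above, a rough sketch of the pattern from the linked answer: pipe each line through a per-file gzip command instead of writing to the file directly. Appending works because concatenated gzip streams are themselves a valid gzip stream, although closing the pipe after every line produces one gzip member per line, which is valid but inefficient (assumes GNU awk for strftime):

awk -F'"' '{
  # pipe the line through gzip, appending to the per-hour archive
  cmd = "gzip >> split." strftime("%Y%m%d%H", $4) ".gz"
  print | cmd
  close(cmd)
}' "$1"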

You can start optimizing by replacing the two chained sed calls

sed 's/{"captureTime": "//' | sed 's/","ua":.*//'

with a single one:

sed -nE 's/(\{"captureTime": ")([0-9.]+)(.*)/\2/p'

where:

  • `-n` suppresses automatic printing of the pattern space
  • `-E` uses extended regular expressions in the script
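
A quick check of the single sed call against the sample line from the question:

echo '{"captureTime": "1534303617.738","ua": "..."}' | sed -nE 's/(\{"captureTime": ")([0-9.]+)(.*)/\2/p'
# prints: 1534303617.738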

oguz ismail