1

I have a small awk script which takes input from a stream and writes to the appropriate file based on the second column value. Here is how it goes:

cat mydir/*.csv | awk -F, '{if(NF==29)print $0 >> "output/"$2".csv"}'

How do I parallelize it, so that it can use multiple cores available in the machine? Right now, this is running on a single core.
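
For illustration, a made-up example of the routing (assuming GNU seq and paste are available): each 29-field line is appended to the file named after its second field.

mkdir -p output
# two fabricated 29-field rows whose second fields are "foo" and "bar"
printf 'id1,foo,%s\nid2,bar,%s\n' "$(seq 27 | paste -sd, -)" "$(seq 27 | paste -sd, -)" |
  awk -F, '{if(NF==29)print $0 >> "output/"$2".csv"}'
ls output/    # foo.csv  bar.csv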

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
khan
  • 7,005
  • 15
  • 48
  • 70
  • 3
    This looks like an IO bound process, so parallelizing it will not make it faster. – Thor Jan 03 '19 at 18:56
  • 1
    Is the 'stream' really the concatenation of a set of files? Or is that just a simple example to emulate the source? It matters, because the obvious way to improve what's shown is to run `awk` on each file separately and in parallel. Be aware that `awk` may have an upper bound on how many files it can have open for output at once. – Jonathan Leffler Jan 04 '19 at 06:11
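
A sketch of the open-file concern from the comment above (still single core): one workaround is to close each output file after writing, since `>>` keeps appending across re-opens, at the cost of an extra open/close per line:

cat mydir/*.csv | awk -F, 'NF==29 { f = "output/" $2 ".csv"; print >> f; close(f) }'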

2 Answers

0

Untested:

do_one() {
  # Make a workdir used only by this process, so no two jobs append to the same file
  mkdir -p "$1"
  cd "$1" || exit
  cat ../"$2" | awk -F, '{if(NF==29)print $0 >> $2".csv"}'
}
export -f do_one
parallel do_one workdir-{%} {} ::: mydir/*.csv
# Merge the per-workdir pieces into the final files (filter out ls's directory headers)
ls workdir-*/ | grep '\.csv$' | sort -u |
   parallel 'cat workdir-*/{} > output/{}'
rm -rf workdir-*

If you want to avoid the extra `cat` you can use this instead, though I find the `cat` version easier to read (performance is normally the same on modern systems; see http://oletange.blogspot.com/2013/10/useless-use-of-cat.html):

do_one() {
  # Make a workdir used only by this process, so no two jobs append to the same file
  mkdir -p "$1"
  cd "$1" || exit
  awk -F, '{if(NF==29)print $0 >> $2".csv"}' < ../"$2"
}
export -f do_one
parallel do_one workdir-{%} {} ::: mydir/*.csv
ls workdir-*/ | grep '\.csv$' | sort -u |
   parallel 'cat workdir-*/{} > output/{}'
rm -rf workdir-*

But as @Thor writes, you are most likely I/O starved.
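
A rough way to check whether that is the case: time the pure read against the full pipeline; if the two numbers are close, the disk is the bottleneck and extra cores will not help. (The scratch directory name below is arbitrary, used only so the real output/ is untouched.)

# time the pure read
time cat mydir/*.csv > /dev/null
# time read + awk, writing into a scratch dir
mkdir -p /tmp/awk-test
time cat mydir/*.csv | awk -F, '{if(NF==29)print $0 >> "/tmp/awk-test/"$2".csv"}'
rm -rf /tmp/awk-test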

Ole Tange
  • 31,768
  • 5
  • 86
  • 104
  • 1
    Probably want to refactor to avoid the [useless `cat`](/questions/11710552/useless-use-of-cat); `awk -F, 'NF==29 { print >> $2 ".csv" }' ../"$2"` – tripleee Jan 04 '19 at 06:06
0

You can try this.

I run one awk per source file. Each process writes into its own set of temporary files (a different series per process, to avoid conflicts on the same final file and/or too many open/close operations on it). At the end of the awk, it appends the content of each temporary file to the final one and removes the temporary.

If there are lots of files to treat, you may have to add a batch limiter (a sleep, or smarter grouping) so you don't kill the machine with too many concurrent subprocesses; a sketch follows the script below.

rm -f output/*.csv
for File in mydir/*.csv
 do
   # shell sub process
   {
   # ref for a series of temporary files
   FileRef="${File##*/}"

   awk -F ',' -v FR="${FileRef}" '
      NF == 29 {
         # put info in a temporary file
         ListFiles[ OutTemp = "output/" $2 ".csv_" FR ] = "output/" $2 ".csv"
         print > OutTemp
         }
      END {
        # put temporary content into the final file
        for ( TempFile in ListFiles ) {
           # flush and close the temporary file before reading it back
           close( TempFile )
           Command = sprintf( "cat \042%s\042 >> \042%s\042; rm \042%s\042" \
              , TempFile, ListFiles[TempFile], TempFile )
           printf "" | Command
           close( Command )
           }
        }
      ' "$File"
    } &
 done

wait
ls -l output/*.csv
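
A minimal sketch of the batch limiter mentioned above (it assumes bash 4.3 or later for `wait -n` and GNU coreutils for `nproc`; the `:` line is only a placeholder for the `{ ... } &` sub process of the script above):

MaxJobs=$(nproc)
for File in mydir/*.csv
 do
   while [ "$(jobs -rp | wc -l)" -ge "$MaxJobs" ]
    do
      wait -n           # block until one of the running jobs finishes
    done
   : "per-file sub process for $File goes here" &
 done
wait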
NeronLeVelu
  • 9,908
  • 1
  • 23
  • 43
  • Why would you run the shell in a subprocess? Surely just running Awk in the background should suffice. – tripleee Jan 04 '19 at 09:22
  • Yes, just because at the beginning I was thinking of transferring the temporary files' content outside of the awk process, to limit the subshells spawned from awk, and also to keep the whole part in the same subprocess for any post-awk step such as a sort; I kept it like that because it does not change performance or memory use much. – NeronLeVelu Jan 04 '19 at 13:32
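
For what it is worth, a sketch of tripleee's suggestion, backgrounding awk itself instead of a `{ ... }` shell group (the awk program is shortened here; the END merge block from the answer would stay as it is):

for File in mydir/*.csv
 do
   awk -F ',' -v FR="${File##*/}" 'NF == 29 { print > ("output/" $2 ".csv_" FR) }' "$File" &
 done
wait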