Prevent two or more instances of awk from simultaneously writing to the same file

Question

Does awk have any built-in support for preventing writing to the same file that another instance of awk is already writing to?

Consider the following:

$ # Create large input file
$ for i in {1..500000}; do echo "$i,$i,$i" >> /tmp/LargeFile.txt; done
$ # Launch two simultaneous instances of awk outputting to the same file
$ awk -F"," '{print $0}' /tmp/LargeFile.txt >> /tmp/OutputFile.txt & awk -F"," '{print $0}' /tmp/LargeFile.txt >> /tmp/OutputFile.txt &
$ # Find out how many fields are in each line (ideally 3)    
$ awk -F"," '{print NF}' /tmp/Output.txt | sort | uniq -c
          1 0
        553 1
       1282 2
     996412 3
       1114 4
        638 5

Thus two awk instances are simultaneously outputting a lot of data to the same file. Ideally the output file would have three comma-separate values per line, but since both instances are writing to the same file at the same time, some lines may have more than 3 comma-separated values, and some will have fewer than 3.

Example corrupt output file:

1,1,1   < 1's from from first instance of awk
2,2,2   < 2's from from first instance of awk
3,3,3   < 3's from from first instance of awk
1,1,1   < 1's from from second instance of awk
2,2,2   < 2's from from second instance of awk
4,4,4   < 4's from from first instance of awk
5,5,5   < 5's from from first instance of awk
3,3,3   < 3's from from second instance of awk
4,6,6,4,6   < corrupted input as both instances tried to write to this line at the same time
4
7,7,7   < 7's from from first instance of awk

Are there any good and simple methods to prevent this?

Edit - adding more detail from the actual scenario:

The processing done by each instance of awk will be more like this: Data from other processes is continuously written to files, e.g. every 5 minutes there is a new file. Multiple instances of awk will be invoked to process/aggregate the data on set intervals (say every 30 minutes).

cat SomeFilesWithLotsOfData | awk '
{
    # process lots of data which takes a lot of time
    # build up associate arrays based on input
}
END {
    # Output processed data which takes little time
    # Loop over associative arrays and output to persistent files
}'

Say the processing portion (before the END statements) takes 30 minutes to complete (wow that's a long time, but lets go with it for illustration). A second instance of this same awk script may be instantiated to process a new batch of files with data before this first one ends, and it needs to output its processed data to the same files as the previous instance. The exact number of output files each awk instance outputs to is dependent on the input (i.e. its based on a particular field in the input records). I don't want to lock all of the possible output files before the input is processed, because I don't know which awk instance will complete processing first. So presently I am planning to create a lock at the beginning of the END and unlock it after the END, but my implementation is a little clunky so I'm looking for a superior method.

There could be some ways to prevent this. 1-`awk` 2- Let first command complete and then it will automatically come to 2nd command and your file will be concatenated then. 2- In case you want to run them in background then use different output files and once they are done then concatenate them or write them in same file. Let me know if anyone of this helps you? — RavinderSingh13, Jun 14 '18 at 02:30
Thanks RavinderSingh13. I can't guarantee that the two awk processes won't be running at the same it, and I'm hoping to avoid the added complexity of concatenating the files at the end (which in my actual scenario would be complicated to implement). So I'm hoping that awk might have some nice built-in feature. — Rusty Lemur, Jun 14 '18 at 02:36
As already suggested why don't you write 2 different files and then simple `cat` them to 1 single output? By doing this no confusions will be there to be honest. — RavinderSingh13, Jun 14 '18 at 02:44
can't you just use `sem` or `pidlock` to be sure that only one instance of `awk` is running? otherwise you can just use a lock on the file to avoid collisions: `lockfile` command can help you — Allan, Jun 14 '18 at 02:45
The confusion for writing to multiple files and then concatenating them is that there are actually an indefinite number of output files which each instance of awk may be writing to. In my example there are only two, but in actuality there will be many per instance of awk, and each file needs to be protected. How would awk use lockfile to protect a file? — Rusty Lemur, Jun 14 '18 at 03:47

score 1 · Answer 1 · answered Feb 02 '19 at 18:48

There is a similar question at Quick-and-dirty way to ensure only one instance of a shell script is running at a time

The flock(1) solution is probably simplest if that command exists on your system.

One option is to simply wrap all invocations of your awk script:

flock -x /var/lock/myscriptlockfile awk ...

This will serialise invocations of your awk script so that only one may be running at a time. You can adjust the flock call so that it terminates after some time instead of waiting forever if you want to decide whether to retry later or just skip.

To allow multiple copies of the script to be running but only allow one to write at a time, you can tweak this solution to call flock from within your END to wrap the so-called 'critical section' with something like:

awk 300>/var/lock/myscriptlockfile '
    # ...
    END {
        system("flock -x 300");
        # critical section
        system("flock -u 300");
    }
'

Prevent two or more instances of awk from simultaneously writing to the same file

Edit - adding more detail from the actual scenario:

1 Answers1