Does awk have any built-in support for preventing writing to the same file that another instance of awk is already writing to?
Consider the following:
$ # Create large input file
$ for i in {1..500000}; do echo "$i,$i,$i" >> /tmp/LargeFile.txt; done
$ # Launch two simultaneous instances of awk outputting to the same file
$ awk -F"," '{print $0}' /tmp/LargeFile.txt >> /tmp/OutputFile.txt & awk -F"," '{print $0}' /tmp/LargeFile.txt >> /tmp/OutputFile.txt &
$ # Find out how many fields are in each line (ideally 3)
$ awk -F"," '{print NF}' /tmp/OutputFile.txt | sort | uniq -c
1 0
553 1
1282 2
996412 3
1114 4
638 5
Thus two awk instances are simultaneously writing a lot of data to the same file. Ideally every line of the output file would contain three comma-separated values, but because both instances write to the file at the same time, some lines end up with more than 3 fields and some with fewer.
Example corrupt output file:
1,1,1 < 1's from the first instance of awk
2,2,2 < 2's from the first instance of awk
3,3,3 < 3's from the first instance of awk
1,1,1 < 1's from the second instance of awk
2,2,2 < 2's from the second instance of awk
4,4,4 < 4's from the first instance of awk
5,5,5 < 5's from the first instance of awk
3,3,3 < 3's from the second instance of awk
4,6,6,4,6 < corrupted output, as both instances tried to write to this line at the same time
4
7,7,7 < 7's from the first instance of awk
Are there any good and simple methods to prevent this?
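(I know I could serialize the two writers entirely by wrapping each one in util-linux flock(1), as in the sketch below, but that blocks one instance for the other's whole runtime, which defeats the point of running them concurrently. The file sizes here are scaled down for brevity.)

```shell
# Regenerate a smaller input file for the demo
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i "," i "," i }' > /tmp/LargeFile.txt
rm -f /tmp/OutputFile.txt

# Each writer takes an exclusive lock on a lock file before it runs,
# so the second awk blocks until the first has finished writing.
flock /tmp/OutputFile.lock awk -F"," '{print $0}' /tmp/LargeFile.txt >> /tmp/OutputFile.txt &
flock /tmp/OutputFile.lock awk -F"," '{print $0}' /tmp/LargeFile.txt >> /tmp/OutputFile.txt &
wait

# Every line now has exactly 3 fields
awk -F"," '{print NF}' /tmp/OutputFile.txt | sort | uniq -c
```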
Edit - adding more detail from the actual scenario:
The processing done by each instance of awk will be more like this: data from other processes is continuously written to files, e.g. every 5 minutes there is a new file. Multiple instances of awk are invoked at set intervals (say every 30 minutes) to process/aggregate the data.
cat SomeFilesWithLotsOfData | awk '
{
# process lots of data which takes a lot of time
# build up associative arrays based on input
}
END {
# Output processed data which takes little time
# Loop over associative arrays and output to persistent files
}'
Say the processing portion (before the END block) takes 30 minutes to complete (wow, that's a long time, but let's go with it for illustration). A second instance of this same awk script may be started to process a new batch of files before the first one finishes, and it needs to output its processed data to the same files as the previous instance. The exact set of output files each awk instance writes to depends on the input (i.e. it's based on a particular field in the input records). I don't want to lock all of the possible output files before the input is processed, because I don't know which awk instance will finish processing first. So presently I am planning to create a lock at the beginning of the END block and release it after it, but my implementation is a little clunky, so I'm looking for a superior method.
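Roughly, my current approach looks like this sketch (the lock path, output path, and aggregation logic are all illustrative; mkdir is used as a portable atomic lock, since mkdir either creates the directory or fails in a single atomic step):

```shell
rm -rf /tmp/awk_end.lock /tmp/aggregate.out

printf 'a,1\na,2\nb,5\n' | awk -F"," '
{
    # long-running aggregation into associative arrays
    sum[$1] += $2
}
END {
    # lock: mkdir is atomic, so only one instance can create the
    # directory; any other instance spins until it is removed
    while (system("mkdir /tmp/awk_end.lock 2>/dev/null") != 0)
        system("sleep 1")

    # fast output phase, protected by the lock
    for (k in sum)
        print k "," sum[k] >> "/tmp/aggregate.out"
    close("/tmp/aggregate.out")

    # unlock
    system("rmdir /tmp/awk_end.lock")
}'
```

The clunky part is the busy-wait loop and the fact that the lock directory leaks if an instance dies inside the END block, which is why I'm hoping there is a cleaner standard mechanism.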