I want to do a calculation based on two data files. The calculation is memory-heavy, so I cannot do it all at once. I split the job into 200 pieces and run the calculation on each piece; the results are combined afterwards.
I automated this in a Makefile:
.PHONY: SPLITS QOAC
.SECONDARY: QOAC SPLITS

NSETS = 200
DSETS := $(patsubst %,cache/split_%.rds,$(shell seq 1 1 $(NSETS)))
QSETS := $(patsubst %,cache/qoac_%.rds,$(shell seq 1 1 $(NSETS)))

QOAC: $(QSETS)
SPLITS: $(DSETS)

$(DSETS): split_files.R data/1 data/2
	Rscript $< $(NSETS)

cache/qoac_%.rds: calc_qoac.R cache/split_%.rds
	Rscript $^

bigfile: combine.R QOAC
	Rscript $<
In this example, NSETS pieces are generated by split_files.R, which reads data/1 and data/2. The sets are saved in cache/split_*.rds.
For every split_*, qoac_* is computed using calc_qoac.R. As these processes are isolated, they can be run in parallel by running make -j.
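To make the pattern rule concrete, this is roughly what it amounts to for a single piece. The piece number 7 is picked only for illustration; `$^` expands to every prerequisite of the rule, so the script receives its own name's neighbour, the split file, as its argument.

```make
# Illustration only: the pattern rule written out for piece 7.
# $^ lists all prerequisites, so Rscript gets calc_qoac.R followed by
# cache/split_7.rds.
cache/qoac_7.rds: calc_qoac.R cache/split_7.rds
	Rscript calc_qoac.R cache/split_7.rds
```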
My problem is that if one or more of the split_* files are missing, split_files.R is run multiple times.
When I add .NOTPARALLEL: SPLITS, the entire Makefile is run serially, which slows things down.
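The repeated runs can be reproduced in a much smaller sketch. This assumes GNU Make; part_1.txt and part_2.txt are hypothetical names, not files from my project. As far as I understand, an explicit rule with several targets is read as one independent rule per target, so with both files missing, `make -j2` can start the recipe once for each of them.

```make
# Hypothetical minimal reproduction; part_1.txt and part_2.txt are
# illustrative names only.
all: part_1.txt part_2.txt

# One rule line, but make treats it as a separate rule for each target.
# Under `make -j`, every missing target can launch its own copy of the
# recipe, even though a single run creates both files.
part_1.txt part_2.txt:
	@echo "recipe started for $@"
	touch part_1.txt part_2.txt
```

Running `make -j2` with neither file present should show whether the recipe is started more than once on a given setup.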
How can I make sure the generation of the sets is done only once when needed?