I want to do a calculation based on two data files. The calculation is memory-heavy, so I cannot do it all at once. I split the job into 200 pieces and run the calculation on each piece; the results are combined afterwards.
I automated this in a Makefile:

.PHONY: SPLITS QOAC
.SECONDARY: QOAC SPLITS

NSETS = 200
DSETS := $(patsubst %,cache/split_%.rds,$(shell seq 1 1 $(NSETS)))
QSETS := $(patsubst %,cache/qoac_%.rds,$(shell seq 1 1 $(NSETS)))

QOAC: $(QSETS)
SPLITS: $(DSETS)

$(DSETS): split_files.R data/1 data/2
    Rscript $< $(NSETS)

cache/qoac_%.rds: calc_qoac.R cache/split_%.rds
    Rscript $^

bigfile: combine.R QOAC
    Rscript $<

In this example, NSETS pieces are generated by split_files.R, which reads data/1 and data/2. The sets are saved in cache/split_*.rds.
For every split_*, qoac_* is computed using calc_qoac.R. As these processes are isolated, they can be run in parallel by running make -j.
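My problem is that if one or more of the split_* files are missing, split_files.R is run multiple times under make -j, once for each missing file, even though a single run regenerates all of them.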

When I add .NOTPARALLEL: SPLITS, the entire Makefile runs serially, which slows everything down.

How can I make sure the generation of the sets is done only once when needed?

Jasper

1 Answer

I managed to get it working by following this link.
The approach described there uses a phony target. I thought I was already doing this with SPLITS and QOAC, but this is the version that solved it:

cache/split_%.rds: SPLITS

SPLITS: split_files_qoac.R data/INR_data.rds data/patient_data.rds
    Rscript $< $(NSETS)

cache/qoac_%.rds: calc_qoac.R cache/split_%.rds
    Rscript $^
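
One caveat with the rules above: because SPLITS is phony, make treats it as out of date on every run, so the split step is not skipped when the split files already exist. A possible alternative, if GNU Make 4.3 or newer is available, is a grouped target, which tells make that one run of the recipe produces all of the split files. The following is only a sketch under that assumption. It reuses the file names from the question (split_files.R, data/1, data/2), the version guard at the top is my own addition, and recipe lines must start with a tab character.

```makefile
# Sketch only: requires GNU Make 4.3+ (grouped explicit targets).
# Stop early on older versions, which do not understand "&:".
ifeq ($(filter grouped-target,$(.FEATURES)),)
$(error This Makefile needs GNU Make 4.3 or newer for grouped targets)
endif

NSETS := 200
DSETS := $(patsubst %,cache/split_%.rds,$(shell seq 1 $(NSETS)))
QSETS := $(patsubst %,cache/qoac_%.rds,$(shell seq 1 $(NSETS)))

.PHONY: QOAC
QOAC: $(QSETS)

# "&:" declares that a single run of the recipe creates every file in DSETS,
# so split_files.R is started at most once, even under "make -j".
$(DSETS) &: split_files.R data/1 data/2
	Rscript $< $(NSETS)

# One independent job per piece; these can still run in parallel.
cache/qoac_%.rds: calc_qoac.R cache/split_%.rds
	Rscript $^
```

With this layout, make -j QOAC should run the split step once if any split file is missing or older than its inputs, and then run the calc_qoac.R jobs in parallel. The split files are real targets here rather than products of a phony target, so nothing is regenerated when everything is already up to date.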
Jasper