I'm using make to automate some of my data analysis. I have several directories, each containing a different realization of the data; each realization consists of several files representing the state of the data at a given time, like so:

├── a
│   ├── time_01.dat
│   ├── time_02.dat
│   ├── time_03.dat
│   └── ...
├── b
│   ├── time_01.dat
│   └── ...
├── c
│   ├── time_01.dat
│   └── ...
├── ...

The number of datafiles in each directory is not known in advance, and more can be added at any time. The files follow the same naming convention in every directory.

I want to use make to run the exact same set of recipes in each directory (to analyze each dataset separately and uniformly). In particular, there is one script that should run any time a new datafile is added; it creates an output file (analysis_time_XX.txt) for each datafile in the directory. The script does not update any previously created output files, but it does create all the missing ones. Refactoring this script is not a possibility, unfortunately.

So I have one recipe creating many targets, yet it must run separately for each directory. The solutions I've found to create multiple targets with one recipe (e.g. here) do not work in my case, as I need one rule to do this separately for multiple sets of files in separate directories.

These intermediate files are needed for their own sake (as they help validate the data collected), but are also used to create a final comparison plot between the datasets.

My current setup is an ugly combination of functions and .SECONDEXPANSION:

dirs = a b c

datafiles = $(foreach dir,$(dirs),$(wildcard $(dir)/*.dat))

df_to_analysis = $(subst .dat,.txt,$(subst time_,analysis_time_,$(1)))
analysis_to_df = $(subst .txt,.dat,$(subst analysis_time_,time_,$(1)))
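# e.g. $(call df_to_analysis,a/time_01.dat) -> a/analysis_time_01.txt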

analysis_files = $(foreach df,$(datafiles),$(call df_to_analysis,$(df)))

all: final_analysis_plot.png

.SECONDEXPANSION:
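# Static pattern rule: each analysis file depends on its own datafile.
# Note that the recipe actually creates every missing analysis file in
# $(dir $@), not just $@; make does not know that.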
$(analysis_files): %: $$(call analysis_to_df,%)
    python script.py $(dir $@)

final_analysis_plot.png: $(analysis_files)
    python make_plot.py $(analysis_files)

Note that script.py creates all of the analysis_time_XX.txt files in the given directory. The flaw in this setup is that make does not know that a single run of the script generates all of the targets, so under parallel make it launches the recipe once per target and several copies of the script can run in the same directory at once. For my application parallel make is a necessity: the scripts have long runtimes, and since the problem is embarrassingly parallel, parallelization saves a lot of time.

Is there an elegant way to fix this issue? Or even an elegant way to clean up the code I have now? I've shown a simple example here, which already requires a good bit of setup, and doing this for several different scripts gets unwieldy quickly.

1 Answer

I think in your case there's no need to bother with rules for the individual .txt files. If script.py were nicer and could work per-file, there would be value in writing per-file rules. As it is, we can introduce an intermediate per-directory .done file instead.

DATA_DIRS := a b c
# A directory/.done.analysis file means that `script.py` was run here.
DONE_FILES := $(DATA_DIRS:%=%/.done.analysis)
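# e.g. DONE_FILES = a/.done.analysis b/.done.analysis c/.done.analysis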

# .done.analysis depends on all the source data files.
# When a .dat file is added or changes, it will be newer than
# the .done.analysis file, and the analysis will be re-run.
# The wildcard must be expanded per target, once the stem is known,
# hence .SECONDEXPANSION and the escaped $$(wildcard ...).
.SECONDEXPANSION:
$(DONE_FILES): %/.done.analysis: $$(wildcard $$*/*.dat)
    python script.py $(@D)
    touch $@

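# The wildcard below is expanded only when the recipe runs, i.e. after
# every .done.analysis target (and thus every analysis file) is up to date.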
final_analysis_plot.png: $(DONE_FILES)
    python make_plot.py $(wildcard $(DATA_DIRS:%=%/analysis_time_*.txt))
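
If you can require GNU make 4.3 or newer, grouped targets (&:) address the underlying problem directly: they declare that a single recipe invocation produces all of the listed targets. A minimal sketch, reusing the dirs and df_to_analysis definitions from the question, and assuming every directory contains at least one .dat file:

# Sketch: one grouped-target rule per directory (requires GNU make 4.3+).
define analysis_rule
# All analysis files in $(1) are produced by one run of script.py.
$(call df_to_analysis,$(wildcard $(1)/*.dat)) &: $(wildcard $(1)/*.dat)
    python script.py $(1)
endef
$(foreach d,$(dirs),$(eval $(call analysis_rule,$(d))))

With &: in place, parallel make runs script.py exactly once per directory, and the individual analysis_time_XX.txt files remain real targets that final_analysis_plot.png can list as prerequisites.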
Victor Sergienko