I am writing a GNUmakefile to create a workflow to analyse some biological sequence data. The data comes in a format called fastq and is then run through a number of cleaning and analysis tools. I have attached what I have written so far, which takes me from quality control before cleaning through to quality control afterwards. My problem is that I'm not sure how to get the 'fastqc' commands to run, as their targets are not prerequisites of any other step in the workflow.

%_sts_fastqc.html %_sts_fastqc.zip: %_sts.fastq
    # perform quality control after cleaning reads
    fastqc $^

%_sts.fastq: %_st.fastq
    # trim reads based on quality
    sickle se -f $^ -t illumina -o $@

%_st.fastq: %_s.fastq
    # remove contaminated reads
    tagdust -s adapters.fa $^

%_s.fastq: %.fastq
    # trim adapters
    scythe -a adapters.fa -o $@ $^

%_fastqc.html %_fastqc.zip: %.fastq
    # perform quality control before cleaning reads
    fastqc $^

%.fastq: %.sra
    # convert .sra to .fastq
    fastq-dump $^
Leandro Papasidero
jma1991

2 Answers

1

I believe adding these lines to the start of your Makefile will do what you are asking for:

SOURCES:=$(wildcard *.sra)
TARGETS:=$(SOURCES:.sra=_fastqc.html) $(SOURCES:.sra=_fastqc.zip)\
     $(SOURCES:.sra=_sts_fastqc.html) $(SOURCES:.sra=_sts_fastqc.zip)

.PHONY: all
all: $(TARGETS)

This grabs all .sra files from the file system and builds the list of targets by replacing the .sra extension with the suffixes needed to produce each target name. (Note that because the html and zip targets are produced by the same command, I could have listed just one or the other, but I've decided to list both in case the rules change and the html and zip targets are ever produced separately.) The phony all target then depends on all the computed targets, so running make builds them.

Here is a Makefile I've modified from yours by adding @echo everywhere, which I used to check that things were okay without having to run the actual commands in your Makefile. You could copy and paste it into a file to first check that everything is fine before modifying your own Makefile with the lines above. Here it is:

SOURCES:=$(wildcard *.sra)
TARGETS:=$(SOURCES:.sra=_fastqc.html) $(SOURCES:.sra=_fastqc.zip)\
     $(SOURCES:.sra=_sts_fastqc.html) $(SOURCES:.sra=_sts_fastqc.zip)

.PHONY: all
all: $(TARGETS)

%_sts_fastqc.html %_sts_fastqc.zip: %_sts.fastq
# perform quality control after cleaning reads
    @echo fastqc $^

%_sts.fastq: %_st.fastq
# trim reads based on quality
    @echo sickle se -f $^ -t illumina -o $@

%_st.fastq: %_s.fastq
# remove contaminated reads
    @echo tagdust -s adapters.fa $^

%_s.fastq: %.fastq
# trim adapters
    @echo 'scythe -a <adapters.fa> -o $@ $^'

%_fastqc.html %_fastqc.zip: %.fastq
# perform quality control before cleaning reads
    @echo fastqc $^

%.fastq: %.sra
# convert .sra to .fastq
    @echo fastq-dump $^

I tested it here by running touch a.sra b.sra and then running make. It ran the commands for both files.
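
To see what the substitution references expand to without touching any files, a small throwaway Makefile along these lines may help. It is only a sketch: the names a.sra and b.sra are hard-coded stand-ins for what $(wildcard *.sra) would return, and only the html targets are listed to keep it short.

```make
# Hypothetical file names, hard-coded so the example does not depend on
# what is in the current directory.
SOURCES := a.sra b.sra
TARGETS := $(SOURCES:.sra=_fastqc.html) $(SOURCES:.sra=_sts_fastqc.html)

# $(info ...) prints while the Makefile is being read, before any rule runs.
$(info SOURCES = $(SOURCES))
$(info TARGETS = $(TARGETS))

# Empty recipe: nothing is built, we only want the printed lists.
.PHONY: all
all: ;
```

Running make on this prints the line TARGETS = a_fastqc.html b_fastqc.html a_sts_fastqc.html b_sts_fastqc.html, which is the list that all would depend on.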

Louis
  • I have followed your advice and everything works perfectly. As an aside, I am running my makefile using the following command: 'nohup make -f mymakefile.GNUmakefile -j 7 &'. This dumps all the output from all files which would normally go to the stdout into the nohup file, unfortunately some of the programs used don't report which file they work on, just the results, therefore I can't track which stats are for which .fastq file. Any way to rectify this? – jma1991 Oct 18 '14 at 14:58
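
One possible way to keep track of which output belongs to which file, sketched here under two assumptions: that GNU make 4.0 or later is available (for the --output-sync option), and that a banner line printed before each tool is acceptable. The banner text is made up for illustration.

```make
# Sketch only: the pre-cleaning QC rule from the question, with a banner
# that names the input file. $< is the first prerequisite (the .fastq file).
# Recipe lines must start with a tab character.
%_fastqc.html %_fastqc.zip: %.fastq
	@echo "=== fastqc: $< ==="
	fastqc $<

# Invocation. With --output-sync=target, make collects the output of each
# target's recipe and prints it in one piece, so parallel jobs do not
# interleave in nohup.out:
#   nohup make -f mymakefile.GNUmakefile -j 7 --output-sync=target &
```

The same banner line can be added to the other rules. Without --output-sync the banners are still printed, but under -j 7 they can end up separated from the tool output they belong to.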
0

Instead of using patterns, I would use a 'define':

# 'all' is not a file
.PHONY: all 
# a list of 4 samples
SAMPLES=S1 S2 S3 S4

# Define a macro named analyzefastq. It takes one argument, $(1).
# Automatic variables are written as '$$^' so that they survive the expansion done by $(call) and reach $(eval) as '$^'.
define analyzefastq 
# create a .st.fastq from fastq for file $(1)
$(1).st.fastq  : $(1).fastq
    tagdust -s adapters.fa $$^
# create a .fastq from the .sra for file $(1)
$(1).fastq : $(1).sra
    fastq-dump $$^

endef

# all: the final target; its prerequisites are all samples with the suffix '.st.fastq'
all: $(addsuffix .st.fastq,${SAMPLES})

## Loop over each sample; the loop variable is 'S'. $(call) expands the macro with the sample name as its argument and $(eval) parses the result as Makefile rules.
$(foreach S,${SAMPLES},$(eval $(call analyzefastq,$(S))) )
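
As a sketch of how the quality-control step could be handled in the same style, the macro below adds a fastqc rule per sample. The macro name qcfastq is made up for illustration, and the .html report is assumed to stand in for both fastqc outputs. It also assumes each $(1).fastq either exists or is produced by a rule such as the one in analyzefastq above.

```make
# 'all' is not a file
.PHONY: all
SAMPLES = S1 S2 S3 S4

# Hypothetical macro: one fastqc rule per sample.
# Only the .html report is listed as a target (assumption, see lead-in).
# The recipe line must start with a tab character.
define qcfastq
$(1)_fastqc.html: $(1).fastq
	fastqc $$<
endef

# The QC reports are listed as prerequisites of 'all', so they get built
# even though no other rule depends on them.
all: $(addsuffix _fastqc.html,$(SAMPLES))

$(foreach S,$(SAMPLES),$(eval $(call qcfastq,$(S))))
```

The point that matters for the original question is the all line: a target that nothing else depends on only gets built if it is requested, either on the command line or as a prerequisite of the default goal.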

I also use my tool jsvelocity https://github.com/lindenb/jsvelocity to generate large Makefiles for NGS:

https://gist.github.com/lindenb/3c07ca722f793cc5dd60

Pierre
  • Would you be able to comment your code so I can get a grasp of what each line does? I'm just starting to learn make for use in workflows (mainly influenced by http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/) – jma1991 Oct 17 '14 at 14:30
  • Thank you for the updated comments. If I understand correctly, my quality control commands should be in a different macro than that of analyzefastq, as within that macro they still follow a pattern rule? – jma1991 Oct 17 '14 at 14:56
  • you can put all your sample-specific methods in the same macro (here 'analyzefastq'). Patterns are complex because something like `%_fastqc.html %_fastqc.zip: %.fastq` is not what you expect: the command will be invoked twice. one for *.html, a second for *.zip. See http://stackoverflow.com/questions/2973445 – Pierre Oct 17 '14 at 15:11
  • @Pierre you are mistaken about the meaning of `%_fastqc.html %_fastqc.zip: %.fastq`: pattern rules are actually the _only_ way to get a single command that outputs multiple targets. With non-pattern rules, your statement is correct. See http://www.cmcrossroads.com/article/rules-multiple-outputs-gnu-make – Eric Melski Oct 17 '14 at 17:02
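
The pattern-rule behaviour described in the last comment can be checked with a small test Makefile. This is a sketch that assumes a file a.fastq exists (for example after touch a.fastq); echo and touch stand in for the real fastqc call.

```make
# Assumption: a.fastq exists in the current directory.
.PHONY: all
all: a_fastqc.html a_fastqc.zip

# One pattern rule with two targets: make treats both targets as being
# produced by a single run of the recipe. $* is the stem matched by '%'.
# Recipe lines must start with a tab character.
%_fastqc.html %_fastqc.zip: %.fastq
	@echo "fastqc $<"
	@touch $*_fastqc.html $*_fastqc.zip
```

Running make here prints the fastqc line once, not twice. An explicit rule written as a_fastqc.html a_fastqc.zip: a.fastq would instead declare two independent targets that share a recipe, which is the case the earlier comment describes. Newer releases (GNU make 4.3 and later) also offer the &: separator to group explicit targets.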