1

I'm working on a very common set of commands used to analyze RNA-seq data. However, since this question is not specific to bioinformatics, I've chosen to post here instead of BioStars, etc.

Specifically, I am trimming Illumina Truseq adapters from paired end sequencing data. To do so, I use Trimmomatic 0.36.

I have two input files:

S6D10MajUnt1-1217_S12_R1_001.fastq.gz
S6D10MajUnt1-1217_S12_R2_001.fastq.gz

And the command generates five output files:

S6D10MajUnt1-1217_S12_R1_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R1_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.paired.fq.gz
S6D10MajUnt1-1217_S12_R2_001.unpaired.fq.gz
S6D10MajUnt1-1217_S12.trimlog

I'm trying to write a python or bash script to recursively loop over all the contents of a folder and perform the trim command with appropriate files and outputs.

#!/bin/bash
for DR in *.fastq.gz
do
FL1=$(ls ~/home/path/to/files/${DR}*_R1_*.fastq.gz)
FL2=$(ls ~/home/path/to/files/${DR}*_R2_*.fastq.gz)
java -jar ~/data2/RJG_Seq/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 12 -phred33 -trimlog ~/data2/RJG_Seq/trimming/sample_folder/$FL1.trimlog ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL1 ~/data2/RJG_Seq/demultiplexing/sample_folder/$FL2 ~/data2/RJG_Seq/trimming/sample_folder/$FL1.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL1.unpair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.pair.fq.gz ~/data2/RJG_Seq/trimming/sample_folder/$FL2.unpair.fq.gz ILLUMINACLIP:/media/RJG_Seq/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28
done

I believe there's something wrong with the way I am assigning and invoking FL1 and FL2, and ultimately I'm looking for help creating an excecutable command trim-my-reads.py or trim-my-reads.sh that could be modified to accept any arbitrarily named input R1.fastq.gz and R2.fastq.gz files.

anothermh
  • 9,815
  • 3
  • 33
  • 52
  • Is this actually Python? – Xantium Jan 24 '18 at 19:15
  • Its not a bash (or ksh) script because there is whitespace around the `=` in the assignments. Maybe that's the problem and you "just" got the language wrong? In most shells there is no whitespace allowed around the `=`. – cdarke Jan 24 '18 at 19:18
  • Well certainly it may be a bash for loop that wishes it was a python for loop. As I understand it, bash can't handle multiple input files without using C. – Riley J. Graham Jan 24 '18 at 19:26
  • Bash can handle multiple input files using `exec` (the file descriptor version, not the "replace program" version). See https://stackoverflow.com/questions/18351198/what-are-the-uses-of-the-exec-command-in-shell-scripts/18351547#18351547. But I don't think that is what the question is about. – cdarke Jan 24 '18 at 19:39
  • Why are you looping over those `*.fastq.gz` filenames when you don't use the filename (`$i`) anywhere in the body of the loop? – cdarke Jan 24 '18 at 19:47
  • If you want the globbing ("wildcards") to be expanded in `F1` and `F2` then you should remove the quotes. Both input files are mentioned as parameters to the `java` command, so do you need to run that more than once? – cdarke Jan 24 '18 at 19:56
  • Thanks for the suggestions and questions cdarke, that cleared up at least one of the things that is wrong here. I need to first assign the R1 and R2 fastq files to FL1 and FL2 and invoke them in the body of the for loop. I've edited the code to reflect this change. – Riley J. Graham Jan 24 '18 at 20:09

2 Answers2

1

You can write a simple python script to loop over all the files in a folder.

Note : I have assumed that the output files will be generated in a folder named "example"

import glob
for file in glob.glob("*.fastq.gz"):
    #here you'll unzip the file to a folder assuming "example"
    for files in glob.glob("/example/*"):
        #here you can parse all the files inside the output folder
Jaysheel Utekar
  • 1,171
  • 1
  • 19
  • 37
  • This is a helpful example, thanks! in my case I have two input files, Sample1_R1_.fastq.gz and Sample1_R2_.fastq.gz, each of which produces two files, as in Sample1_R1_paired.fastq.gz and Sample1_R1_unpaired.fastq.gz. My question is how to assign R1 and R2 to distinct variables in the loop, and then run the java command recursively for each pair of R1 and R2 sample files. – Riley J. Graham Jan 25 '18 at 15:59
1

Each pair of samples has a matching string (SN=sample N) A solution to this question in bash could be:

#!/bin/bash
#apply loop function to samples 1-12
for SAMPLE in {1..12} 
do
#set input file 1 to "FL1", input file 2 to "FL2"
FL1=$(ls ~path/to/input/files/_S${SAMPLE}_*_R1_*.gz)
FL2=$(ls ~path/to/input/files/_S${SAMPLE}_*_R2_*.gz)

#invoke java ,send FL1 and FL2 to appropriate output folders
java -jar ~/path/to/trimming/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE
-threads 12 -phred33 -trimlog ~/path/to/output/folders/${FL1}.trimlog
~/path/to/input/file1/${FL1} ~/path/to/input/file2/${FL2} 
~/path/to/paired/output/folder/${FL1}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL1}.unpair.fq.gz 
~/path/to/paired/output/folder/${FL2}.pair.fq.gz ~/path/to/unpaired/output/folder/${FL2}.unpair.fq.gz 
ILLUMINACLIP:/path/to/trimming/apps/Trimmomatic-0.36/TruSeq3-PE.fa:2:30:10 LEADING:5 TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:28

#add verbose option to track progress
echo "Sample ${SAMPLE} done"
done

This is an inelegant solution, because it requires the format I'm using. A better method would be to grep each filename and assign them to FL1, FL2 accordingly, because this would generalize the method. Still, this is what worked for me, and I can easily control which samples are subjected to the for loop, as long as I always have the _S * _ format in the filename strings.