I am trying to write a for loop that passes specific values from a csv file to the command inside the loop body, depending on which sample is being processed.

My situation is as follows: I have several directories containing genome sequences. The samples are numbered and the directories are named accordingly.

Dir 1 contains sample1_genome.fasta
Dir 2 contains sample2_genome.fasta
Dir 3 contains sample3_genome.fasta

The genome sequences have differing average read lengths, and the analysis needs to account for this. Therefore, I created a csv file containing the sample number and the corresponding average read length of the genome sequence. csv file example (first column = sample_no, 2nd column = avg_read_length):

1,130
2,134
3,129

Now, I want to loop through the directories, take the genome sequences as input and pass the respective average read length to the process.

My code is as follows:

for f in *
do
    shortbred_quantify.py --genome "$f/sample${f}_genome.fasta" --average_read_length *THE SAMPLE MATCHING VALUE FROM 2nd COLUMN* --results "results/quantify_results_sample${f}"
done

Can you help me out with this?

plicht
  • Your example is not a csv file, and if it doesn't have headers then don't include it. So... are you passing the csv file as input to a script (i.e. what is *?). – Allan Wind Dec 07 '21 at 14:14
  • I edited the table to csv format. I run the loop directly in the terminal. The asterisk stands for the directories containing the genome sequences of the samples. The directories are named after the samples, e.g. 1, 2, 3 – plicht Dec 07 '21 at 14:18

2 Answers

I would structure it along these lines:

# ./input is the csv file (sample_no,avg_read_length)
while IFS=, read -r sample read_length
do
    shortbred_quantify.py --genome "$sample/sample${sample}_genome.fasta" --avgreadBP "$read_length" --results "results/quantify_results_sample$sample"
done < ./input
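
Before launching the real jobs, the loop above can be tried as a dry run. The following is only a sketch under a few assumptions: the csv has no header, each line is sample_no,avg_read_length, and the genome for sample N sits at N/sampleN_genome.fasta as described in the question. The function name print_quantify_commands is made up for illustration, and the --avgreadBP flag is simply taken from the answer. It prints the commands instead of running them and skips samples whose genome file is missing.

```shell
# Dry-run sketch: print one shortbred_quantify.py command per csv row.
# Assumed layout (from the question): <N>/sample<N>_genome.fasta
# print_quantify_commands is a hypothetical helper name.
print_quantify_commands() {
    csv=$1
    # IFS=, splits each line on commas; -r keeps backslashes literal.
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS=, read -r sample read_length || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue    # skip blank lines
        genome="$sample/sample${sample}_genome.fasta"
        if [ ! -f "$genome" ]; then
            echo "skipping sample $sample: $genome not found" >&2
            continue
        fi
        # echo only prints the command; drop "echo" to actually run it
        echo shortbred_quantify.py --genome "$genome" \
            --avgreadBP "$read_length" \
            --results "results/quantify_results_sample$sample"
    done < "$csv"
}
```

Calling print_quantify_commands ./input from the directory that holds 1, 2, 3 lists what would be executed. One thing to watch when removing the echo: a command inside a while-read loop inherits the loop's standard input, so if it reads from stdin it can swallow the remaining csv lines; appending < /dev/null to the command avoids that.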
Allan Wind
  • Thanks a lot for your help! How can I then loop through the directories to take the particular sample files as input? I need to bring the read_length together with the respective sample file – plicht Dec 07 '21 at 14:38
  • You can replace your.csv with a glob (*/*.csv) or whatever. Your question is not really clear, so I suggest you update it to be more precise. Like do you need to select by read_length then use sample to identify the file? – Allan Wind Dec 07 '21 at 14:42
  • I updated my initial post – plicht Dec 07 '21 at 14:58
  • @plicht good job on refining the question. Did you have a chance to give my updated answer a whirl? – Allan Wind Dec 07 '21 at 21:40

Use awk. $1 is the first field (the sample number), $2 is the second (the read length). e.g.:

$ cat input
1,130
2,134
3,129
$ awk '$1 == sample { print $2 }' FS=, sample=2 input
134

So your command ends up looking like:

genome="$f/sample${f}_genome.fasta"
shortbred_quantify.py --genome "$genome" \
    --avgreadBP "$(awk '$1 == s { print $2 }' FS=, s="$f" input)" \
    --results "results/quantify_results_sample$f"

Don't forget to quote the filename.
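
To tie the awk lookup to the directory loop from the question, something along these lines might work. It is a sketch, not a tested pipeline: lookup_read_length and print_commands_for_dirs are hypothetical helper names, the csv is assumed to have no header, and the genome path N/sampleN_genome.fasta is taken from the question. The commands are only printed, so nothing is run until the echo is removed.

```shell
# Hypothetical helper: print the read length recorded for one sample number.
# $1 = sample number, $2 = csv file (sample_no,avg_read_length, no header)
# Exits non-zero when the sample is not listed in the csv.
lookup_read_length() {
    awk -F, -v s="$1" '
        $1 == s { print $2; found = 1; exit }
        END     { exit !found }
    ' "$2"
}

# Dry run over every directory in the current folder.
# $1 = csv file
print_commands_for_dirs() {
    for dir in */; do
        [ -d "$dir" ] || continue    # no directories: the glob stays literal
        f=${dir%/}                   # strip the trailing slash: "1/" -> "1"
        if ! len=$(lookup_read_length "$f" "$1"); then
            echo "no read length listed for '$f', skipping" >&2
            continue
        fi
        # echo only prints the command; drop "echo" to actually run it
        echo shortbred_quantify.py --genome "$f/sample${f}_genome.fasta" \
            --avgreadBP "$len" \
            --results "results/quantify_results_sample$f"
    done
}
```

Looping over */ rather than * restricts the loop to directories, so the csv file itself and other plain files are never mistaken for samples, and directories without a csv entry (such as results) are reported and skipped.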

William Pursell