I have a list of files:

catfish.fa
polar.fa
catfish.ids.txt
polar.ids.txt

I want to run a command on each pair of files that share the same prefix. For example, I'd like to run this:

cat catfish.fa | seqkit grep -f catfish.ids.txt > catfish.output.fa

Similarly...

cat polar.fa | seqkit grep -f polar.ids.txt > polar.output.fa

How can I run this command for each file pair in the directory and in parallel? Thanks for your help!

user3105519

3 Answers

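#!/bin/bash

for f in *.fa
do
   # Strip the extension: catfish.fa -> catfish
   filename="${f%.*}"
   # Only process .fa files that have a matching .ids.txt file
   if [ -e "${filename}.ids.txt" ]
   then
      cat "$f" | seqkit grep -f "${filename}.ids.txt" > "${filename}.output.fa"
   fi
done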

filename="${f%.*}" extracts the filename without its extension, see here for an explanation. The purpose of the if is to select only the files ending in .fa that have a corresponding .ids.txt file. If you want the pairs to be processed in parallel, append a & to the end of the cat ... line. (Beware of spawning too many parallel tasks!)
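
One way to keep the number of background tasks in check is to count them and pause when a limit is reached. The following is a minimal sketch, not a definitive implementation: the names max_jobs and process_pairs are introduced here for illustration, the limit of 4 is an arbitrary assumption, and wait -n requires bash 4.3 or newer.

```shell
#!/bin/bash
# Hypothetical cap on simultaneous jobs; tune it to your machine.
max_jobs=4

# process_pairs is an illustrative helper name, not part of seqkit.
process_pairs() {
    local f base running=0
    for f in *.fa; do
        base="${f%.fa}"
        # Skip .fa files without a matching .ids.txt file
        # (this also skips previously written *.output.fa files).
        [ -e "${base}.ids.txt" ] || continue
        # At the limit: block until any one background job finishes.
        if (( running >= max_jobs )); then
            wait -n
            running=$((running - 1))
        fi
        seqkit grep -f "${base}.ids.txt" < "$f" > "${base}.output.fa" &
        running=$((running + 1))
    done
    # Wait for the remaining background jobs before returning.
    wait
}

process_pairs
```

The final wait ensures the script does not exit while jobs are still writing their output files.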

francesco

With bash's Parameter Expansion:

for file in *.fa; do seqkit grep -f "${file%%.*}.ids.txt" >"${file%%.*}.output.fa" <"$file" & done
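
As a small illustration of how the expansion operators behave, the sketch below compares %, %% and an explicit suffix. The filename sample.v2.fa is a hypothetical example chosen because it contains two dots; for names with a single dot, such as catfish.fa, all three forms give the same result.

```shell
#!/bin/bash
file="sample.v2.fa"      # hypothetical filename with two dots

# % removes the shortest match of the pattern from the end.
echo "${file%.*}"        # prints sample.v2

# %% removes the longest match, i.e. everything from the first dot.
echo "${file%%.*}"       # prints sample

# Naming the suffix explicitly removes only a trailing .fa.
echo "${file%.fa}"       # prints sample.v2
```

If your filenames may contain additional dots, the explicit-suffix form is the least surprising of the three.
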
Cyrus

This will run one job per CPU core in parallel:

parallel 'cat {} | seqkit grep -f {.}.ids.txt > {.}.output.fa' ::: *.fa

May I suggest you run with --dry-run first, so you can see what will be run?

parallel --dry-run 'cat {} | seqkit grep -f {.}.ids.txt > {.}.output.fa' ::: *.fa

Also consider spending 20 minutes reading chapters 1 and 2 of the book GNU Parallel 2018 (print: http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html online: https://doi.org/10.5281/zenodo.1146014). Your command line will love you for it.

Ole Tange