I have a list of files in which each sample has two files: forward and reverse reads.

KIMS2021-01_R1.fastq.gz  KIMS2021-05_R2.fastq.gz  SRR1734377_1.fastq.gz  SRR6006898_2.fastq.gz  SRR6006903_1.fastq.gz
KIMS2021-01_R2.fastq.gz  KIMS2021-06_R1.fastq.gz  SRR1734377_2.fastq.gz  SRR6006899_1.fastq.gz  SRR6006903_2.fastq.gz
KIMS2021-02_R1.fastq.gz  KIMS2021-06_R2.fastq.gz  SRR6006895_1.fastq.gz  SRR6006899_2.fastq.gz  SRR6006904_1.fastq.gz
KIMS2021-02_R2.fastq.gz  SRR1734374_1.fastq.gz    SRR6006895_2.fastq.gz  SRR6006900_1.fastq.gz  SRR6006904_2.fastq.gz
KIMS2021-03_R1.fastq.gz  SRR1734374_2.fastq.gz    SRR6006896_1.fastq.gz  SRR6006900_2.fastq.gz  SRR6006905_1.fastq.gz
KIMS2021-03_R2.fastq.gz  SRR1734375_1.fastq.gz    SRR6006896_2.fastq.gz  SRR6006901_1.fastq.gz  SRR6006905_2.fastq.gz
KIMS2021-04_R1.fastq.gz  SRR1734375_2.fastq.gz    SRR6006897_1.fastq.gz  SRR6006901_2.fastq.gz  SRR6006906_1.fastq.gz
KIMS2021-04_R2.fastq.gz  SRR1734376_1.fastq.gz    SRR6006897_2.fastq.gz  SRR6006902_1.fastq.gz  SRR6006906_2.fastq.gz
KIMS2021-05_R1.fastq.gz  SRR1734376_2.fastq.gz    SRR6006898_1.fastq.gz  SRR6006902_2.fastq.gz

My objective is to pass these files as input, which is simple when all the files follow the same naming pattern. Here, however, the data comes from two different sources.

This is the command I run:

for i in $(ls *.fastq*.gz | sed 's/00[0-9]\.gz/.gz/' | rev | cut -c 17- | rev | uniq); do STAR --runMode alignReads --outSAMtype BAM SortedByCoordinate --runThreadN 30 --genomeDir /run/media/punit/data3/Santosh_star_index --readFilesIn  <(gunzip -c ${i}_R1_001.fastq.gz ${i}_R2_001.fastq.gz ) --outFileNamePrefix ${i%};done

The idea is that I should get one name for each pair of files.

The prefix extraction works for the files that start with SRR IDs, as I checked with:

for i in $(ls *.fastq*.gz | sed 's/00[0-9]\.gz/.gz/' | rev | cut -c 12- | rev | uniq); do echo $i; done

The output of the above is:

KIMS2021-01_
KIMS2021-02_
KIMS2021-03_
KIMS2021-04_
KIMS2021-05_
KIMS2021-06_
SRR1734374
SRR1734375
SRR1734376
SRR1734377
SRR6006895
SRR6006896
SRR6006897
SRR6006898
SRR6006899
SRR6006900
SRR6006901
SRR6006902
SRR6006903
SRR6006904
SRR6006905

Here I can see that the SRR IDs come out clean, whereas the KIMS names still carry a trailing `_`. Any suggestion on how to bring both to the same pattern so that I can run everything in one go?

The naive way is to run them as two separate sets rather than one, but I would like to learn how to handle names of different kinds or lengths.

UPDATE

This code does what I want, that is, it produces uniform names:

for i in $(echo *.fastq*.gz); do echo ${i%_*}; done | uniq

Now I want to use it with the rest of my command:

 do STAR --runMode alignReads --outSAMtype BAM SortedByCoordinate --runThreadN 30 --genomeDir /run/media/punit/data3/Santosh_star_index --readFilesIn  <(gunzip -c ${i}_R1_001.fastq.gz ${i}_R2_001.fastq.gz ) --outFileNamePrefix ${i%};done

My issue now is that I end up with two `do` keywords, which won't work. How do I pipe the names into the command?

PesKchan
  • You need to remove the last `_` and everything after it from the file name. This can be achieved with `${i%_*}`. – Yuri Ginsburg Dec 20 '21 at 06:37
  • Dumb question: how do I achieve that? Do I just rename all my files? – PesKchan Dec 20 '21 at 06:44
  • As an aside, that's a [useless use of `ls`](https://www.iki.fi/era/unix/award.html#ls) – tripleee Dec 20 '21 at 06:51
  • for i in $(ls *.fastq*.gz | sed 's/00[0-9]\.gz/.gz/' | rev | cut -c 12- | rev | uniq); do echo ${i%_*}; done – PesKchan Dec 20 '21 at 06:52
  • @tripleee For a biologist, these simple things are a real grind; I try just enough to get it done. – PesKchan Dec 20 '21 at 06:53
  • Using `printf` instead of `ls` is not more complex, it's less complex, and much less error-prone. In general, [never use `ls` in scripts](https://mywiki.wooledge.org/ParsingLs), precisely because of all the complications. – tripleee Dec 20 '21 at 06:54
  • I'm not entirely sure what exactly your question is here; if the duplicate does not answer your question, perhaps try to [edit] this into a clearer and more focused question. If you do that, feel free to ping me here like @tripleee to get this reopened. – tripleee Dec 20 '21 at 06:56
  • for i in $(ls *.fastq*.gz | sed 's/00[0-9]\.gz/.gz/' | rev | cut -c 17- | rev | uniq); do STAR --runMode alignReads --outSAMtype BAM SortedByCoordinate --runThreadN 30 --genomeDir /run/media/punit/data3/Santosh_star_index --readFilesIn <(gunzip -c ${i}_R1.fastq.gz ${i}_R2.fastq.gz ) --outFileNamePrefix ${i%};done this is what i have to run let me try first then will ping – PesKchan Dec 20 '21 at 07:01
  • @tripleee Question in close resolution is about adding characters not removing. – Yuri Ginsburg Dec 20 '21 at 07:09
  • Well, I see that I can't use `do` twice in that loop, which is what I wanted to do. – PesKchan Dec 20 '21 at 07:11
  • @YuriGinsburg The criteria for closing for duplicate is whether the _answers_ are substantially the same, not the questions. See e.g. https://meta.stackexchange.com/questions/194476/someone-flagged-my-question-as-already-answered-but-its-not/194495#194495 – tripleee Dec 20 '21 at 07:14
  • @YuriGinsburg i have added a comment to your answer ... – PesKchan Dec 20 '21 at 07:14
  • @tripleee It partially helped me. I would be really glad if you could check my updates. – PesKchan Dec 20 '21 at 08:11
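
For reference, the `${i%_*}` expansion suggested in the comments removes the shortest suffix that matches `_*`, so both naming schemes collapse to the same kind of prefix. A minimal sketch, where `sample_prefix` is a hypothetical helper name and the file names are taken from the listing in the question:

```shell
# Hypothetical helper: print the sample prefix of a FASTQ file name by
# removing the last "_" and everything after it.
sample_prefix() {
    printf '%s\n' "${1%_*}"
}

# Both mates of a pair yield the same prefix, so uniq keeps one line per sample.
for f in KIMS2021-01_R1.fastq.gz KIMS2021-01_R2.fastq.gz \
         SRR1734374_1.fastq.gz SRR1734374_2.fastq.gz; do
    sample_prefix "$f"
done | uniq
```

No renaming of files is needed; the expansion only changes the string held in the variable.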

1 Answer

You can try something like this

for i in $(echo *.fastq*.gz); do echo ${i%_*}; done | uniq

Edit: replaced `ls` with `echo`. @tripleee is right, this is more reliable.
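
To connect the prefix to the actual tool, one option is to avoid the second loop entirely: iterate over the forward-read files only, derive the prefix and the mate file inside the loop body, and call the tool there. The following is only a dry-run sketch under some assumptions: `print_star_commands` is a hypothetical helper name, the reads are assumed to be paired-end and passed to STAR as two separate files, and `--readFilesCommand gunzip -c` is used in place of the process substitution from the question (please check the STAR manual before relying on it). It prints the commands instead of running them.

```shell
# Dry run: print one STAR command per sample found in directory "$1"
# (default: current directory). Remove the "echo" to actually run STAR.
print_star_commands() {
    dir=${1:-.}
    # Loop over forward reads only; both naming schemes are covered.
    for r1 in "$dir"/*_R1.fastq.gz "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue      # skip a pattern that matched nothing
        prefix=${r1%_*}               # drop the last "_" and everything after it
        case $r1 in
            *_R1.fastq.gz) r2=${prefix}_R2.fastq.gz ;;
            *)             r2=${prefix}_2.fastq.gz ;;
        esac
        [ -e "$r2" ] || { echo "missing mate for $r1" >&2; continue; }
        echo STAR --runMode alignReads \
            --outSAMtype BAM SortedByCoordinate \
            --runThreadN 30 \
            --genomeDir /run/media/punit/data3/Santosh_star_index \
            --readFilesCommand gunzip -c \
            --readFilesIn "$r1" "$r2" \
            --outFileNamePrefix "$prefix"
    done
}

print_star_commands .
```

Because the prefix is computed inside the loop, there is no need for `uniq` or for a second `do`.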

tripleee
Yuri Ginsburg
  • Yeah, I did that and it worked; I posted it in my comment. Thank you for giving me a new trick to work with files. – PesKchan Dec 20 '21 at 06:54
  • Glad that I could help. – Yuri Ginsburg Dec 20 '21 at 06:56
  • `for i in $(ls *.fastq*.gz); do echo ${i%_*}; done | uniq; do STAR --runMode alignReads --outSAMtype BAM SortedByCoordinate --runThreadN 30 --genomeDir /run/media/punit/data3/Santosh_star_index --readFilesIn <(gunzip -c ${i}_R1.fastq.gz ${i}_R2.fastq.gz ) --outFileNamePrefix ${i%};done` This is what I want to run, but I now see that I can't use `do` twice. How do I fix this? – PesKchan Dec 20 '21 at 07:12
  • That part is sorted, but I am making a mistake in the step where I pipe the file names to the actual tool. – PesKchan Dec 20 '21 at 07:15
  • Also [don't read lines with `for`](http://mywiki.wooledge.org/DontReadLinesWithFor) – tripleee Dec 20 '21 at 08:14
  • `for i in $(echo *.fastq*.gz)` is wrong as well. The correct way is `for i in *.fastq*.gz`. – M. Nejat Aydin Dec 21 '21 at 00:00
  • @M.NejatAydin Is wrong? Does it produce incorrect result for file name formats specified by OP? – Yuri Ginsburg Dec 21 '21 at 01:29
  • It invokes a subshell unnecessarily and, more importantly, it may produce a wrong file list if a filename contains blank characters or glob characters. – M. Nejat Aydin Dec 21 '21 at 01:46
  • @M.NejatAydin OP provided two strict formats for input file names. Neither of them contains spaces. – Yuri Ginsburg Dec 21 '21 at 01:51
  • @PesKchan I think the problems with your pipeline are a subject for a separate question. Please post it and I will try to help. – Yuri Ginsburg Dec 21 '21 at 01:53
  • Then why did you replace `ls` with `echo`? There is no visible difference between `for i in $(echo *.fastq*.gz)` and `for i in $(ls *.fastq*.gz)`. – M. Nejat Aydin Dec 21 '21 at 02:18
  • @M.NejatAydin For two reasons: 1. another comment suggested that. 2. `ls` invokes an external command, `echo` does not. – Yuri Ginsburg Dec 21 '21 at 02:24
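
The difference discussed in this thread can be seen with a small experiment. This is only an illustrative sketch: the scratch directory and the file name `sample A_R1.fastq.gz` are made up for the demonstration and do not occur in the question's data, where both forms would give the same result.

```shell
# Scratch directory holding one file whose (hypothetical) name contains a space.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample A_R1.fastq.gz"
cd "$demo_dir" || exit 1

# Plain glob: the shell hands over each matching file name as one word.
count_glob=0
for f in *.fastq*.gz; do count_glob=$((count_glob + 1)); done

# Command substitution: the output of echo is split again on whitespace.
count_echo=0
for f in $(echo *.fastq*.gz); do count_echo=$((count_echo + 1)); done

echo "glob: $count_glob, echo: $count_echo"

cd - >/dev/null || exit 1
rm -rf "$demo_dir"
```

With the single file above, the glob loop runs once while the `$(echo ...)` loop runs twice, because the name is split at the space.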