I have a couple of files, which look like this:

1_150901_AC7GLHANXX_P2258_101_1.fastq.gz
1_150901_AC7GLHANXX_P2258_101_2.fastq.gz
2_150901_AC7GLHANXX_P2258_101_1.fastq.gz
2_150901_AC7GLHANXX_P2258_101_2.fastq.gz

... i.e., there are two files that start with 1_ and end in either _1.fastq.gz or _2.fastq.gz, and the same for two files that start with 2_. What I want to do is to cat the two files ending in _1.fastq.gz, like this:

cat 1_150901_AC7GLHANXX_P2258_101_1.fastq.gz \
    2_150901_AC7GLHANXX_P2258_101_1.fastq.gz \
    > 150901_AC7GLHANXX_P2258_101_1.fastq.gz

... so that they are merged and have their prefix removed. I have a lot more files in a lot more folders than this, so I want to automate it. I tried the following code, to no avail:

for f in *_*_1.fastq.gz
do
    cat $f "${f/^1_/2_}" > "${f/^1_/}"
done

I don't think I know this replacement method well enough, but it's what I have used in the past for less complicated filenames (when they only have different suffixes, and no prefix). I think that the ^ at the beginning signifies the start of the filename, but it doesn't seem to work the way I want it to, so obviously I'm doing something wrong. I tried doing some troubleshooting:

for f in *_*_1.fastq.gz
do
    echo "${f/^1_/}"
done

... gives me ...

1_150901_AC7GLHANXX_P2258_101_1.fastq.gz
2_150901_AC7GLHANXX_P2258_101_1.fastq.gz

... which is not what I thought it would be. Does anybody know how I could do this?

[Edit, clarify non-duplicated question]

This question is different from my previous question in that I also have a prefix for the filenames, and that prefix also exists in the middle of the filename. The other question had a simpler case, where only a suffix was what needed to be renamed.

erikfas
  • In this context, the beginning of string anchor is the `#` character, not `^`. See the [relevant section of the manual](http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion) (you have to scroll to the relevant part of the section that covers the `${parameter/pattern/string}` expansion). – gniourf_gniourf Sep 15 '15 at 08:16
  • Ah, okay! Yeah, the `#` does work in this context, thanks! – erikfas Sep 15 '15 at 09:05
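To illustrate the comment above, here is a small sketch (assuming bash, since `${var/pattern/string}` is not available in plain POSIX sh) comparing `^` and `#` on one of the filenames from the question:

```shell
f=1_150901_AC7GLHANXX_P2258_101_1.fastq.gz

# '^' is not an anchor in this expansion; it is matched as a literal
# character, so the name comes back unchanged.
echo "${f/^1_/}"

# A leading '#' anchors the pattern at the start of the value.
echo "${f/#1_/}"      # prefix removed
echo "${f/#1_/2_}"    # prefix swapped from 1_ to 2_

# For plain prefix removal, ${f#pattern} does the same job.
echo "${f#1_}"
```

Because the pattern is anchored, the `_1` in the middle and at the end of the name is left alone.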

2 Answers

2

Find the "1"s, then check for the "2"s; if both exist, cat them together and delete the parts.

for f in 1_*_1.fastq.gz
do
      g="2_${f#1_}"
      if [ -f "$g" ]
      then
            cat "$f" "$g" > "${f#1_}" && rm "$f" "$g"
      fi
done
Jasen
  • That did it, thanks a lot! I also had to change the for loop to `1_*_*.fastq.gz` because I was only getting the `_1.fastq.gz` files and not the `_2.fastq.gz`, but that was my initial fault in the question. – erikfas Sep 15 '15 at 09:05
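For trying this out safely, one possible variant is sketched below. It is not the answerer's code: `merge_pairs` is a hypothetical helper name, the glob is widened to `1_*.fastq.gz` so that both read files are covered (as the comment above describes), and the `rm` is left out so the inputs survive while you check the result.

```shell
# merge_pairs DIR: hypothetical helper; merges each 1_<stem> with its
# matching 2_<stem> into <stem> inside DIR, keeping the originals.
merge_pairs() {
    (
        cd "$1" || exit 1
        for f in 1_*.fastq.gz; do
            [ -f "$f" ] || continue          # the glob matched nothing
            stem=${f#1_}                     # strip the leading 1_
            g="2_$stem"
            if [ -f "$g" ]; then
                cat "$f" "$g" > "$stem"
            else
                echo "skipping $f: no matching $g" >&2
            fi
        done
    )
}
```

Calling `merge_pairs /path/to/folder` (the path is a placeholder) processes one folder; the subshell means the caller's working directory is not changed. Once the merged files look right, the `1_` and `2_` parts can be removed separately.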
0

If I were in your position, assuming there are only files with this format in the directory, I'd go for a procedure like this:

$ ls | cut -b3- | sort -u | tee stems.lst # list the stems
$ while read stem; do cat *_$stem > $stem; done <stems.lst

Try this in a test directory before you go live, else you'll make a mess of the filenames and it'll be a pain to recover.

Closing notes:

  • trick: it's a bit inconvenient here because of the redirections, but it's safer to try out the while command in non-destructive mode first, by running some form of echo "cat *_$stem > $stem" before replacing it with the real thing.
  • don't forget to remove stems.lst afterwards
  • if it's stable and you need to repeat, you can pipe the stem list directly from sort -u to while
  • (if this question is useful to anyone else in the same situation) if your filenames contain anything weird, put double quotes around $stem everywhere in the while line
JB.
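The closing notes of the last answer (dry run first, pipe straight from `sort -u`, quote `$stem`) could be combined roughly as follows. This is only a sketch: `merge_by_stem` and its `run` argument are hypothetical names, and like the original it assumes every file in the directory has a one-character prefix followed by `_`.

```shell
# merge_by_stem DIR [run]: hypothetical helper. Without "run" it only
# prints the commands it would execute; with "run" it executes them.
merge_by_stem() {
    (
        cd "$1" || exit 1
        # sort -u emits nothing until ls has finished, so files created
        # by the loop are not picked up as new stems during this pass.
        ls | cut -b3- | sort -u |
        while IFS= read -r stem; do
            if [ "${2:-}" = run ]; then
                cat ./*_"$stem" > "$stem"
            else
                echo "cat *_$stem > $stem"
            fi
        done
    )
}
```

Run `merge_by_stem some_dir` first to inspect the printed commands, then `merge_by_stem some_dir run`. Running it a second time in the same directory would treat the merged outputs as inputs (their first two characters get cut off too), which is one more reason to try it in a test directory.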