
I created this little Bash script that has one argument (a filename) and the script is supposed to respond according to the extension of the file:

#!/bin/bash

fileFormat=${1}

if [[ ${fileFormat} =~ [Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]+$ ]]; then
    echo "its a FASTQ file";
elif [[ ${fileFormat} =~ [Ss][Aa][Mm] ]]; then
    echo "its a SAM file";
else
    echo "its not fasta nor sam";
fi

It's run like this:

sh script.sh filename.sam

If it's a FASTQ file (fastq, FASTQ, fq, FQ, or a compressed fastq.gz), I want the script to tell me "it's a fastq". If it's a SAM file, I want it to tell me it's a sam, and otherwise I want it to tell me it's neither sam nor fastq.

THE PROBLEM: before I considered the .gz (compressed) scenario, the script ran well and gave the result I expected, but something goes wrong when I add the last part of the pattern to account for that case (see the third line of the script, the part that says \.?[[:alnum:]]+ ). This part is meant to say "in the filename, after the extension (fastq in this case), there might be a dot plus some word afterwards".

My input is this:

sh script.sh filename.fastq.gz

And it works. But if I put: sh script.sh filename.fastq

It says it's not fastq. I wanted to make that last part optional, but if I add a "?" at the end it doesn't work. Any thoughts? Thanks! My question is how to fix that part so it works for both cases.

msimmer92
  • sorry, i just edited the question. now you can see it – msimmer92 Jan 08 '19 at 15:12
  • try changing `\.?[[:alnum:]]+` to `(?:\.[[:alnum:]]+)?` – Matt.G Jan 08 '19 at 15:13
  • sorry, see the new edit. sorry for the inconvenience, I accidentally submitted the question before I finished the post and I had to finish it afterwards with an edit. – msimmer92 Jan 08 '19 at 15:15
  • What about using [`file(1)`](https://man.freebsd.org/file), rather than the name? – ghoti Jan 08 '19 at 15:15
  • Use `shopt -s nocasematch` for case insensitive regex match instead of using `[Ff]` – Inian Jan 08 '19 at 15:18
  • @Matt.G it still tells me "it's not a fastq" therefore it didn't work :/ – msimmer92 Jan 08 '19 at 15:19
  • @msimmer92, not sure why it didn't work, as the accepted answer has the same fix (unless, I'm missing something obvious) – Matt.G Jan 08 '19 at 17:03
  • @Matt.G take a closer look at your expression and the one in the accepted answer. Yours has an additional "?:" at the beginning that the other doesn't have. Maybe try both to see if that is what makes it fail. – msimmer92 Jan 09 '19 at 14:38
  • @msimmer92, `?:` denotes a non-capturing group. It looks like bash doesn't support it. – Matt.G Jan 09 '19 at 14:41
  • @Matt.G I left out the fact that I'm working on Mac terminal (OS Mojave). Maybe you're working on Linux and the Bash regex patterns vary a little? (I'm trying to give ideas of why this may have worked for you but not for me). – msimmer92 Jan 09 '19 at 14:44
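
Following the `shopt -s nocasematch` suggestion in the comments above, here is a minimal sketch of a case-insensitive check. The function name `is_fastq` is hypothetical, and the pattern is an assumption modelled on the one discussed in this thread. The body runs in a subshell so that the option does not leak into the rest of the script.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeeds if the name ends in fastq/fq, optionally
# followed by one more extension such as .gz, ignoring case.
# The ( ... ) body is a subshell, so nocasematch is only set inside it.
is_fastq() (
    shopt -s nocasematch
    re='f(ast)?q(\.[[:alnum:]]+)?$'
    [[ $1 =~ $re ]]
)

for name in Sample.FASTQ.gz reads.Fq notes.txt; do
    if is_fastq "$name"; then
        echo "$name: FASTQ"
    else
        echo "$name: not FASTQ"
    fi
done
```

With the option set, the bracketed `[Ff]`-style character classes are no longer needed, which keeps the pattern shorter.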

2 Answers


You may use this regex:

fileFormat="$1"

if [[ $fileFormat =~ [Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$ ]]; then
    echo "its a FASTQ file"
elif [[ $fileFormat =~ [Ss][Aa][Mm]$ ]]; then
    echo "its a SAM file"
else
    echo "its not fasta nor sam"
fi

Here (\.[[:alnum:]]+)? makes the last group optional: a dot followed by one or more alphanumeric characters, so a trailing extension such as .gz may be present or absent.
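
To see what the optional group actually captures, one can inspect `BASH_REMATCH` after a match. This is only an illustrative sketch: the helper name `show_suffix` is made up here, and the regex is stored in a variable (`re`) purely for readability.

```bash
#!/usr/bin/env bash

# Same pattern as in the answer, kept in a variable so it is used unquoted as a regex.
re='[Ff]([Aa][Ss][Tt])?[Qq](\.[[:alnum:]]+)?$'

# Hypothetical helper: print the overall match and the optional trailing extension.
show_suffix() {
    if [[ $1 =~ $re ]]; then
        # BASH_REMATCH[0] is the whole match; BASH_REMATCH[2] is the second
        # parenthesised group, which is empty when no extra extension is present.
        printf '%s: matched "%s", suffix "%s"\n' "$1" "${BASH_REMATCH[0]}" "${BASH_REMATCH[2]}"
    else
        printf '%s: no match\n' "$1"
    fi
}

show_suffix filename.fastq
show_suffix filename.fastq.gz
show_suffix filename.txt
```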

When you run it as:

./script.sh filename.fastq
its a FASTQ file

./script.sh fq
its a FASTQ file

./script.sh filename.fastq.gz
its a FASTQ file

./script.sh filename.sam
its a SAM file

./script.sh filename.txt
its not fasta nor sam
anubhava
  • This is what I was looking for (because it fixes the specific problem of my code). Many thanks !! :) – msimmer92 Jan 08 '19 at 15:26
  • This still looks for "sam" anywhere in the filename, doesn't it? So "samba.txt" would match. – tripleee Jan 08 '19 at 15:57
  • It is because OP's pattern checks for `sam` anywhere in filename and I can't change it without knowing all the `sam` use-cases. – anubhava Jan 08 '19 at 16:34
  • if you put $ at the end ([Ss][Aa][Mm]$) it only matches when sam is at the very end of the name (so it no longer matches when sam appears in the middle of the filename) – msimmer92 Jan 08 '19 at 16:42
  • Since it's either "fastq" or "fq", shouldn't it rather be `[Ff]([Aa][Ss][Tt])?[Qq]`? – Benjamin W. Jan 08 '19 at 16:45
  • Very valid point @BenjaminW., otherwise it will mark `filename.ftq` also as `FASTQ` file. It is edited now, thanks. – anubhava Jan 08 '19 at 16:48

The immediate problem is that you are requiring at least one [[:alnum:]] character after .fastq. This is easy to fix per se with * instead of +.
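
As a rough illustration of that difference, the sketch below runs the question's original pattern (`+`) and the relaxed one (`*`) against the same names. The variable names `re_plus` and `re_star` and the helper `matches` are made up for this example.

```bash
#!/usr/bin/env bash

# Original pattern from the question: needs at least one alphanumeric after "q".
re_plus='[Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]+$'
# Relaxed pattern: zero or more alphanumerics after "q".
re_star='[Ff][Aa]?[Ss]?[Tt]?[Qq]\.?[[:alnum:]]*$'

# Hypothetical helper. Usage: matches REGEX NAME
matches() {
    [[ $2 =~ $1 ]]
}

for name in filename.fastq filename.fastq.gz; do
    if matches "$re_plus" "$name"; then plus=yes; else plus=no; fi
    if matches "$re_star" "$name"; then star=yes; else star=no; fi
    printf '%-20s plus=%s star=%s\n' "$name" "$plus" "$star"
done
```

Note that in both variants the dot is optional on its own, so a name such as `filename.fastqgz` would pass as well; grouping the dot with the extension avoids that.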

Regex is not a particularly happy solution to this problem, though.

case $fileFormat in
    *.[Ff][Aa][Ss][Tt][Qq] | *.[Ff][Aa][Ss][Tt][Qq].*)
        echo "$0: $fileFormat is a FASTQ file" >&2 ;;
    *.[Ss][Aa][Mm] )
        echo "$0: $fileFormat is a SAM file" >&2 ;;
esac

is portable all the way back to the original Bourne sh. In Bash 4.x you could lowercase the filename before the comparison so as to simplify the glob patterns.
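
A minimal sketch of that lowercasing idea, assuming Bash 4.0 or newer for the `${var,,}` expansion (it is not available in the Bash 3.2 that ships with macOS). The function name `classify`, the added `fq` patterns, and the fallback branch are assumptions for illustration, not part of the snippet above.

```bash
#!/usr/bin/env bash
# Requires Bash 4.0+ because of the ${1,,} lowercase expansion.

# Hypothetical helper: print FASTQ, SAM, or unknown for a given filename.
classify() {
    local name=${1,,}   # lowercase copy, so the globs need no [Ff]-style classes
    case $name in
        *.fastq | *.fq | *.fastq.* | *.fq.*)
            echo "FASTQ" ;;
        *.sam)
            echo "SAM" ;;
        *)
            echo "unknown" ;;
    esac
}

classify Reads.FASTQ.GZ
classify sample.FQ
classify aligned.SAM
classify samba.txt
```

Because the globs require a literal dot before the extension, a name like `samba.txt` is not mistaken for a SAM file.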

Notice also how the diagnostics contain the name of the script and print to standard error instead of standard output.

tripleee
  • Nice portable idea, but OP also needs to match `.FQ` or `.fq`, don't they? – Inian Jan 08 '19 at 15:20
  • Yes, but this accounts for .FQ and .fq as well. I like it, but it would also recognize the fastq in some part of the filename and not necessarily in the last position of the word. That's why I was trying to use the other way (by also putting $ at the end and not letting it capture whatever comes next). But for this simple script it is a nice solution – msimmer92 Jan 08 '19 at 15:22