
Here's my issue: I have a bunch of fastq.gz files and need to count the lines in each one (this is not the issue), then derive from that line count a threshold value that is used as a variable further down in the same loop. I searched but cannot find how to do it. Here's what I have so far:

for file in *R1.fastq*; do
var=echo $(zcat "$file" | $((`wc -l`/400000))) 
    for i in *Bacter*; do 
    awk -v var1=$var '{if($2 >= var1) print $0}' ${i} | wc -l >> bacter-filtered.txt 
    done 
done 

I get the error message: -bash: 14850508/400000: No such file or directory

Any help would be greatly appreciated!

glenn jackman
  • I added some formatting. There are some shell errors here. Can you ensure I didn't change your code snippet? What will help you is https://shellcheck.net – glenn jackman Sep 28 '21 at 00:16
  • do you need `wc -l/400000` to evaluate to an integer or a real/float, and if the latter how many decimal places? do the rows being written to `bacter-filtered.txt` need to be maintained in any particular order? how many `fastq` files are there? how many `*Bacter*` files are there and what is the total size (MBytes) of said files? I'm thinking a small redesign of the overall process could ensure we only scan each `*Bacter*` file once ... – markp-fuso Sep 28 '21 at 00:26

1 Answer


The problem is in the line

var=echo $(zcat "$file" | $((`wc -l`/400000))) 

There are a bunch of shell syntax elements here combined in ways that don't connect up with each other. To keep things straight, I'd recommend splitting it into two separate operations:

lines=$(zcat "$file" | wc -l)
var=$((lines/400000))

(You may also have to do something about the output to bacter-filtered.txt -- it's just going to contain a bunch of numbers, with no identifications of which ones come from which files. Also since it always appends, if you run this twice you'll have the output from both runs stuck together. You might want to replace all those appends with a single > bacter-filtered.txt after the last done, so the whole output just gets stored directly.)
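Putting those suggestions together, the whole loop might look like the sketch below. It is only an illustration under some assumptions: the two demo files (`sample_R1.fastq.gz`, `Bacter_demo.txt`) are hypothetical and created just so the snippet runs on its own, `gzip -dc` stands in for `zcat` (the two are equivalent for .gz files on most systems), and the tab-separated output layout is my own choice for labelling each count.

```shell
#!/usr/bin/env bash
# Hypothetical demo data, created only so this sketch is self-contained.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1200000 | gzip > sample_R1.fastq.gz      # 1,200,000 lines, so the threshold is 3
printf 'a 1\nb 3\nc 5\n' > Bacter_demo.txt     # column 2 holds the value being filtered

for file in *R1.fastq*; do
    # Step 1: count the lines; step 2: do the (integer) arithmetic separately.
    lines=$(gzip -dc "$file" | wc -l)
    threshold=$((lines / 400000))
    for i in *Bacter*; do
        # Count the rows whose second field is at least the threshold.
        count=$(awk -v t="$threshold" '$2 >= t' "$i" | wc -l)
        # Label each count with the files it came from (assumed layout).
        printf '%s\t%s\t%d\n' "$file" "$i" "$count"
    done
done > bacter-filtered.txt   # a single redirect, so reruns overwrite instead of appending

cat bacter-filtered.txt
```

With the demo data, two rows (`b 3` and `c 5`) meet the threshold of 3, so the output file holds one line ending in 2.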

What's wrong with the original? Well, let's start with this:

zcat "$file" | $((`wc -l`/400000))

Unless I completely misunderstand, the purpose here is to extract $file (with zcat), count lines in the result (with wc -l), and divide that by 400000. But the output of zcat isn't piped directly to wc; it's piped to a complex expression involving wc, so it's somewhat ambiguous what should happen, and the behavior actually differs between shells. In zsh, it does something completely different from what you want: it lets wc read from the script's stdin (generally your Terminal), divides the result from that by 400000, and then pipes the output of zcat to that ... number?

In bash, it does something closer to what you want: wc actually does read from the output of zcat, so the second part of the pipe essentially turns into:

... | $((14850508/400000))

Now, what I'd expect to happen at this point (and happens in my tests) is that it should evaluate $((14850508/400000)) into 37, giving:

... | 37

which will then try to execute 37 as a command (because it's part of a pipeline, and therefore is supposed to be a command). But for some reason it's apparently not evaluating the division and just trying to execute 14850508/400000 as a command. Which doesn't really work any better or worse than 37, so I guess it doesn't matter much.
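The two pieces of that explanation can be checked in isolation in bash. This is just a small sketch: it assumes there is no command literally named 37 on your PATH, and the exact wording of the error message varies between systems, so only the exit status is examined.

```shell
#!/usr/bin/env bash
# Arithmetic expansion does integer division, so the fraction is dropped.
quotient=$((14850508/400000))
echo "$quotient"    # 37

# Placing the expansion after a pipe makes the shell treat the result as a
# command name; with no command called "37", bash reports status 127.
out=$(echo "some data" | $((14850508/400000)) 2>&1)
status=$?
echo "exit status: $status"
echo "message: $out"
```

Exit status 127 is the shell's "command not found" code, which matches the idea that the number ended up in command position.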

So that's where the error is coming from, but there's actually another layer of confusion in the original line. Suppose that internal pipeline was fixed so that it properly output "37" (rather than trying to execute it). The outer structure would then be:

var=echo $(cmdthatprints37)

The $( ) basically means "run the command inside, and substitute its output into the command line here", so that would evaluate to:

var=echo 37

...which, in shell syntax, means "run the command 37 with var set to "echo" in its environment".
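To see that `name=value command` form in action without relying on a nonexistent command, here is a minimal sketch that substitutes a harmless `sh -c` child process for 37 (that substitution is mine, purely for illustration):

```shell
#!/usr/bin/env bash
unset var

# "var=echo" applies only to the environment of the command that follows it.
seen=$(var=echo sh -c 'printf "%s" "$var"')

# Back in the calling shell, var was never assigned.
after=${var-unset}

echo "inside the command: var=$seen"
echo "afterwards: var=$after"
```

The child process sees var set to the string "echo", while the calling shell's var stays unset, which is why the original line never stored anything useful in var.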

The solution here would be simple. The echo is messing everything up so remove it:

var=$(cmdthatprints37)

...which evaluates to:

var=37

...which is what you want. Except that, as I said above, it'd be better to split it up and do the command bits and the math separately rather than getting them mixed up.

BTW, I'd also recommend some additional double-quoting of shell variables; shellcheck.net will be happy to point out where.
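As an illustration of why the quoting matters, here is a small sketch using a hypothetical file name that contains a space (it assumes the temporary directory path itself has no spaces, which is the usual case):

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
i="$workdir/Bacter sample.txt"       # hypothetical name containing a space
printf 'x 1\ny 2\n' > "$i"

# Quoted: the name is passed as one word and the file is found.
quoted=$(wc -l < "$i")

# Unquoted: the name is split into two words, neither of which exists.
unquoted_status=0
cat $i > /dev/null 2>&1 || unquoted_status=$?

echo "quoted line count: $quoted"
echo "unquoted exit status: $unquoted_status"
```

The same applies to `${i}` and `$var` in the original loop: writing `"$i"` and `var1="$var"` keeps each value as a single word.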

Gordon Davisson
  • Hi all, thank you very much for all the explanation! I am very new to bash and coding in general, and it is always very nice to have the breakdown of how things are designed. I'll give it a try! – desplat yvain Sep 28 '21 at 19:16