I am using awk to count read lengths across a directory of FASTQ files, using the implementation suggested here. It lists each read length and the number of reads with that length.

I would like to implement this in a loop like so:

  for i in $( ls ./Raw_data); do
      awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}' <(gzip -dc "./Raw_data/"$i) 
  done

However, I would also like the resulting table to show which file each count comes from, so I want to print the file name with each awk print statement.

I have tried:

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print $i,  l, lengths[l]}}' <(gzip -dc "./Raw_data/"$i)

awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print FILENAME,  l, lengths[l]}}' <(gzip -dc "./Raw_data/"$i)

but these both fail. I think this is due to the piped input.

How can I achieve this?

G_T
  • See https://stackoverflow.com/questions/6697753/difference-between-single-and-double-quotes-in-bash. Also, use `for i in Raw_data/*`; it handles file names with spaces and includes `Raw_data/` in each name (which is usually what you want). – Davis Herring Sep 21 '17 at 02:44
  • There is no piped input so that can't be the problem. Have you tried running it on one file and adding a print statement to debug the problem? You haven't told us in what way `these both fail` nor have you provided sample input/output we could test against so there's not much we can do towards helping you figure out why. btw with `"./Raw_data/"$i` you're quoting the part that doesn't need to be quoted (`./Raw_data/`) and not quoting the part that does need to be quoted (`$i`). Just use `"./Raw_data/$i"`. – Ed Morton Sep 21 '17 at 02:49
  • I would add `awk -v fName=./Raw_data/"$i" '{awk code}END {print fName, l, lengths[l]}}' <(gzip ...... ./Raw_data/"$i")` . Or there abouts. Good luck. – shellter Sep 21 '17 at 03:07
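
Pulling the comments together, a minimal sketch of the loop might look like the following. It passes the shell variable into awk with `-v`, globs the directory instead of parsing `ls`, and quotes the whole path. The function name `count_read_lengths` and the awk variable `fname` are made up for illustration, and the sketch assumes plain 4-line FASTQ records in gzip-compressed files. As far as I can tell, the two attempts in the question fail for different reasons: inside single quotes the shell never expands `$i`, so awk reads it as a field reference on its own uninitialized variable `i`, and with `<(...)` awk's `FILENAME` holds the file-descriptor path created by the shell (something like `/dev/fd/63`) rather than the original file name.

```shell
#!/usr/bin/env bash
# Sketch: print "<file> <read length> <count>" for each gzipped FASTQ file.
# Assumes 4-line FASTQ records, so the sequence is line 2 of every record.
# count_read_lengths and fname are illustrative names, not from the question.

count_read_lengths() {
    local dir=${1:-./Raw_data}
    local f
    for f in "$dir"/*; do
        # If the glob matched nothing, "$f" is the literal pattern; skip it.
        [ -f "$f" ] || continue
        # -v copies the shell value into an awk variable before any input is read.
        gzip -dc "$f" |
            awk -v fname="$f" '
                NR % 4 == 2 { lengths[length($0)]++ }
                END { for (l in lengths) print fname, l, lengths[l] }
            '
    done
}
```

Calling `count_read_lengths ./Raw_data > read_lengths.txt` would then give one row per file and read length. The order of rows within a file is whatever awk's `for (l in lengths)` produces, so pipe through `sort` if the order matters. Note that `-v` interprets backslash escapes in the value, which only matters for unusual file names.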

0 Answers