0

I am having trouble using awk NR==FNR to return lines of interest from an input .fastq file.

I have the following example input file called example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

I am trying to extract groups of four lines that contain a string of interest, importantly approximate matches must be allowed hence the use of agrep instead of grep. The below example works.

agrep -1 -n "GAAATAATA" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq

The above command produces the following correct output.

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

However if I use a sequence not contained in the second line this command still prints the top two lines as in the following example.

agrep -1 -n "TAGATAAAACT" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

Thanks for helping me understand the behavior of this awk command.

Paul
  • 656
  • 1
  • 8
  • 23

3 Answers3

1

There are no colons (:) in your input, so $1 refers to the whole line and ($1-1) & ($2+2) are going to be -1 and 2 respectively, meaning your for loop will always run exactly four times (for values of i equal to -1, 0, 1, then 2).

Inside the for loop, you're assuring that a[i] exists (that's a[-1], a[0], a[1], and a[2]).

The final part of your code prints the line being examined at that time (but not from the first file thanks to the next in the prior stanza) whenever the array a contains an entry for that file's line number. Therefore, it prints lines 1 and 2 from each input (since a[FNR] exists for FNR equal to 1 or 2).

Since you need an approximate answer and therefore must use agrep, the idea proposed by James Brown's answer to your other question makes sense but its implementation (as dissected above) does not.

The following solution uses agrep's hits as cues for surrounding lines to print alongside the hits (agrep does not support context lines like grep's -A NUM and -B num or else we'd be able to do agrep -A1 -B2 -1 -n PATTERN example.fastq for a simpler answer).

agrep -1 "GAAATAATA" example.fastq | awk '
  NR == FNR { agrep_hit[$0] = 1; next }
  agrep_hit[$0] { print last_line; i = 1 }       
  0 < i && i < 4 { i++; print } 
  { last_line = $0 }
' - example.fastq

This examines the input file twice. The first time uses agrep to find the approximate pattern matches while the second uses awk to get the requested lines of context.

When the overall line number in awk (NR) equals the local file's line number (FNR), it means we're examining the first input (-, standard input, which is the output of agrep). We store the approximate pattern hits in an associative array for later and then move on to the next line with next (therefore the rest of the awk commands only operate on later inputs).

Since you need the previous line, we have to print it explicitly. The final stanza of awk code saves the current line as last_line so we can retrieve it later. Upon a line that was outputted by agrep (and thus stored in our array), we print that saved last_line and set an iterator i to 1.

When i is 1, 2, or 3, we increment it and print the current line. This prints the matching line and then two more for context.

Adam Katz
  • 14,455
  • 5
  • 68
  • 83
  • 1
    Thanks for this helpful breakdown of the awk command syntax. The use of agrep in the question is to allow mismatches that are not possible with grep. Unfortunately agrep does not have the -B (before) and -A (after) arguments hence the use of awk as a workaround. Please see this related post for more details on this issue. https://stackoverflow.com/questions/53768447/grep-that-tolerates-mismatches-to-subset-fastq – Paul Dec 17 '18 at 21:59
  • 1
    I have revised my solution to reflect the need for `agrep`'s approximate pattern match and included your link to the source of that code in my answer. Please include such references in your future questions. – Adam Katz Dec 17 '18 at 23:47
  • While I follow the logic and appreciate the explanation this solution is not generating any output for me. Lines 1 and 2 work `./agrep/agrep -1 -n "GAAATAATA" example.fastq | awk 'NR==FNR { agrep_hit[$0] = 1; print }' - example.fastq` but I am unable to get any output by adding additional lines. – Paul Dec 18 '18 at 15:39
  • 1
    Sorry, I was assuming your `agrep` command was more sane; you forced me to `apt install agrep`. Just remove the `-n` and it will work (the prefixed line numbers prevented matching lines from the first read of the file to lines from the second read of the file). I've updated my answer to that effect. – Adam Katz Dec 18 '18 at 17:36
1

You may use this agrep + awk solution:

srch() {
   awk -F ': ' 'NR==FNR {
      a[$1] = 1
      next
   }
   a[FNR] {
      print p
      print
      for (i=0; i<2 && getline > 0; i++)
         print
   }
   {
      p=$0
   }' <(agrep -1 -n "$2" "$1") "$1"
}

Then run it as:

srch file 'GAAATAATA'

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

and this:

srch file 'TAGATAAAACT

@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6'
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 1
    I think you have a stray `r=NR;` in there. You never read it. – Adam Katz Dec 17 '18 at 23:43
  • Thanks but this solution doesn't facilitate approximate matches. Is there a way to build in approximate match functionality so that a single mismatch will still return a result? – Paul Dec 18 '18 at 14:59
  • My bad i should have made it more clear. I will edit to clarify. – Paul Dec 18 '18 at 15:09
0

with record separator definition (GNU awk)

$ awk -v RS='(^|\n)@' '/GAAATAATA/{printf "%s", rt $0} {rt=RT}' file

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
karakfa
  • 66,216
  • 7
  • 41
  • 56