Grep that tolerates mismatches to subset .fastq

Question

I am working with bash on a linux cluster. I am trying to extract reads from a .fastq file if they contain a match to a queried sequence. Below is an example .fastq file containing three reads.

$ cat example.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

I would like to extract reads containing the sequence GAAATAATA. I can perform this extraction using grep as shown in the following command.

$ grep -F -B 1 -A 2 "GAAATAATA" example.fastq > MATCH.fastq

$ cat MATCH.fastq

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

However, this strategy does not tolerate any mismatches. For example, a read containing the sequence GAAATGATA will be ignored. I need this extraction to tolerate one mismatch at any position in the queried sequence. So my question is how can I achieve this? Is there a sequence alignment package available with similar functionality to grep? Are there any fastq subsetting packages available that perform this type of operation? One note is that speed is very important. Thanks for your guidance.

Well, there is `agrep` with _approximate matching capabilities_: `agrep -1 "GAAATGATA"` `-1` allows one mismatch. Unfortunately I don't think it has switches for after/before context (`grep -A -B`). — James Brown, Dec 13 '18 at 19:17
Can the mismatch be any character? I'm assuming it will be `[GATC]`, but do we need to specify that, or is it safe to make it an assumption? — Paul Hodges, Dec 13 '18 at 20:22
Have you tried to do an alignment against an index that only contains the sequence you are looking? Maybe standard aligners can take care of this scenario. — Poshi, Dec 13 '18 at 21:44
@Poshi, this is something I was considering doing the complication in my mind that I need to get a .fastq formatted file of matches as output. Do you know if this is possible with and aligner like STAR? — Paul, Dec 13 '18 at 22:13
STAR is RNA oriented, if I'm not wrong. Why don't you try with a standard DNA one? Bowtie, BWA... in BWA you just need to create the index: `bwa index reference.fasta` and then perform the alignment `bwa mem index fastq`. I'm just worried about how it will deal with a so short reference, but trying will be quick. — Poshi, Dec 13 '18 at 22:19

score 2 · Accepted Answer · answered Dec 13 '18 at 19:41

2

Here is a solution using agrep to get the record numbers of matches and an awk that prints out those records with some context (due to missing -Aand -B in agrep):

$ agrep -1 -n  "GAAATGATA" file | 
  awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - file

Output:

@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

answered Dec 13 '18 at 19:41

James Brown

36,089
7
43
59

I have identified a problem with this solution that i was hoping you might be willing to comment on. This solution will output the first two lines even when there is no match. Here is an example `agrep -1 -n "TAGATAAAACT" example.fastq | awk -F: 'NR==FNR{for(i=($1-1);i<=($1+2);i++)a[i];next}FNR in a' - example.fastq` . I have a very beginner level understanding of awk and have been trying to understand why this is happening. – Paul Dec 17 '18 at 20:24
Sure there is a match, try: `grep -n "TAGATAAAACT" example.fastq` – James Brown Dec 18 '18 at 05:14
To clarify the `grep -n "TAGATAAAACT" example.fastq` returns a match in line 10. But the solution returns lines 1,2,9,10,11,12. Lines 1 and 2 should not be returned – Paul Dec 18 '18 at 14:52
`agrep -1 -n "TAGATAAAACT" file` returns me only line 10 and `agrep -1 -n "TAGATAAAACT | awk ...` returns lines 9-12. Tried with 4 different awks and unix and Dos line endings. – James Brown Dec 18 '18 at 16:58

Paul Hodges · Answer 2 · 2018-12-18T17:33:49.560

You might try a file of patterns -

$: cat GAAATAATA
.AAATAATA
G.AATAATA
GA.ATAATA
GAA.TAATA
GAAA.AATA
GAAAT.ATA
GAAATA.TA
GAAATAA.A
GAAATAAT.

then

grep -B 1 -A 2 -f GAAATAATA example.fastq > MATCH.fastq

but it will probably slow the process down a bit to add both full regex parsing AND an alternate pattern for each possible single change...

responding to question in comments:

For a given value of $word, such as word=GAAATAATA,

awk '{
  for ( i=1; i<=length($0); i++ ) {
     split($0,tmp,""); tmp[i]=".";
     for ( n=1; n<=length($0); n++ ) { printf tmp[n]; }
     printf "\n";
  }
}' <<< "$word" > "$word"

This will create this specific file. Hope that helps, but remember that this will be a lot slower since you are now using regexes instead of just matching plain strings, AND you are introducing a whole series of alternate patterns to match...

This is an interesting idea do you have any suggestions on how to create the permutations described using bash or awk? — Paul, Dec 18 '18 at 16:36
maybe a for loop with `echo GAAATAATA | sed 's/^$.\{1\}$/\./'` — Paul, Dec 18 '18 at 17:09

Ed Morton · Answer 3 · 2018-12-14T00:38:22.880

This should work but idk if the MATCH.fastq in your question is the expected output or not or even if your sample input contains any cases that need a working solution to find so idk if it's actually working or not:

$ cat tst.awk
BEGIN {
    for (i=1; i<=length(seq); i++) {
        regexp = regexp sep substr(seq,1,i-1) "." substr(seq,i+1)
        sep = "|"
    }
}
{ rec = rec $0 ORS }
!(NR % 4) {
    if (rec ~ regexp) {
        printf "%s", rec
    }
    rec = ""
}

$ awk -v seq='GAAATAATA' -f tst.awk example.fastq
@SRR1111111.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
@SRR1111111.3 3/1
CTATANTATTGAAATAATAATGTAGATAAAACTATTGAATAACAGCAACTTAAATTTTCAATAAGA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6

Grep that tolerates mismatches to subset .fastq

3 Answers3

Linked