13

I am trying to match rows in a file that contain a string, say ACTGGGTAAACTA. If I do

grep "ACTGGGTAAACTA" file 

it gives me the rows that contain exact matches. Is there a way to allow for a certain number of mismatches (substitutions, insertions or deletions)? For example, I am looking for sequences with:

  1. Up to 3 allowed substitutions, like "AGTGGGTAACCAA" etc.

  2. Insertions/deletions (having a partial match like "ACTGGGAAAATAAACTA" or "ACTAAACTA")

tripleee
Ssank
  • Do you mean something like "find ACTGGGTAAACTA or sequences that differ from it in up to 3 letters"? – Ramón Gil Moreno May 20 '15 at 17:03
  • 3
    Regex is not a fuzzy-match tool. You have to be very precise about what, exactly, you are looking for. You can explicitly declare that some character can be missing (for example, `ACTGGGTA{1,3}CTA` makes it possible to match `ACTGGGTACTA`, `ACTGGGTAACTA` and `ACTGGGTAAACTA`), but the more "fuzzy" you make your regex, the more undesired matches you'll end up with. – JDB May 20 '15 at 17:07
  • Maybe similar to [Fuzzy file search in linux console](http://stackoverflow.com/questions/9439121/fuzzy-file-search-in-linux-console) – emartinelli May 20 '15 at 18:10
  • possible duplicate of [Fuzzy regular expressions](http://stackoverflow.com/questions/4155840/fuzzy-regular-expressions) – tripleee Jun 12 '15 at 14:54

5 Answers

5

There used to be a tool called agrep for fuzzy regex matching, but it got abandoned.

http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.

https://github.com/Wikinaut/agrep looks like a revived open source release, but I have not tested it.

Failing that, see if you can find tre-agrep for your distro.

tripleee
3

You can use tre-agrep and specify the maximum edit distance with the -E switch, or with the single-digit shorthand -N (as in -9 below). For example, if you have a file foo:

cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF

You can match every line with an edit distance of up to 9 like this (-s prints the cost of each match, -w matches whole words only):

tre-agrep -s -9 -w ACTGGGTAAACTA foo

Output:

4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
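
To see where these costs come from, here is a minimal sketch of the classic unit-cost edit (Levenshtein) distance in plain Python. The function name `levenshtein` is just illustrative, and the sketch assumes tre-agrep is running with its default cost of 1 per insertion, deletion and substitution. Because each line of foo is a single word and -w is used, the whole-line distance should agree with the costs shown above.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance between two whole strings (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


pattern = "ACTGGGTAAACTA"
for line in ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"]:
    print(f"{levenshtein(pattern, line)}:{line}")
```

The first line needs four deletions to become the pattern and the second needs four insertions, which is why both are reported with a cost of 4.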
Thor
1

There's a Python library called fuzzysearch (that I wrote) which provides precisely the required functionality.

Here's some sample code that should work:

from fuzzysearch import find_near_matches

with open('path/to/file', 'r') as f:
    data = f.read()

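# 1. search allowing up to 3 substitutions and no insertions or deletions
#    (the other two limits are given explicitly, since fuzzysearch expects
#    every edit type to be bounded when max_l_dist is not set)
matches = find_near_matches("ACTGGGTAAACTA", data,
                            max_substitutions=3, max_insertions=0, max_deletions=0)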

# 2. also allow insertions and deletions, i.e. allow an edit distance
#    a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)
taleinat
1

You can use fzf to fuzzy-search for a string in the lines of a file as follows:

cat file | fzf --filter='ACTGGGTAAACTA'

The following also works, because fzf reads its input from STDIN and the file is redirected to it:

fzf --filter='ACTGGGTAAACTA' < file

In fact, you can also interactively see how fzf is filtering lines by launching its user interface as follows:

cat file | fzf

In the user interface, type some keywords (separated by spaces) to see the filtering in action.

Remember the GNU/Linux philosophy, specifically the concept of modularity, which lets us handle small but powerful pieces independently. We can combine a bunch of these small pieces to make magic. That is the beauty of GNU/Linux.

Harsh Verma
Rubem Pacelli
0

Short answer: no, not with plain grep and regular expressions.

Long answer: As @JDB said, regex is inherently precise. You can manually allow a mismatch at a given position, for example by writing [ATGC] instead of A, but there is no way to say "allow at most N mismatches anywhere in the pattern". I suggest that you write your own code to do the matching, or try to find an existing DNA sequence-matching tool.
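
If you do go the route of writing your own code, a minimal pure-Python sketch could look like the following. All names here (`min_substitutions`, `min_edits`, `fuzzy_grep`) are made up for illustration. The first function covers the substitutions-only case with a sliding window; the second uses the standard dynamic-programming approach for approximate substring search, in which a match may start at any position in the line. It is written for clarity rather than speed, so treat it as a starting point, not a tuned implementation.

```python
from typing import Iterable, Iterator, Optional, Tuple


def min_substitutions(pattern: str, text: str) -> Optional[int]:
    """Fewest substitutions needed to match pattern at any offset in text.

    Returns None when text is shorter than pattern (no window fits).
    """
    m = len(pattern)
    if len(text) < m:
        return None
    return min(
        sum(p != t for p, t in zip(pattern, text[i:i + m]))
        for i in range(len(text) - m + 1)
    )


def min_edits(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    # prev[j]: cost of matching pattern[:j] against a substring ending
    # at the previous text position. Before any text, that is j deletions.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ct in text:
        curr = [0]  # an empty pattern prefix matches anywhere at no cost
        for j, cp in enumerate(pattern, start=1):
            curr.append(min(
                prev[j] + 1,               # extra character in text
                curr[j - 1] + 1,           # pattern character missing from text
                prev[j - 1] + (cp != ct),  # substitution, or free match
            ))
        prev = curr
        best = min(best, curr[-1])
    return best


def fuzzy_grep(pattern: str, lines: Iterable[str],
               max_edits: int) -> Iterator[Tuple[int, str]]:
    """Yield (cost, line) for each line containing an approximate match."""
    for line in lines:
        line = line.rstrip("\n")
        cost = min_edits(pattern, line)
        if cost <= max_edits:
            yield cost, line


if __name__ == "__main__":
    sample = ["ACTGGGAAAATAAACTA", "ACTAAACTA", "AGTGGGTAACCAA", "TTTTTTTT"]
    for cost, line in fuzzy_grep("ACTGGGTAAACTA", sample, max_edits=3):
        print(f"{cost}:{line}")
```

Note that `min_edits` scores the best-matching substring, so a long line can score lower than its whole-line edit distance would suggest. To run it against a real file, pass an open file object as `lines`.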

Community
The Guy with The Hat