
The following code works as a minimal example. It searches for a regular expression, allowing one mismatch, inside a text (later a large DNA file).

awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'

Later I am also interested in the position where the regular expression is found, so the actual awk command is more complex, like it is solved here
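For reference, awk's match() sets the built-in variables RSTART and RLENGTH, so a minimal position-reporting sketch (using the same toy pattern as above) might look like this:

```shell
awk 'BEGIN {
    s = "CTGGGTCATTAAATCGTTAGC"
    # match() returns the 1-based start position and sets RSTART/RLENGTH
    if (match(s, /.ATC|A.TC|AA.C|AAT./))
        print "match at", RSTART, "length", RLENGTH, "->", substr(s, RSTART, RLENGTH)
}'
# prints: match at 12 length 4 -> AATC
```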

If I want to search with more mismatches and a longer string I will come up with very long regex expressions:

example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches "." allowed:
/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
... and so on (actually 4060 possibilities)

/
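For illustration, an alternation like this would be generated programmatically rather than typed by hand; a small awk sketch (with the pattern length 30 and the 3 wildcard positions from the example above hard-coded) counts the alternatives:

```shell
awk 'BEGIN {
    n = 30                                    # pattern length
    base = ""
    for (i = 1; i <= n; i++) base = base "A"  # "AAAA...A"
    cnt = 0
    # choose every set of 3 positions for the "." wildcards
    for (i = 1; i <= n - 2; i++)
        for (j = i + 1; j <= n - 1; j++)
            for (k = j + 1; k <= n; k++) {
                p = base
                p = substr(p, 1, i - 1) "." substr(p, i + 1)
                p = substr(p, 1, j - 1) "." substr(p, j + 1)
                p = substr(p, 1, k - 1) "." substr(p, k + 1)
                cnt++                         # p is one alternative
            }
    print cnt
}'
# prints: 4060
```

Joining the generated patterns with `|` yields exactly the regex above, which is why it grows combinatorially: C(30,3) = 4060 alternatives of 30 characters each.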

The problem with my solution is:

  • a very long regex will not be accepted by awk! (the limit seems to be at roughly 80,000 characters)
  • Error: "bash: /usr/bin/awk: Argument list too long"
  • possible solution: SO-Link, but I don't find the solution there...

My question is:

  • Can I somehow still use the long regex expression?
    • splitting the string and running the command multiple times could be a solution, but then I will get duplicated results.
  • Is there another way to approach this?
    • ("agrep" will work, but not to find the positions)
Lucas
    Welcome to SO, kudos for a nice question(with all your efforts and research shown in it, keep it up). Only one thing could you please post samples of input and expected output too(more clearly) so that we get better understanding of your question, thank you. – RavinderSingh13 May 10 '21 at 09:21
  • 2
    I'd suggest to ask this on https://bioinformatics.stackexchange.com/ since there is a possibility that there is some tool that can already do this, for example `biopython` – Sundeep May 10 '21 at 09:27
  • 2
    Is the _very long regex_ identical to the one you presented in the code, `(.ATC|A.TC|AA.C|AAT.)`? In that case I'd look into implementing an _approximate pattern matching_ algorithm, using AATC as a search pattern and allowing only substitutions. – James Brown May 10 '21 at 09:35
  • 2
    I would perhaps look into using a different programming language with slightly higher-level logic. A Python script with a Levenshtein distance calculation would just need the input string and the maximum number of allowed deviations, and could easily be extended to provide information about the file, the line, and the position of the match. – tripleee May 10 '21 at 09:38
  • Thanks! @JamesBrown - yes in "agrep" for example this works. But awk does not support it!? – Lucas May 10 '21 at 09:47
  • 1
    https://github.com/Wikinaut/agrep documents a `-b` option to reveal where in the text the match begins. – tripleee May 10 '21 at 09:49
  • https://stackoverflow.com/questions/6690739/high-performance-fuzzy-string-comparison-in-python-use-levenshtein-or-difflib should at least get you started on a Python solution. – tripleee May 10 '21 at 09:50
  • Possible duplicate of https://stackoverflow.com/questions/30355972/fuzzy-string-matching-with-grep – tripleee May 10 '21 at 09:52
  • Awk does not have any built-in APM algorithms. You need to implement it yourself or google one from the internets. – James Brown May 10 '21 at 09:58
  • agrep with `-b` like @tripleee suggested is a possible solution. But I will get the overall position in the file. But I need the position in every line... – Lucas May 10 '21 at 10:59
  • 2
    Note that the error `"bash: /usr/bin/awk: Argument list too long"` is from Bash (the shell) and not from `awk`. One workaround for 'argument list too long' would be to place the Awk script into a file (e.g. `script.awk`) and then use `awk -f script.awk …` to run it. That evades the limits on the command line in Bash. Nevertheless, your approach is probably sub-optimal. – Jonathan Leffler May 11 '21 at 22:01
  • Thank you @JonathanLeffler this helped and is a real answer! Though it will run in a 137 Error at some point (memory related maybe). – Lucas May 14 '21 at 12:47
  • agrep with -b does not work perfectly in such long documents. I have another SO question [Here](https://stackoverflow.com/questions/67499894/agrep-fuzzy-search-number-of-results-does-not-make-sense). My solution is now to use R's aregexec() command, which is okay for me at this moment. Though a pure Unix-based solution would still be interesting... – Lucas May 14 '21 at 12:52

3 Answers


As Jonathan Leffler points out in comments your issue in the first case (bash: /usr/bin/awk: Argument list too long) is from the shell and you can solve that by putting your awk script in a file.

As he also points out, your fundamental approach is not optimal. Below are two alternatives.


Perl has many features that will aid you with this.

You can use the ^ XOR operator on two strings: it returns \x00 at positions where the strings match and a non-NUL character where they differ. March the pattern along the longer string, XORing at each offset and comparing the difference count against a maximum substitution count, and there you are:

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "AAAAAA";
my $max_subs = 3;

my $len_in  = length $seq;
my $len_pat = length $pat;
my %posn;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

for my $i ( 0 .. $len_in - $len_pat ) { 
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}

say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;

Running this prints:

6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT

Substituting:

my $seq      = "AAATCGAAAAGCDFAAAACGT";
my $pat      = "AATC";
my $max_subs = 1;

Prints:

1 => AATC
8 => AAGC
15 => AAAC

It is also easy (in the same style as awk) to convert this to 'magic input' from either stdin or a file.


You can also write a similar approach in awk:

echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v seq="AATC" '
{
for(i=1; i<=length($1)-length(seq)+1; i++) {
    cnt=0
    for(j=1;j<=length(seq); j++) 
        if(substr($1,i+j-1,1)!=substr(seq,j,1)) cnt++
    if (cnt<=mc) print i-1 " => " substr($1,i, length(seq)) 
    }
}'

Prints:

1 => AATC
8 => AAGC
15 => AAAC

And the same result with the longer example above. Since the input is moved to STDIN (or a file) and the regex does not need to be HUGE, this should get you started either with Perl or Awk.

(Be aware that the first character of a string is offset 1 in awk and offset 0 in Perl...)

dawg
  • One question for the awk approach: I think a "+1" is missing here `... i <= length($1) - length(seq); ...` ? – Lucas May 24 '21 at 09:27
  • 1
    @Lucas: *I think a "+1" is missing here*: Having arrays and strings start at 1 instead of 0 does cause that error often! Having a second look, I identified two bugs based on that: a match at the very start and one at the very end. Corrected. – dawg May 24 '21 at 16:41

Is there another way to approach this?

Looking for fuzzy matches is easy with Python. You just need to install the PyPi regex module by running the following in the terminal:

pip install regex # or pip3 install regex

and then create the Python script (named, say, script.py) like

#!/usr/bin/env python3
import regex
filepath = r'myfile.txt'
with open(filepath, 'r') as file:
    for line in file:
        for x in regex.finditer(r"(?:AATC){s<=1}", line):
            print(f'{x.start()}:{x.group()}')

Use the pattern you want; here, (?:AATC){s<=1} means you want to match the AATC char sequence allowing one substitution at most in the match. Prepending (?e) would make the engine attempt to find a better fit.

Run the script using python3 script.py.

If myfile.txt contains just one AAATCGAAAAGCDFAAAACGT line, the output is

1:AATC
8:AAGC
15:AAAC

meaning that there are three matches at positions 1 (AATC), 8 (AAGC) and 15 (AAAC); the script prints both the start position and the matched text.

See an online Python demo:

import regex
line='AAATCGAAAAGCDFAAAACGT'
for x in regex.finditer(r"(?:AATC){s<=1}", line):
    print(f'{x.start()}:{x.group()}')
Wiktor Stribiżew
  • @Lucas It is much more powerful than you see now, if you follow the details at the PyPi regex page, you will see you can use more sophisticated rules with the `{...}` fuzzy quantifier, e.g. you can specify the number of any errors (e.g. `{e<=3}` to specify up to 3 insertions/deletions/substitutions), or a variable amount of substitutions (like `{1<=s<=3}` to allow 1 to 3 substitutions), you can specify the pattern to allow fuzzy matches on inside the match, etc. – Wiktor Stribiżew May 24 '21 at 09:27
  • Yes, **if** I will use python this is the way! I also found an R function to solve it. aregexec() Though the question was also to see if a pure unix approach works. – Lucas May 24 '21 at 09:35
  • @Lucas I believe `aregexec` in R uses the default base R `TRE` regex engine, which is very old, outdated and contains lots of limitations and bugs. I'd rather vote for the PyPi regex Python approach since this module is still maintained and bugfixed, and will probably replace the default Python `re` engine one day. – Wiktor Stribiżew May 24 '21 at 09:37

The "Argument list too long" problem is not from Awk. You're running into the operating system's memory size limit on the argument material that can be passed to a child process. You're passing the Awk program to Awk as a very large command line argument.

Don't do that; put the code into a file, and run it with awk -f file, or make the file executable and put a #!/usr/bin/awk -f or similar hash-bang line at the top.

That said, it's probably not such a great idea to include your data in the program source code as a giant literal.
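As a sketch, the `-f` workaround looks like this (`script.awk` is just an example file name):

```shell
# store the awk program in a file instead of passing it as one giant argument
cat > script.awk <<'EOF'
BEGIN { print "program loaded with -f; the argv size limit no longer applies" }
EOF
awk -f script.awk </dev/null
```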

Kaz