The following code works as a minimal example. It searches a text (later a large DNA file) for a regular expression that allows one mismatch.
awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'
Later I also need the positions at which the regular expression matches, so the real awk command is more complex, similar to the solution linked here.
If I allow more mismatches and use a longer search string, the regular expression becomes very long:
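To illustrate what "finding the positions" means here, a minimal sketch that repeats `match()` and uses awk's built-in `RSTART` to report every 1-based start position, overlapping matches included. The helper name `find_positions` is hypothetical, and the text is passed with `-v` only for the demo (the sample is the visible part of the sequence above). A large DNA file would have to be read as input, and the repeated `substr()` would likely be too slow for it.

```shell
# Hypothetical helper: print the 1-based start of every match of the
# one-mismatch regex in the given text (overlapping matches included).
find_positions() {
  awk -v text="$1" 'BEGIN {
    re = ".ATC|A.TC|AA.C|AAT."
    offset = 0                          # characters already cut off the front
    while (match(text, re)) {
      print offset + RSTART             # position in the original text
      offset += RSTART
      text = substr(text, RSTART + 1)   # continue one character after the match start
    }
  }'
}

find_positions "CTGGGTCATTAAATCGTTAGC"   # prints 12
```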
example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches "." allowed:
/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
(and so on, 4060 alternatives in total)
/
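For reference, such a regex does not have to be typed by hand. A minimal sketch that generates all alternatives with exactly k positions replaced by ".", assuming a hypothetical helper name `mismatch_regex` and a pattern made of plain letters only (no regex metacharacters):

```shell
# Hypothetical helper: print every variant of a pattern in which exactly k
# positions are replaced by ".", joined by "|".
mismatch_regex() {
  awk -v pat="$1" -v k="$2" '
    # prefix: variant built so far, pos: next position in pat,
    # left: number of "." still to place
    function gen(prefix, pos, left,    n) {
      n = length(pat)
      if (n - pos + 1 < left) return                 # too few positions remain
      if (pos > n) { out = out sep prefix; sep = "|"; return }
      if (left > 0) gen(prefix ".", pos + 1, left - 1)
      gen(prefix substr(pat, pos, 1), pos + 1, left)
    }
    BEGIN { gen("", 1, k); print out }'
}

mismatch_regex AATC 1                                # prints .ATC|A.TC|AA.C|AAT.

# 30 x "A" with 3 mismatches: count the generated alternatives
pat=$(printf '%030d' 0 | tr 0 A)
mismatch_regex "$pat" 3 | tr '|' '\n' | grep -c .    # prints 4060
```

The count agrees with the binomial coefficient C(30,3) = 4060, since each alternative corresponds to one choice of 3 wildcard positions out of 30.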
The problem with my solution is:
- A very long regex is not accepted by awk (the limit seems to be around 80,000 characters).
- Error: "bash: /usr/bin/awk: Argument list too long"
- Possible solution: SO-Link, but I could not find a working solution there.
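The message "Argument list too long" is printed by bash when the operating system refuses to start a program because its command-line arguments are too large, so it concerns the size of the inline awk program rather than the regex itself. A minimal sketch of one way around it, assuming the program is written to a temporary file and passed with awk's standard `-f` option. The helper name `match_from_file` is hypothetical and the short regex stands in for the long one. Whether awk's regex engine then handles thousands of alternatives efficiently is a separate question that this sketch does not answer.

```shell
# Hypothetical helper: keep the awk program (including its regex) in a file
# so that it is not passed as a command-line argument.
match_from_file() {
  prog=$(mktemp) || return 1
  cat > "$prog" <<'EOF'
BEGIN { print match(text, /.ATC|A.TC|AA.C|AAT./) }
EOF
  awk -v text="$1" -f "$prog"
  rm -f "$prog"
}

match_from_file "CTGGGTCATTAAATCGTTAGC"   # prints 12
```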
My question is:
- Can I somehow still use the long regex?
- Splitting the string and running the command multiple times could be a solution, but then I would get duplicated results.
- Is there another way to approach this?
- ("agrep" works, but it does not report the match positions.)
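One regex-free direction, shown only as a sketch: slide a window of the pattern's length over the text and count mismatching characters, which is equivalent to the "." alternatives above, since "." also matches the original letter. The helper name `approx_positions` is hypothetical, and the text is passed with `-v` for the demo only. A large DNA file would have to be read as input, and the run time grows with text length times pattern length.

```shell
# Hypothetical helper: print "position substring" for every window of the
# text that differs from the pattern in at most k characters.
approx_positions() {
  awk -v pat="$1" -v k="$2" -v text="$3" 'BEGIN {
    m = length(pat); n = length(text)
    for (i = 1; i + m - 1 <= n; i++) {
      mism = 0
      # stop comparing as soon as the window exceeds k mismatches
      for (j = 1; j <= m && mism <= k; j++)
        if (substr(text, i + j - 1, 1) != substr(pat, j, 1)) mism++
      if (mism <= k) print i, substr(text, i, m)
    }
  }'
}

approx_positions AATC 1 "CTGGGTCATTAAATCGTTAGC"   # prints 12 AATC
```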