2

I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line. I found this somewhat related question.

So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:

echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } ' 

The resulting output is:

1 4

But what I would expect is something like this:

1 4
1 13
1 22

I tried using split instead of match. I managed to identify all the occurrences, but RSTART is lost and printed as "0".

echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '

Output:

1 0
1 0
1 0

Any advice would be appreciated. I am not limited to awk, but an awk solution would be preferred. Also, in my case the pattern to match would be a regex (/A.C/). Thank you

Wiktor Stribiżew
RicGGG
  • how to deal with overlaps? eg, assume the input is `01010` and you're looking for `010` ... 1 or 2 matches? – markp-fuso Jan 25 '22 at 22:15
  • In my current situation I am sure I have no overlaps in the file. Hypothetically I think it would be best to have a way to deal with overlaps as two separate matches. – RicGGG Jan 26 '22 at 05:38
  • Please read [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) and then replace the word "pattern" with string or regexp everywhere it occurs in your question. You should also use regexp metachars in your search "pattern" and include those in your sample input/output so we have something we can really test with. – Ed Morton Jan 26 '22 at 12:21
  • Also include overlapping strings in your example like search for `ABCABC` in the string `ABCABCABCABC` to show if the output should be `1 1\n1 7` or `1 1\n1 4\n1 7` or something else. As written you could get answers that produce the expected output from the sample input but don't actually do whatever it is you need. – Ed Morton Jan 26 '22 at 12:25
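
The overlap question raised in the comments can be made concrete with a small sketch. It assumes overlapping occurrences should each be reported, and that the regex never matches the empty string; `find_overlapping` is a hypothetical helper name, not something from the question. The only change from a non-overlapping search is that the scan resumes one character after the start of the previous match instead of after its end.

```shell
# Hypothetical helper: print "line column" for every match of the regex
# given as $1, including matches that overlap. Reads lines from stdin.
find_overlapping() {
  awk -v re="$1" '{
    begpos = 1
    # Search the remainder of the line; stop once begpos runs past the end.
    while (begpos <= length($0) && match(substr($0, begpos), re)) {
      print NR, begpos + RSTART - 1   # column relative to the whole line
      begpos += RSTART                # resume one character past the match start
    }
  }'
}

# The example from the comments: ABCABC inside ABCABCABCABC
echo ABCABCABCABC | find_overlapping 'ABCABC'
```

For this input the sketch reports columns 1, 4 and 7. Replacing `begpos += RSTART` with `begpos += RSTART + RLENGTH - 1` would skip past each match and report only columns 1 and 7.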

5 Answers

2

This may be what you're trying to do:

echo xyzABCdefghiABCdefghiABCdef | 
awk '{ begpos=1
       while (match(substr($0, begpos), /ABC/)) {
           print NR, begpos + RSTART - 1
           begpos += RLENGTH + RSTART - 1
       }
     }'
M. Nejat Aydin
2

Another option, using GNU awk, is split with a regex.

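In GNU awk's split(string, array, fieldsep, seps), the third argument is the field separator (here the regex /ABC/) and the optional fourth argument is the seps array, which receives the text matched by each separator. Adding up the lengths of the pieces in a and of the separators in seps gives the position of each match.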

echo xyzABCdefghiABCdefghiABCdef | 
awk ' { 
  n=split($0, a, /ABC/, seps); pos=1
  for(i=1; i<n; i++){
    pos += length(a[i])
    print NR, pos
    pos += length(seps[i])
  } 
}'

Output

1 4
1 13
1 22
The fourth bird
1

Determination of the coordinates of a string with awk:

echo "xyzABCdefghiABCdefghiABCdef" \
  | awk -v s="ABC" 'BEGIN{ len=length(s) }
      {
        for(i=1; i<=length($0); i++){
          if(substr($0, i, len)==s){
            print NR, i
          }
        }
      }'

Output:

1 4
1 13
1 22

As one line:

echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'

Source: Find position of character with awk

Cyrus
  • This works very well in the example I posted, but it fails to find matches when my pattern is a regular expression (such as /A.C/ ). I guess it just needs a small tweak for that. – RicGGG Jan 26 '22 at 09:03
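
One possible tweak for the regex case mentioned in the comment above, offered as a sketch rather than as the answerer's method: keep the position-by-position scan, but replace the `substr(...) == s` comparison with a `match()` against the regex anchored at the start of the remaining text. The helper name `scan_coords` and the variable `anchored` are made up for illustration, and the regex is assumed not to match the empty string.

```shell
# Hypothetical helper: test every column of each line for a match of the
# regex given as $1 that starts exactly at that column.
scan_coords() {
  awk -v re="$1" '
  BEGIN { anchored = "^(" re ")" }   # force the match to start at column i
  {
    n = length($0)
    for (i = 1; i <= n; i++)
      if (match(substr($0, i), anchored) && RLENGTH > 0)
        print NR, i
  }'
}

echo xyzABCdefghiAXCdefghiAbCdef | scan_coords 'A.C'
```

Because every column is tested, overlapping matches are reported as well. The cost is one `substr()` and one `match()` per character, which may be slow on very long lines.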
1

One awk idea using split() and some slicing-n-dicing of length() results:

ptn='ABC'

echo xyzABCdefghiABCdefghiABCdef | 
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
  n=split($0,arr,ptn)
  for (i=1;i<n;i++) { 
      pos+=length(arr[i] ptn)
      print NR,pos
  }
}'

This generates:

1 4
1 13
1 22
markp-fuso
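
A caveat on the split() idea above: `length(ptn)` only equals the length of the matched text when the pattern is a fixed string, and split() treats a multi-character separator string as a regex, so a pattern such as `a.c` would also split on `abc`. If the search string really is literal, `index()` avoids regex interpretation altogether. The sketch below assumes a non-empty literal string; `literal_coords` is a hypothetical helper name.

```shell
# Hypothetical helper: print "line column" for every non-overlapping
# occurrence of the literal string $1 (no regex interpretation).
# Note: awk -v still processes backslash escapes in the assigned value.
literal_coords() {
  awk -v s="$1" '{
    rest = $0; offset = 0
    while (s != "" && (p = index(rest, s)) > 0) {
      print NR, offset + p
      offset += p + length(s) - 1        # characters consumed so far
      rest = substr(rest, p + length(s)) # continue after this occurrence
    }
  }'
}

# The dot is matched literally, so the trailing "abc" is not reported.
echo 'a.c a.c abc' | literal_coords 'a.c'
```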
1

With your shown samples, please try the following awk code.

awk '
{
  prev=0
  while(match($0,/ABC/)){
    $0=substr($0,RSTART+RLENGTH)
    print FNR,prev+RSTART
    prev+=RSTART+2
  }
}
'  Input_file

Explanation: a detailed explanation of the above.

awk '                              ##Starting awk program from here.
{
  prev=0                           ##Setting prev variable to 0 here.
  while(match($0,/ABC/)){          ##Using while loop to match ABC string and it runs till ABC match is true in current line.
    $0=substr($0,RSTART+RLENGTH)   ##Re-creating current line by assigning value of rest of line(which starts after match of ABC).
    print FNR,prev+RSTART          ##Printing line number along with prev+RSTART value here.
    prev+=RSTART+2                 ##Advancing prev by RSTART+2, i.e. RSTART+RLENGTH-1 for the 3-character match ABC, so it counts the characters removed so far.
  }
}
'  Input_file                      ##Mentioning Input_file name here.
RavinderSingh13
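
The constant 2 in `prev+=RSTART+2` equals RLENGTH-1 for a three-character match, so it fits /ABC/ and /A.C/ but not a regex whose matches vary in length. A possible generalization, sketched under the assumption that matches do not overlap, uses RLENGTH directly and works on a copy of the line instead of rewriting $0. The helper name `match_coords` and the third output column (the matched text) are additions for illustration only.

```shell
# Hypothetical helper: print "line column matched-text" for each
# non-overlapping match of the regex given as $1. Reads lines from stdin.
match_coords() {
  awk -v re="$1" '{
    rest = $0; offset = 0
    # RLENGTH > 0 guards against regexes that can match the empty string.
    while (match(rest, re) && RLENGTH > 0) {
      print NR, offset + RSTART, substr(rest, RSTART, RLENGTH)
      offset += RSTART + RLENGTH - 1          # characters consumed so far
      rest = substr(rest, RSTART + RLENGTH)   # continue after this match
    }
  }'
}

# Matches of different lengths on several lines
printf 'xyzABCdefAXXCghi\nnothing here\nAYC\n' | match_coords 'A[^C]+C'
```

With this input the matches are ABC at line 1 column 4, AXXC at line 1 column 10, and AYC at line 3 column 1.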