How to print the row number and starting location of a pattern when multiple matches per row are present?

Question

I want to use awk to match all the occurrences of a pattern within a large file. For each match, I would like to print the row number and the starting position of the pattern along the row (sort of xy coordinates). There are several occurrences of the pattern in each line. I found this somewhat related question.

So far, I managed to do it only for the first (leftmost) occurrence in each line. As an example:

echo xyzABCdefghiABCdefghiABCdef | awk 'match($0, /ABC/) {print NR, RSTART } '

The resulting output is:

1 4

But what I would expect is something like this:

1 4
1 13
1 22

I tried using split instead of match. I managed to identify all the occurrences, but RSTART is lost and printed as "0".

echo xyzABCdefghiABCdefghiABCdef | awk ' { split($0,t, /ABC/,m) ; for (i=1; i in m; i++) print (NR, RSTART) } '

Output:

1 0
1 0
1 0

Any advice would be appreciated. I am not limited to awk, but an awk solution would be preferred. Also, in my case the pattern to match would be a regex (/A.C/). Thank you.

how to deal with overlaps? eg, assume the input is `01010` and you're looking for `010` ... 1 or 2 matches? — markp-fuso, Jan 25 '22 at 22:15
In my current situation I am sure I have no overlaps in the file. Hypothetically I think it would be best to have a way to deal with overlaps as two separate matches. — RicGGG, Jan 26 '22 at 05:38
Please read [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern) and then replace the word "pattern" with string or regexp everywhere it occurs in your question. You should also use regexp metachars in your search "pattern" and include those in your sample input/output so we have something we can really test with. — Ed Morton, Jan 26 '22 at 12:21
Also include overlapping strings in your example like search for `ABCABC` in the string `ABCABCABCABC` to show if the output should be `1 1\n1 7` or `1 1\n1 4\n1 7` or something else. As written you could get answers that produce the expected output from the sample input but don't actually do whatever it is you need. — Ed Morton, Jan 26 '22 at 12:25
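To make the overlap question concrete, here is a minimal sketch (not taken from any of the answers) that reports overlapping matches as separate hits by restarting the search one character after each match start. The function name `overlapping_positions` and the awk variable `re` are made up for illustration. It assumes the regex has no `^` anchor and cannot match the empty string.

```shell
# Hypothetical helper: print "line start" for every match, overlaps included.
# The regex is passed via -v, so awk processes backslash escapes in it first.
overlapping_positions() {
    awk -v re="$1" '{
        begpos = 1
        # Search the remainder of the line; stop at end of line or on no match.
        while (begpos <= length($0) && match(substr($0, begpos), re)) {
            start = begpos + RSTART - 1   # position in the full line
            print NR, start
            begpos = start + 1            # step one char, not RLENGTH, to keep overlaps
        }
    }'
}

printf '%s\n' 'ABCABCABCABC' | overlapping_positions 'ABCABC'
# prints:
# 1 1
# 1 4
# 1 7
```

Replacing `begpos = start + 1` with `begpos = start + RLENGTH` would give the non-overlapping reading instead (`1 1` and `1 7` for this input).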

score 2 · Answer 1 · answered Jan 25 '22 at 22:10


This may be what you're trying to do:

echo xyzABCdefghiABCdefghiABCdef | 
awk '{ begpos=1
       while (match(substr($0, begpos), /ABC/)) {
           print NR, begpos + RSTART - 1
           begpos += RLENGTH + RSTART - 1
       }
     }'


M. Nejat Aydin


score 2 · Answer 2 · answered Jan 25 '22 at 22:51

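Another option, using GNU awk, is split() with a regex.

In gawk's split function, the 3rd argument is the field separator (here a regex) and the 4th argument is the seps array, which receives the text matched by each separator. The lengths of the pieces and of the separators can then be added up to calculate the positions.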

echo xyzABCdefghiABCdefghiABCdef | 
awk ' { 
  n=split($0, a, /ABC/, seps); pos=1
  for(i=1; i<n; i++){
    pos += length(a[i])
    print NR, pos
    pos += length(seps[i])
  } 
}'

Output

1 4
1 13
1 22

score 1 · Answer 3 · answered Jan 25 '22 at 21:54

Determination of the coordinates of a string with awk:

echo "xyzABCdefghiABCdefghiABCdef" \
  | awk -v s="ABC" 'BEGIN{ len=length(s) }
      {
        for(i=1; i<=length($0); i++){
          if(substr($0, i, len)==s){
            print NR, i
          }
        }
      }'

Output:

1 4
1 13
1 22

As one line:

echo xyzABCdefghiABCdefghiABCdef | awk -v s="ABC" 'BEGIN{ len=length(s) } { for(i=1; i<=length($0); i++){ if(substr($0,i,len)==s) { print NR,i } } }'

Source: Find position of character with awk

edited Jan 25 '22 at 22:03

answered Jan 25 '22 at 21:54

Cyrus


This works very well in the example I posted, but it fails to find matches when my pattern is a regular expression (such as /A.C/ ). I guess it just needs a small tweak for that. – RicGGG Jan 26 '22 at 09:03
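One possible tweak for the regex case, offered as a sketch rather than as Cyrus's own method: keep the position-by-position scan, but test whether the text starting at each position matches the regex anchored at its beginning. The function name `regex_scan_positions` and the variable `re` are hypothetical. Because every position is tested independently, overlapping matches are reported too, at the cost of roughly quadratic work on long lines.

```shell
# Hypothetical helper: print "line start" for every position where re matches.
# The regex is passed via -v, so awk processes backslash escapes in it first.
regex_scan_positions() {
    awk -v re="$1" '{
        n = length($0)
        for (i = 1; i <= n; i++)
            # Wrap re in ^( ... ) so it must match at position i exactly.
            if (substr($0, i) ~ ("^(" re ")"))
                print NR, i
    }'
}

printf '%s\n' 'xyzABCdefghiAXCdefghiA-Cdef' | regex_scan_positions 'A.C'
# prints:
# 1 4
# 1 13
# 1 22
```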

score 1 · Answer 4 · answered Jan 25 '22 at 22:10


One awk idea using split() and some slicing-n-dicing of length() results:

ptn='ABC'

echo xyzABCdefghiABCdefghiABCdef | 
awk -v ptn="${ptn}" '
{ pos=-(length(ptn)-1)
  n=split($0,arr,ptn)
  for (i=1;i<n;i++) { 
      pos+=length(arr[i] ptn)
      print NR,pos
  }
}'

This generates:

1 4
1 13
1 22


markp-fuso


Thank you. This is the solution that I ended up using in my code. – RicGGG Jan 26 '22 at 09:59

score 1 · Accepted Answer · answered Jan 26 '22 at 00:03

With your shown samples, please try the following awk code.

awk '
{
  prev=0
  while(match($0,/ABC/)){
    $0=substr($0,RSTART+RLENGTH)
    print FNR,prev+RSTART
    prev+=RSTART+2
  }
}
'  Input_file

Explanation: a detailed, line-by-line explanation of the above.

awk '                              ##Starting awk program from here.
{
  prev=0                           ##Setting prev variable to 0 here.
  while(match($0,/ABC/)){          ##Using while loop to match ABC string; it runs as long as an ABC match is found in the current line.
    $0=substr($0,RSTART+RLENGTH)   ##Re-creating current line by assigning value of rest of line(which starts after match of ABC).
    print FNR,prev+RSTART          ##Printing line number along with prev+RSTART value here.
    prev+=RSTART+2                 ##Setting prev to prev+RSTART+2 here.
  }
}
'  Input_file                      ##Mentioning Input_file name here.

I just wanted to add that ` prev+=RSTART+2 ` should be changed based on the length of the pattern. 2 works if the pattern is 3 characters long. — RicGGG, Jan 26 '22 at 10:55
@RicGGG, yes you can change it to `prev+=RSTART+RLENGTH-1` I didn't check it but it should work. — RavinderSingh13, Jan 26 '22 at 11:42
Yes, I saw the links. My reputation is too low to up-vote, though. — RicGGG, Jan 26 '22 at 13:00
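Following up on the comments about the hard-coded `+2`, here is a minimal sketch of the same loop with `RLENGTH` in its place, so that matches of any length are handled. It is an adaptation, not the answer's original code: it works on a copy of the line (the variable `rest`) instead of overwriting `$0`, and it takes the regex from a variable. The names `match_positions`, `re` and `rest` are made up for illustration. It stops if the regex matches the empty string, to avoid looping forever.

```shell
# Hypothetical helper: print "line start" for each non-overlapping match of re.
# The regex is passed via -v, so awk processes backslash escapes in it first.
match_positions() {
    awk -v re="$1" '{
        rest = $0   # copy of the line, so $0 and its fields stay intact
        prev = 0    # number of characters already consumed from the line
        while (rest != "" && match(rest, re) && RLENGTH > 0) {
            print FNR, prev + RSTART
            prev += RSTART + RLENGTH - 1          # skip up to the end of this match
            rest = substr(rest, RSTART + RLENGTH) # keep only the text after it
        }
    }'
}

printf '%s\n' 'xyzABBCdefAC' | match_positions 'AB*C'
# prints:
# 1 4
# 1 11
```

With the sample line from the question and the regex `A.C`, this gives the same `1 4`, `1 13`, `1 22` as the other answers.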
