Find positions of all occurrences of a pattern in a string when every line has a different pattern defined in another column (UNIX)

Question

I have this tabulated file as shown:

1    MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   V
2    MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   M
.
.
And so on...

The first column is a record number, the second column is a protein sequence, and the third (last) column is a single character: the pattern to find in that line's sequence.

Thus, the desired output would be something like this:

1:positions:4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions:1 18 22 110 134

I have tried with awk and its index function:

nawk -F'\t' -v p=$3 'index($2,p) {printf "%s:positions:", NR; s=$2; m=0; while((n=index(s, p))>0) {m+=n; printf "%s ", m; s=substr(s, n+1)} print ""}' "file.tsv"

However, this works only when the -v variable is given a literal character or string, not $3. How can I do this in a Unix environment? Thanks in advance.

arco444 · Accepted Answer · 2017-08-03T13:27:50.217

1

You can do:

awk -F'\t' '{ len=split($2,arr,""); printf "%s:positions:",$1 ; for(i=1;i<=len;i++) { if(arr[i] == $3 ) { printf "%s ",i } }; print "" }' file.tsv

First split the subject $2 entirely into an array of characters, then loop over it (awk arrays created by split start at index 1), check for occurrences of $3, and print the array index whenever one is found.

edited Aug 03 '17 at 13:27

answered Aug 03 '17 at 09:22


You should mention that will only work in some awks since it's relying on undefined behavior per POSIX (split() with a null string separator creating an array of characters). Gawk will do what you want, some other awks won't. – Ed Morton Aug 03 '17 at 13:40
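As a sketch of a variant that avoids split() with an empty separator, the index() loop from the question can be kept almost unchanged. The -v p=$3 assignment fails there because the shell expands $3 before awk starts, so awk never sees the per-line value; reading $3 inside the awk program avoids that. The function name find_positions and the two-record sample input are made up for illustration, and tab-separated input is assumed:

```shell
# find_positions is a hypothetical helper name. It reads tab-separated
# records (id, sequence, pattern) from the named files, or from stdin.
find_positions() {
    awk -F'\t' '{
        out = $1 ":positions:"
        s = $2; off = 0; sep = ""
        # index() returns 0 once the pattern no longer occurs, ending the loop.
        while ((n = index(s, $3)) > 0) {
            off += n                 # absolute 1-based position of this match
            out = out sep off
            sep = " "
            s = substr(s, n + 1)     # keep searching just past the match start
        }
        print out
    }' "$@"
}

# Toy input, not the data from the question.
printf '1\tMGNVFEV\tV\n2\tMGNVFEM\tM\n' | find_positions
```

On this toy input the sketch prints 1:positions:4 7 and 2:positions:1 7. It relies only on index() and substr(), which POSIX awk defines.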

score 0 · Answer 2 · answered Aug 03 '17 at 09:28

Perl to the rescue:

perl -wane '
    print "$F[0]:positions:";
    $i = 0;
    print " ", $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
    print "\n";
' -- file

If the space after : is a problem, you can complicate it to

$i = $f = 0;
$f = print " " x $f, $i while ($i = 1 + index $F[1], $F[2], $i) > 0;

score 0 · Answer 3 · answered Aug 03 '17 at 09:46

gawk solution:

awk -v FPAT="[[:digit:]]+|[[:alpha:]]" '{ 
       r=$1":positions:"; for(i=2;i<NF;i++) { if($i==$NF) r=r" "i-1 } print r 
    }' file.tsv

FPAT="[[:digit:]]+|[[:alpha:]]" - regex pattern defining what a field value looks like (FPAT is a gawk extension)
for(i=2;i<NF;i++) - iterating through the fields (the letters of the 2nd column)

The output:

1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134

CWLiu · Answer 4 · 2017-08-03T10:13:53.370

0

awk '{
  str=$1":positions:";
  n=0;cnt=split($2,a,$3);          # use $3 as the delimiter to split $2; the pieces go into a
  for(i=1;i<cnt;i++){              # every piece except the last is followed by one delimiter
    n+=length(a[i])+1;str=str" "n  # locate that delimiter $3 by computing n+length(a[i])+1
  }
  print str
}' file.tsv
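To make the arithmetic in those comments concrete, here is a small trace on a made-up record (sequence MGNVFEV, pattern V). The name split_trace is hypothetical, and the sketch assumes the pattern is a single character, which is what the +1 accounts for:

```shell
# split_trace is a hypothetical name; it prints each piece produced by
# split() and the position of the delimiter that follows it.
split_trace() {
    awk '{
        cnt = split($2, a, $3)      # MGNVFEV split on V gives MGN, FE and an empty piece
        n = 0
        for (i = 1; i < cnt; i++) {
            n += length(a[i]) + 1   # piece length plus the one-character delimiter
            printf "piece %d \"%s\" -> delimiter at position %d\n", i, a[i], n
        }
    }' "$@"
}

# Toy input, not the data from the question.
printf '1 MGNVFEV V\n' | split_trace
```

This prints one line per occurrence: piece 1 "MGN" with the delimiter at position 4, then piece 2 "FE" with the delimiter at position 7. The last piece is skipped because no delimiter follows it.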

edited Aug 03 '17 at 10:13

answered Aug 03 '17 at 09:55


Ed Morton · Answer 5 · 2017-08-03T13:42:02.503

0

$ awk '{out=$1 ":positions:"; for (i=1;i<=length($2);i++) { c=substr($2,i,1); if (c == $3) out = out " " i}; print out}' file
1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134

edited Aug 03 '17 at 13:42

answered Aug 03 '17 at 13:36


score 0 · Answer 6 · answered Nov 20 '19 at 01:59

Simple perl solution

use strict;
use warnings;

while( <DATA> ) {
    chomp;

    next if /^\s*$/;        # just in case if you have empty line

    my @data = split "\t";  # record is tabulated

    my %result;             # hash to store result
    my $c = 0;              # position in the string

    map { $c++; push @{$result{$data[0]}}, $c if $_ eq $data[2] } split '', $data[1];

    print "$data[0]:positions:"
          . join(' ', @{ $result{$data[0]} // [] }) # assemble result; empty list if no match
          . "\n";
}

__DATA__
1   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   V

2   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   M

score -1 · Answer 7 · answered Aug 03 '17 at 09:12

I would use a small script that goes through every line of your file, takes the last field as search_string, and then uses grep to get the positions of search_string. All you have to do then is shift the result, since you have an offset of 1. The sed command removes the newlines from the grep output.

while read p; do
    search_string=`echo $p |awk '{print $NF}'`
    echo $p |grep -aob $search_string  | sed ':a;N;$!ba;s/\n/ /g'
done < file.tsv

answered Aug 03 '17 at 09:12

schorsch312


That would be immensely slow. It also contains some fundamental shell scripting errors (not using IFS= and -r on the while read loop, unquoted variables, deprecated backticks), non-portable code (sed and echo), and chains of pipes and external commands where shell builtins can do the same thing. – Ed Morton Aug 03 '17 at 13:36
