1

I have a tab-delimited file like this:

1    MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   V
2    MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   M
.
.
And so on... 

The first column is an ID number, the second column is a protein sequence, and the third column is a single character: the pattern to find in that line's sequence.

The desired output would be something like this:

1:positions:4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions:1 18 22 110 134

I have tried awk with the index function:

nawk -F'\t' -v p=$3 'index($2,p) {printf "%s:positions:", NR; s=$2; m=0; while((n=index(s, p))>0) {m+=n; printf "%s ", m; s=substr(s, n+1)} print ""}' "file.tsv"

However, this only works when the -v variable is set to a literal character or string, not to $3. How can I do this in a Unix environment? Thanks in advance.

Ravi Saroch

7 Answers

1

You can do:

awk -F'\t' '{ len=split($2,arr,""); printf "%s:positions:",$1 ; for(i=1;i<=len;i++) { if(arr[i] == $3 ) { printf "%s ",i } }; print "" }' file.tsv

First split the subject $2 entirely into an array of characters, then loop over it, check for occurrences of $3, and print the array index whenever one is found.
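
As a side note, the index() loop from the question can also be kept. The original attempt fails because the shell expands $3 in -v p=$3 before awk ever runs, so p never sees the third field. Below is a minimal sketch, assuming the pattern is read from $3 inside the awk program instead; the function name find_positions is hypothetical and only there to make the snippet reusable.

```shell
# find_positions is a hypothetical helper name; $1 is a tab-separated file.
find_positions() {
    awk -F'\t' 'NF >= 3 {
        out = $1 ":positions:"; sep = ""
        s = $2; off = 0
        # index() returns the 1-based position of $3 in s, or 0 if absent
        while ((n = index(s, $3)) > 0) {
            off += n                 # position relative to the full sequence
            out = out sep off
            sep = " "
            s = substr(s, n + 1)     # continue searching after this match
        }
        print out
    }' "$1"
}
```

Calling find_positions file.tsv should print one id:positions:... line per record, with no space after the second colon. It relies only on index() and substr(), so it does not depend on split() with an empty separator.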

arco444
  • You should mention that this will only work in some awks, since it relies on behavior that is undefined per POSIX (split() with a null string separator creating an array of characters). Gawk will do what you want; some other awks won't. – Ed Morton Aug 03 '17 at 13:40
0

Perl to the rescue:

perl -wane '
    print "$F[0]:positions:";
    $i = 0;
    print " ", $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
    print "\n";
' -- file

If the space after : is a problem, you can complicate it to

$i = $f = 0;
$f = print " " x $f, $i while ($i = 1 + index $F[1], $F[2], $i) > 0;
choroba
0

gawk solution:

awk -v FPAT="[[:digit:]]+|[[:alpha:]]" '{ 
       r=$1":positions:"; for(i=2;i<NF;i++) { if($i==$NF) r=r" "i-1 } print r 
    }' file.tsv
  • FPAT="[[:digit:]]+|[[:alpha:]]" - regex pattern defining field value

  • for(i=2;i<NF;i++) - iterating through the fields (letters of the 2nd column)


The output:

1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134
RomanPerekhrest
0
awk '{
  str=$1":positions:";
  n=0;split($2,a,$3);              # use $3 as the delimiter to split $2 into array a
  for(i=1;i<length(a);i++){        # every piece except the last is followed by a match
    n+=length(a[i])+1;str=str" "n  # match position = running sum of piece lengths, plus 1 per match
  }
  print str
}' file.tsv
CWLiu
0
$ awk '{out=$1 ":positions:"; for (i=1;i<=length($2);i++) { c=substr($2,i,1); if (c == $3) out = out " " i}; print out}' file
1:positions: 4 23 43 53 56 65 68 91 92 100 120 123 125
2:positions: 1 18 22 110 134
Ed Morton
0

Simple perl solution

use strict;
use warnings;

while( <DATA> ) {
    chomp;

    next if /^\s*$/;        # skip empty lines, just in case

    my @data = split "\t";  # record is tabulated

    my %result;             # hash to store result
    my $c = 0;              # position in the string

    map { $c++; push @{$result{$data[0]}}, $c if $_ eq $data[2] } split '', $data[1];

    print "$data[0]:positions:"
          . join(' ', @{$result{$data[0]}}) # assemble result to desired form
          . "\n";
}

__DATA__
1   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   V

2   MGNVFEKLFKSLFGKKEMRILMVGLDAAGKTTILYKLKLGEIVTTIPTIGFNVETVEYKNISFTVWDVGGQDKIRPLWRHYFQNTQGLIFVVDSNDRERVNEAREELTRMLAEDELRDAVLLVFVNKQDLPNAMNAAEITDKLGLHSLRQRNWYIQATCATSGDGLYEGLDWLSNQLKNQK   M
Polar Bear
-1

I would use a small script that goes through every line of your file, takes the last field as search_string, and then uses grep to get the positions of search_string. All that remains is to shift the results by 1, since grep reports 0-based byte offsets. The sed command removes the newlines from the grep output.

while read p; do
    search_string=`echo $p |awk '{print $NF}'`
    echo $p |grep -aob $search_string  | sed ':a;N;$!ba;s/\n/ /g'
done < file.tsv
schorsch312
  • That would be immensely slow. It also contains some fundamental shell scripting errors (not using IFS= and -r on the while read loop, unquoted variables, deprecated backticks), non-portable code (sed and echo), and chains of pipes + external commands where shell builtins can do the same thing. – Ed Morton Aug 03 '17 at 13:36
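
The points raised in the comment above can be sketched as a loop that uses only shell builtins: IFS and -r on read, quoted expansions, and parameter expansion in place of grep, sed and awk. This is a minimal sketch, assuming a tab-separated file and a literal (non-regex) pattern in the third field; the function name positions_sh is hypothetical. For large files an awk solution is likely still the better choice, since shell read loops are slow.

```shell
# positions_sh is a hypothetical helper name; $1 is a tab-separated file.
positions_sh() {
    tab=$(printf '\t')
    # IFS is set only for read; -r keeps backslashes literal
    while IFS="$tab" read -r id seq pat; do
        [ -n "$pat" ] || continue          # skip blank or incomplete lines
        out="$id:positions:"; sep=""
        rest=$seq; base=0
        while :; do
            case $rest in
                *"$pat"*) ;;               # pattern still present in the remainder
                *) break ;;
            esac
            before=${rest%%"$pat"*}        # text before the first match
            out="$out$sep$((base + ${#before} + 1))"
            sep=" "
            base=$((base + ${#before} + ${#pat}))
            rest=${rest#*"$pat"}           # drop everything up to and including the match
        done
        printf '%s\n' "$out"
    done < "$1"
}
```

Run it as positions_sh file.tsv. Quoting "$pat" inside the expansions makes the shell treat it as a literal string rather than a glob pattern.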