using awk to print header name and a substring

Question

i try using this code for printing a header of a gene name and then pulling a substring based on its location but it doesn't work

>output_file
cat input_file | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' tmp`
        startPos=`awk '{print $2}' tmp`
        endPOs=`awk '{print $3}' tmp`
                for i in temp; do
                echo ">${geneName}" >> genes_fasta ;
                echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
        done
done

input_file

nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548

fasta

atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............

and this is my wrong output file

>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta

output should look like that:

>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....

An example of both the input file (with a handful of cases) and the expected result would be useful to get a better sense of what you're trying to accomplish, here. — David Allewell, Nov 22 '19 at 23:54
Looks good, but a "correct" output of what you would actually like it do to would also be invaluable, as I'm still not following what you're trying to accomplish. Also, your `awk` lines setting your variables are looking at `tmp`, but the file you create is called `temp`. — David Allewell, Nov 23 '19 at 00:39
please fix issues found by https://shellcheck.net and then update your Q with required output from sample input. You know that can all be one `awk` process, rather than 5-6 per line of data? See http://grymoire.com/Unix/Awk.html . Good luck. — shellter, Nov 23 '19 at 01:07
You need to remove the "echo" and double quotes from the following `echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta"`. This is a command right? If you put echo it simply prints the command and appends to file. — Vince, Nov 23 '19 at 02:22

David C. Rankin · Answer 1 · 2019-11-23T06:19:37.197

You can do this with a single call to awk which will be orders of magnitude more efficient than looping in a shell script and calling awk 4-times per-iteration. Since you have bash, you can simply use command substitution and redirect the contents of fasta to an awk variable and then simply output the heading and the substring containing the beginning through ending characters from your fasta file.

For example:

awk -v fasta=$(<fasta) '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

or using getline within the BEGIN rule:

awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input

Example Input Files

Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:

$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79

and the first 129-characters of your example fasta

$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc

Example Use/Output

$ awk -v fasta=$(<fasta) '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg

Look thing over and let me know if I understood your question requirements. Also let me know if you have further questions on the solution.

score 1 · Answer 2 · answered Nov 23 '19 at 03:29

1

If I'm understanding correctly, how about:

awk 'NR==FNR {fasta = fasta $0; next}
    {
        printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
    }' fasta input_file > genes_fasta

It first reads fasta file and stores the sequence in a variable fasta.
Then it reads input_file line by line, extracts the substring of fasta starting at $2 and of length $3 - $2 + 1. (Note that the 3rd argument to substr function is length, not endpos.)

Hope this helps.

answered Nov 23 '19 at 03:29

tshiono

21,248
2
14
22

@ZivAttia Can you be specific about `still not working`? – tshiono Nov 23 '19 at 04:47
'awk '{val=substr($0,"$startPos","$endPos");print val}' unwraped_${fasta} >> genes_fasta doesn't work for some reason. if i change the startPos and endPos to numbers it works fine – Ziv Attia Nov 23 '19 at 04:56
it didn't work and now I need to fix only this line – Ziv Attia Nov 23 '19 at 05:08
As I've commented in my answer, the 3rd argument to `substr` function is length, not the position. – tshiono Nov 23 '19 at 05:45
ok, for some reason it is not working for me. i try using this command awk -v b="$startPos" -v c="$endPos" '{val=substr($0,$b,$c);print val}' unwraped_${fasta} >> genes_fasta but it doesn't work either. how can i set a variable in an awk command? – Ziv Attia Nov 23 '19 at 05:48
I'd strongly recommend to switch to awk-only solution if you are not sure how to pass shell variables to awk including its efficiency. If my answer does not work (I don't know why), please try David Rankin's solution. – tshiono Nov 23 '19 at 06:11

score 1 · Accepted Answer · edited Nov 23 '19 at 06:55

1

made it work! this is the script for pulling substrings from a fasta file

cat genes_and_bounderies1 | while read row; do
        echo $row > temp
        geneName=`awk '{print $1}' temp`
        startPos=`awk '{print $2}' temp`
        endPos=`awk '{print $3}' temp`
        length=$(expr $endPos - $startPos)
                for i in temp; do
                echo ">${geneName}" >> genes_fasta
                awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
        done
done

edited Nov 23 '19 at 06:55

David C. Rankin

81,885
6
58
85

answered Nov 23 '19 at 06:53

Ziv Attia

57
7

1

Good, glad you got it working. But look closely at the number of subshells spawned per-iteration. For each iteration of your while loop, you call `awk` 3-times before your loop `for i in temp; do` and then once per iteration there.. Nothing wrong with making it work first, but with a million rows to read, the difference in time to execute would be hours verses seconds. – David C. Rankin Nov 23 '19 at 06:58
how would you suggest writing it? i'm a biologist not a programmer and would really like to learn! – Ziv Attia Nov 25 '19 at 15:56
2

The way I wrote it up in my answer. You see there is only a single-process (single subshell) spawned to run one instance of `awk` for the entire project. In your code you call `awk` a minimum of 4-times per-iteration. The program startup time alone along with loading and unloading the data used by each process will add orders of magnitude of runtime. – David C. Rankin Nov 25 '19 at 18:14

using awk to print header name and a substring

3 Answers3