16

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.

kvantour
  • 25,269
  • 4
  • 47
  • 72
cucurbit
  • 1,422
  • 1
  • 13
  • 32

4 Answers4

24

An awk / gawk solution can be composed by three stages:

  1. Every time header is found these actions should be performed:

    • Print previous seqlen if exists.
    • Print tag.
    • Initialize seqlen.
  2. For the sequence lines we just need to accumulate totals.
  3. Finally at the END stage we print the remnant seqlen.

Commented code:

awk '/^>/ { # header pattern detected
        if (seqlen){
         # print previous seqlen if exists 
         print seqlen
         }

         # pring the tag 
         print

         # initialize sequence
         seqlen = 0

         # skip further processing
         next
      }

# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa

A oneliner:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

For the totals:

awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }     
    END{print seqlen
        print seq" sequences, total length " seqtotal+seqlen
    }' file.fa
Juan Diego Godoy Robles
  • 14,447
  • 2
  • 38
  • 52
  • Yeah, this works. But also I need a "final line" with the total of sequences and the total length, following the example: 3 sequences, total length 127. (Sorry, in the question this is behind a # ) – cucurbit Jun 02 '14 at 10:57
  • Just minor formatting change. `awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}'` – Jotne Jun 02 '14 at 10:57
  • A very old topic, but it is just what I need! Is it also possible to adjust the one liner awk command that it prints the number of nucleotides on the same line as the header, and not on a new line, like: >header1 60 – Gravel May 17 '17 at 10:37
  • 1
    Sure @Gravel just place `; printf $0" " ;` instead of `; print ;` – Juan Diego Godoy Robles May 22 '17 at 14:29
  • this will include newline characters in the count – Brian Wiley Sep 08 '20 at 06:45
  • sorry maybe its counting carriage return. just insert `BEGIN{RS="\r\n"}` at the beginning – Brian Wiley Sep 08 '20 at 07:08
1

I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line, It's no longer a one-liner, so the downside is you'll have to save it as a script file.

It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.

BEGIN {
  OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
  if (seqlen) {
    # Only print info for this sequence if no target was given
    # or its id matches the target.
    if (! target || id == target) {
      print id, seqlen;
    }
  }
  # Get sequence id:
  # 1. Split header on whitespace (fields[1] is now ">id")
  split($0, fields);
  # 2. Get portion of first field after the starting ">"
  id = substr(fields[1], 2);
  seqlen = 0;
  next;
}
{
  seqlen = seqlen + length($0);
}
END {
  if (! target || id == target) {
    print id, seqlen;
  }
}
Nick S
  • 555
  • 4
  • 17
1

A quick way with any awk, would be this:

awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta

You might be also interested in BioAwk, it is an adapted version of awk which is tuned to process FASTA files

bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta

Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) . I'm not sure if this version is compatible with POSIX.

kvantour
  • 25,269
  • 4
  • 47
  • 72
0

"seqkit" is a quick way:

seqkit fx2tab --length --name --header-line  sequence.fa

I also modified your code to handle multiple sequence lines; you can concatenate the sequence lines and calculate the length of the combined sequence. Here's an updated version of your code:

awk '/^>/ {if (seq) { print length(seq); seq=""; } print; next; } { seq = seq $0; } END { print length(seq); }' file.fa

This modified code checks for lines starting with >, indicating a header line. If a sequence line is encountered, it appends the sequence to the seq variable. When a new header line is found or the end of the file is reached (END block), it prints the length of the concatenated sequence.

The output will be:

>header1
117
>header2
3
>header3
7
127

To extract the total number of sequences and total length separately, you can store the values in variables and print them at the end:

awk '/^>/ {if (seq) { print length(seq); seq=""; seqCount++; } print; next; } { seq = seq $0; } END { print length(seq); seqCount++; print "#", seqCount, "sequences, total length", length(seq) "." }' file.fa

The updated output will be:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This code keeps track of the number of sequences encountered in the seqCount variable and increments it each time a sequence is printed. At the end, it prints the desired output with the total sequence count and total length.

Nar_sys
  • 9
  • 4
  • Excellent answer....straight to the point. – user3105519 Mar 12 '23 at 00:58
  • 1
    This answer looks like it was generated by an AI (like ChatGPT), not by an actual human being. You should be aware that [posting AI-generated output is officially **BANNED** on Stack Overflow](https://meta.stackoverflow.com/q/421831). If this answer was indeed generated by an AI, then I strongly suggest you delete it before you get yourself into even bigger trouble: **WE TAKE PLAGIARISM SERIOUSLY HERE.** Please read: [Why posting GPT and ChatGPT generated answers is not currently acceptable](https://stackoverflow.com/help/gpt-policy). – tchrist Jul 07 '23 at 13:26