
I have a file as follows. I would like to count the total number of characters in each sequence, that is, across all the lines under each ">" header.

>1DMLA
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
DDVAARLRAAGFGAVGAGATAEETRRMLHRAFDTLA
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS

I tried the following code.

awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' <file

The output of the above code is

1DMLA - 80
1DMLA - 80
1DMLA - 80
1DMLA - 79
1DMLB - 36
2BHDC - 80
2BHDC - 80
2BHDC - 80

My desired output is

1DMLA - 319
1DMLB - 36
2BHDC - 240

How do I change the code above to get my desired output?

3 Answers

This way:

awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile

Or formatted:

awk -F\> '/^>/ { if (seqlen != "")
                    print seqlen
                 printf("%s - ", $2)
                 seqlen = 0
                 next }
          seqlen != "" { seqlen += length($0) }
          END { print seqlen }' infile

See: Sequence length of FASTA file

Besides producing the expected result, this also copes with malformed input, such as sequence lines that appear before the first header and a header that has no sequence lines:

$ cat infile
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS


$ awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile
1DMLB - 0
2BHDC - 240
Juan Diego Godoy Robles

Here's one way using awk:

awk '/^>/ && r { print r, "-", s; r=s="" } /^>/ { r = substr($0, 2); next } { s += length } END { print r, "-", s }' file

Results:

1DMLA - 319
1DMLB - 36
2BHDC - 240
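
For readers who find the one-liner dense, the same accumulate-and-flush idea can be laid out as a commented shell function. This is only a sketch: the name fasta_lengths is hypothetical, it reads the FASTA text from standard input, and, unlike the one-liner, it prints 0 for a header that has no sequence lines.

```shell
# Hypothetical helper (not part of the answer above): reads FASTA text on
# stdin and prints "ID - LENGTH" for each record.
fasta_lengths() {
  awk '
    # Header line: flush the previous record (if any), then start a new one.
    /^>/ {
      if (id != "") print id, "-", len
      id = substr($0, 2)   # drop the leading ">"
      len = 0
      next
    }
    # Sequence line: add its length to the running total.
    { len += length($0) }
    # The last record has no following header to trigger the flush.
    END { if (id != "") print id, "-", len }
  '
}
```

It would be called as fasta_lengths < file.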
Steve
awk -vRS='>' '$1{gsub( "[\r]", "",$1 ); 
              printf "%s - %d\n", $1, length($0) - length($1) - NF + 1}' input
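
The idea here is to make ">" the record separator, so that each awk record holds one header together with its sequence lines. A more explicit sketch of that idea follows; it sums the lengths of fields 2..NF directly instead of using the arithmetic on length($0). The function name fasta_lengths_rs is hypothetical, and stripping carriage returns from the whole record (rather than only from $1) is an assumption about what the input might need.

```shell
# Hypothetical helper: reads FASTA text on stdin, one awk record per ">" block.
fasta_lengths_rs() {
  awk -v RS='>' '
    # Assumption: remove any carriage returns (CRLF input) from the record.
    { gsub(/\r/, "") }
    # With the default FS, newlines also separate fields, so $1 is the ID
    # and $2..$NF are the sequence lines. The text before the first ">"
    # yields an empty record, which this condition skips.
    $1 != "" {
      total = 0
      for (i = 2; i <= NF; i++) total += length($i)
      printf "%s - %d\n", $1, total
    }
  '
}
```

It would be called as fasta_lengths_rs < input. Because fields are split on any whitespace, a header such as ">1DMLA some description" would have its description words counted as sequence, so this form suits files whose headers contain only the ID.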
perreal