14

I have a FASTA file as shown below. I would like to convert the three-letter codes to one-letter codes. How can I do this with Python or R?

>2ppo
ARGHISLEULEULYS
>3oot
METHISARGARGMET

desired output

>2ppo
RHLLK
>3oot
MHRRM

your suggestions would be appreciated!!

user1725152
  • How is `ARGHISLEULEULYS` converted to `RHLLK`? What is the logic? –  Oct 06 '12 at 13:41
  • @Tichodroma: ARG = R, HIS = H, LEU = L, etc – Junuxx Oct 06 '12 at 13:42
  • @Junuxx etc.? It would be useful to add the complete translation list to the question or at least link to it. I'd like to help with this question but I'm unable unless I get all necessary information. –  Oct 06 '12 at 13:43
  • @Tichodroma: http://en.wikipedia.org/wiki/Amino_acid#Table_of_standard_amino_acid_abbreviations_and_properties – Junuxx Oct 06 '12 at 13:44
  • ah, so you need to split the string into an array take every 3rd element of the array as your final string? – caitriona Oct 06 '12 at 13:45
  • How about: https://stat.ethz.ch/pipermail/bioconductor/2008-January/020958.html – Ben Bolker Oct 06 '12 at 22:18
  • I'm curious where you found such a file - I've never seen a FASTA file using three letter amino acid codes like that. – Peter Cock Dec 07 '12 at 15:57

11 Answers

18

Biopython already has built-in dictionaries to help with such translations. The following commands show the full list of available dictionaries:

import Bio.SeqUtils
help(Bio.SeqUtils.IUPACData)

The predefined dictionary you are looking for (its keys are capitalized, e.g. 'Ala' rather than 'ALA'):

Bio.SeqUtils.IUPACData.protein_letters_3to1['Ala']
Henk Neefs
  • This ought to be the chosen answer. A small note: in Python 3 at least, the dictionary actually lives in the module `Bio.Data`, and `Bio.SeqUtils` imports it from there. So if you want only `protein_letters_3to1` in the current namespace, you can do: `from Bio.Data.IUPACData import protein_letters_3to1` – Matteo Ferla Jun 10 '19 at 16:00
17

Use a dictionary to look up the one letter codes:

d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

And a simple function to match the three letter codes with one letter codes for the entire string:

def shorten(x):
    if len(x) % 3 != 0: 
        raise ValueError('Input length should be a multiple of three')

    y = ''
    for i in range(len(x) // 3):
        y += d[x[3 * i : 3 * i + 3]]
    return y

Testing your example:

>>> shorten('ARGHISLEULEULYS')
'RHLLK'
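
To run this over a whole file rather than a single string, something like the following sketch could work. It assumes every header line starts with `>` and that each sequence line holds a whole number of three-letter codes (no line wrapping in the middle of a code). The names `THREE_TO_ONE` and `convert_fasta` are made up for this example, and `io.StringIO` merely stands in for an opened input file.

```python
import io

# Standard 20 amino acids, upper-case three-letter code -> one-letter code
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLU': 'E', 'GLN': 'Q', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}


def shorten(seq):
    """Translate one line of concatenated three-letter codes."""
    seq = seq.strip().upper()
    if len(seq) % 3 != 0:
        raise ValueError('Input length should be a multiple of three')
    return ''.join(THREE_TO_ONE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def convert_fasta(lines):
    """Yield header and blank lines unchanged, sequence lines translated."""
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>') or not line:
            yield line
        else:
            yield shorten(line)


# Stand-in for: with open('input.fasta') as handle: ...
handle = io.StringIO('>2ppo\nARGHISLEULEULYS\n>3oot\nMETHISARGARGMET\n')
for out_line in convert_fasta(handle):
    print(out_line)
```

For the two records in the question this prints the headers unchanged, followed by RHLLK and MHRRM respectively. A code that is missing from the dictionary raises a `KeyError`, which is probably what you want while the table is still incomplete.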
Junuxx
  • Thank you very much for your answer. I am new to python. How can I parse the input file to your code? – user1725152 Oct 06 '12 at 14:25
  • @user1725152: That depends on the format of the input file. But I imagine it could be something like `for line in inputfile: print(shorten(line))`. – Junuxx Oct 06 '12 at 14:27
  • `len(x) / 3` returns a float in Python 3, so if you get the error `TypeError: 'float' object cannot be interpreted as an integer`, simply change the loop to `for i in range(int(len(x)/3)):` – universvm Jan 14 '22 at 11:19
  • @universvm: Thanks for the comment. This is from 2012, so it was written in Python 2 where `len(x) / 3` would return an int. Updated the answer to use integer division. – Junuxx Jan 14 '22 at 20:44
7

Here is a way to do it in R:

# Variables:
foo <- c("ARGHISLEULEULYS","METHISARGARGMET")

# Code maps:
code3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His", 
"Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", 
"Tyr", "Val")
code1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", 
"M", "F", "P", "S", "T", "W", "Y", "V")

# For each code replace 3letter code by 1letter code:
for (i in 1:length(code3))
{
    foo <- gsub(code3[i],code1[i],foo,ignore.case=TRUE)
}

Results in:

> foo
[1] "RHLLK" "MHRRM"

Note that I changed the variable name as variable names are not allowed to start with a number in R.

Sacha Epskamp
  • This isn't good. Take TRPHISGLU as an example, you expect the algorithm to translate as follows {TRP}{HIS}{GLU} -> WHE but what really happens with your algorithm is TRP{HIS}{GLU} -> TR{PHE} -> TRF. You do need to split `foo` into substrings of three characters to avoid such possible interactions. – flodel Oct 06 '12 at 18:30
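
The interaction described in this comment is easy to reproduce. The sketch below is in Python rather than R, and the names `naive` and `chunked` are made up for illustration. It applies the same replacements in the same order to TRPHISGLU and compares the result with splitting into blocks of three first.

```python
# Same code order as the R vectors code3 / code1 (upper-cased here)
CODE3 = ["ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN", "GLY", "HIS", "ILE",
         "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"]
CODE1 = "ARNDCEQGHILKMFPSTWYV"


def naive(seq):
    """Global search-and-replace, one code after another (ignores the frame)."""
    for three, one in zip(CODE3, CODE1):
        seq = seq.replace(three, one)
    return seq


def chunked(seq):
    """Split into blocks of three first, then look each block up."""
    table = dict(zip(CODE3, CODE1))
    return "".join(table[seq[i:i + 3]] for i in range(0, len(seq), 3))


print(naive("TRPHISGLU"))    # TRF
print(chunked("TRPHISGLU"))  # WHE
```

In the naive version, GLU and HIS are replaced first, which leaves TRPHE; the PHE that now spans the old boundary is then replaced before TRP is ever tried.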
6
>>> src = "ARGHISLEULEULYS"
>>> trans = {'ARG':'R', 'HIS':'H', 'LEU':'L', 'LYS':'K'}
>>> "".join(trans[src[x:x+3]] for x in range(0, len(src), 3))
'RHLLK'

You just need to add the rest of the entries to the trans dict.

Edit:

To make the rest of trans, you can do this. File table:

Ala A
Arg R
Asn N
Asp D
Cys C
Glu E
Gln Q
Gly G
His H
Ile I
Leu L
Lys K
Met M
Phe F
Pro P
Ser S
Thr T
Trp W
Tyr Y
Val V

Read it:

with open("table") as f:
    trans = dict((three.upper(), one) for three, one in
                 (row.split() for row in f))
John La Rooy
4

You may want to look into installing Biopython, since you are parsing a .fasta file and then converting to one-letter codes. Unfortunately, Biopython only has the function seq3 (in the module Bio.SeqUtils), which does the inverse of what you want. Example output in IDLE:

>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'

Unfortunately, there is no 'seq1' function (yet...), but I thought this might be helpful to you in the future. As for your problem, Junuxx is correct: create a dictionary and use a for loop to read the string in blocks of three and translate. Here is a function similar to the one he provided that is all-inclusive and handles lowercase input as well.

def AAcode_3_to_1(seq):
    '''Turn a three letter protein into a one letter protein.

    The 3 letter code can be upper, lower, or any mix of cases.
    The seq input length should be a multiple of 3, or else an
    error message is printed and None is returned.

    >>> AAcode_3_to_1('METHISARGARGMET')
    'MHRRM'

    '''
    d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
         'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
         'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 'TER': '*',
         'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M', 'XAA': 'X'}

    if len(seq) % 3 == 0:
        upper_seq = seq.upper()
        single_seq = ''
        # Integer division: range() needs an int in Python 3
        for i in range(len(upper_seq) // 3):
            single_seq += d[upper_seq[3*i:3*i+3]]
        return single_seq
    else:
        print("ERROR: Sequence length was not a multiple of 3!")
Wes Field
  • You'll be able to use `Bio.SeqUtils.seq1` as of the next release, Biopython 1.61 (or run from the github repository if you like being on the leading edge). – Peter Cock Dec 07 '12 at 15:56
4

Biopython has a nice solution

>>> from Bio.PDB.Polypeptide import *
>>> three_to_one('ALA')
'A'

For your example, I would solve it with this one-liner:

>>> from Bio.PDB.Polypeptide import *
>>> str3aa = 'ARGHISLEULEULYS'
>>> "".join([three_to_one(aa3) for aa3 in [ "".join(g) for g in zip(*(iter(str3aa),) * 3)]])
'RHLLK'

They may criticize me for this type of one liner :), but deep in my heart I am still in love with PERL.
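
The `zip(*(iter(str3aa),) * 3)` part is the dense bit: it hands the same iterator to `zip` three times, so each output tuple consumes three consecutive characters. Here is a small standard-library sketch of just that idiom (the helper name `chunks_of_three` is made up for this example). Note that `zip` silently drops a trailing incomplete group.

```python
def chunks_of_three(s):
    """Group a string into consecutive three-character blocks."""
    it = iter(s)
    # zip pulls from the *same* iterator three times per tuple
    return ["".join(group) for group in zip(it, it, it)]


print(chunks_of_three("ARGHISLEULEULYS"))  # ['ARG', 'HIS', 'LEU', 'LEU', 'LYS']
print(chunks_of_three("ARGHI"))            # ['ARG'], the leftover 'HI' is dropped
```

If a truncated sequence should be an error rather than be shortened quietly, check `len(s) % 3` before grouping.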

ghosh'.
3

Using R:

convert <- function(l) {

  map <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
           "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

  names(map) <- c("ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN",
                  "GLY", "HIS", "ILE", "LEU", "LYS", "MET", "PHE",
                  "PRO", "SER", "THR", "TRP", "TYR", "VAL")

  sapply(strsplit(l, "(?<=[A-Z]{3})", perl = TRUE),
         function(x) paste(map[x], collapse = ""))
}

convert(c("ARGHISLEULEULYS", "METHISARGARGMET"))
# [1] "RHLLK" "MHRRM"
flodel
  • +1 for the clever method of splitting a string into 3-character substrings. It demonstrates something interesting about how regex-matching works. – Josh O'Brien Oct 06 '12 at 21:33
  • @flodel Thank you very much for your answer. I have more than 1000 sequences in a text file. First I have to import this file into R and then change the three-letter codes to one-letter codes. I have shown the desired output. If you can, please help me. – user1725152 Oct 07 '12 at 00:08
  • The function I showed you takes a vector of sequences as input. How to read a FASTA file into a vector of sequences in R is a different question. A quick Google search and I can point you to at least three different packages: `Biostrings (readFASTA)`, `seqinr (read.fasta)`, `bio3d (read.fasta)`. – flodel Oct 07 '12 at 00:37
2

Another way to do it is with the seqinr and iPAC packages in R.

# install.packages("seqinr")
# source("https://bioconductor.org/biocLite.R")
# biocLite("iPAC")

library(seqinr)
library(iPAC)

#read in file
fasta = read.fasta(file = "test_fasta.fasta", seqtype = "AA", as.string = T, set.attributes = F)
#split string
n = 3
fasta1 = lapply(fasta, function(x) substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n)))
#convert the three letter code for each element in the list 
fasta2 = lapply(fasta1, function(x) paste(sapply(x, get.SingleLetterCode), collapse = ""))

# > fasta2
# $`2ppo`
# [1] "RHLLK"
#
# $`3oot`
# [1] "MHRRM"
paul_dg
2

For those who land here in 2017 and beyond:

Here's a single-line Linux bash command to convert three-letter amino acid codes to single-letter codes in a text file. I know this is not very elegant, but I hope it helps someone who is searching for the same thing and wants a single-line command.

sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/HID/H/g;s/HIE/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g;s/MSE/X/g' < input_file_three_letter_code.txt > output_file_single_letter_code.txt

Solution for the original question above, as a single command line:

sed 's/.\{3\}/& /g' input_file_three_letter_code.txt | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/HID/H/g;s/HIE/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g;s/MSE/X/g' | sed 's/ //g' > output_file_single_letter_code.txt

Explanation:

[1] sed 's/.\{3\}/& /g' will split the sequence by adding a space after every third character.

[2] The second 'sed' command in the pipe will take the output of above and convert to single letter code. Add any non-standard residue as s/XYZ/X/g; to this command.

[3] The third 'sed' command, sed 's/ //g' will remove white-space.

Insilico
1
my %aa_hash=(
  Ala=>'A',
  Arg=>'R',
  Asn=>'N',
  Asp=>'D',
  Cys=>'C',
  Glu=>'E',
  Gln=>'Q',
  Gly=>'G',
  His=>'H',
  Ile=>'I',
  Leu=>'L',
  Lys=>'K',
  Met=>'M',
  Phe=>'F',
  Pro=>'P',
  Ser=>'S',
  Thr=>'T',
  Trp=>'W',
  Tyr=>'Y',
  Val=>'V',
  Sec=>'U',                       #http://www.uniprot.org/manual/non_std;Selenocysteine (Sec) and pyrrolysine (Pyl)
  Pyl=>'O',
);


    while(<>){
            chomp;
            my $aa=$_;
            warn "ERROR!! $aa invalid or not found in hash\n" if !$aa_hash{$aa};
            print "$aa\t$aa_hash{$aa}\n";
    }

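Use this Perl script to convert three-letter amino acid codes to single-letter codes. It reads one code per line (capitalized as in the hash keys, e.g. Ala) and prints the code and its one-letter equivalent separated by a tab.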

0

Python 3 solutions.

In my work, the annoying part is that the amino acid codes can refer to modified residues, which often appear in PDB/mmCIF files, like

'Tih'-->'A'.

So the mapping can have more than 22 pairs. Third-party tools in Python like

Bio.SeqUtils.IUPACData.protein_letters_3to1

cannot handle it. My easiest solution is to use http://www.ebi.ac.uk/pdbe-srv/pdbechem to find the mapping and add the unusual mappings to the dict in my own function whenever I encounter them.

def three_to_one(three_letter_code):
    mapping = {'Aba':'A','Ace':'X','Acr':'X','Ala':'A','Aly':'K','Arg':'R','Asn':'N','Asp':'D','Cas':'C',
           'Ccs':'C','Cme':'C','Csd':'C','Cso':'C','Csx':'C','Cys':'C','Dal':'A','Dbb':'T','Dbu':'T',
           'Dha':'S','Gln':'Q','Glu':'E','Gly':'G','Glz':'G','His':'H','Hse':'S','Ile':'I','Leu':'L',
           'Llp':'K','Lys':'K','Men':'N','Met':'M','Mly':'K','Mse':'M','Nh2':'X','Nle':'L','Ocs':'C',
           'Pca':'E','Phe':'F','Pro':'P','Ptr':'Y','Sep':'S','Ser':'S','Thr':'T','Tih':'A','Tpo':'T',
           'Trp':'W','Tyr':'Y','Unk':'X','Val':'V','Ycm':'C','Sec':'U','Pyl':'O'} # you can add more
    return mapping[three_letter_code[0].upper() + three_letter_code[1:].lower()]
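
If raising a `KeyError` on an unknown residue is not what you want, one option is a fallback value. This is a minimal sketch, assuming 'X' is an acceptable placeholder for anything not in the table. That choice, the abbreviated `mapping`, and the name `three_to_one_or_x` are all assumptions made for illustration.

```python
def three_to_one_or_x(three_letter_code, mapping, default="X"):
    """Look up a residue code case-insensitively, falling back to `default`."""
    # 'MSE' -> 'Mse', 'nh2' -> 'Nh2': same key style as the mapping above
    key = three_letter_code.capitalize()
    return mapping.get(key, default)


# Abbreviated table for the example; in practice pass the full mapping
mapping = {"Ala": "A", "Mse": "M", "Tih": "A", "Unk": "X"}

print(three_to_one_or_x("MSE", mapping))  # M
print(three_to_one_or_x("zzz", mapping))  # X
```

The trade-off is that typos in the input become 'X' silently, so logging the unknown codes somewhere is worth considering.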

The other solution is to retrieve the mapping online (but the URL and the HTML pattern may change over time):

import re
import urllib.request

def three_to_one_online(three_letter_code):
    url = "http://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/" + three_letter_code
    with urllib.request.urlopen(url) as response:
        single_letter_code = re.search(r'\s*<td\s*>\s*<h3>One-letter code.*</h3>\s*</td>\s*<td>\s*([A-Z])\s*</td>', response.read().decode('utf-8')).group(1)
    return single_letter_code

Here I use re directly rather than an HTML parser, for simplicity.

Hope these can help.

Young