
I am working with some genetic data and one of my columns isn't in the format I want it to be. I don't know how much biology is talked about on here, but I am trying to fix how my amino acids are shown in my data.

Amino acids obviously have a name but they also have a 3 letter abbreviation and a 1 letter abbreviation. My data has the amino acids in the 3 letter form but I want to change them to the 1 letter abbreviation. Here is an example of my data.

 chr location           effect   impact AA_change
   1    12543 missense_variant MODERATE  p.Ala12Val
   1    52367 missense_variant MODERATE  p.Leu54Pro
   1   752347 missense_variant MODERATE  p.Met99Ser
   1   984645 missense_variant MODERATE  p.Lys34Ile
   1   989845 missense_variant MODERATE  p.Arg4Cys
   1   999854 missense_variant MODERATE  p.His43Gly
   1   999855 missense_variant MODERATE  p.Glu14Phe

dat <- structure(list(chr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), location = c(12543L, 
52367L, 752347L, 984645L, 989845L, 999854L, 999855L), effect = c("missense_variant", 
"missense_variant", "missense_variant", "missense_variant", "missense_variant", 
"missense_variant", "missense_variant"), impact = c("MODERATE", 
"MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE", "MODERATE"
), AA_change = c("Ala12Val", "Leu54Pro", "Met99Ser", "Lys34Ile", 
"Arg4Cys", "His43Gly", "Glu14Phe")), .Names = c("chr", "location", 
"effect", "impact", "AA_change"), row.names = c(NA, -7L), class = "data.frame")

Here is a list of the 3 letter amino acid codes and their 1 letter abbreviations.

  Ala == A
  Arg == R
  Asn == N
  Asp == D
  Cys == C
  Glu == E
  Gln == Q
  Gly == G
  His == H
  Ile == I
  Leu == L
  Lys == K
  Met == M
  Phe == F
  Pro == P
  Ser == S
  Thr == T
  Trp == W
  Tyr == Y
  Val == V

I feel like a simple function could do this, but I am struggling to think of how to write it. I am used to changing just one part of a column, not two things at once. So what I am asking is how can I change this

Ala12Val
Leu54Pro
Met99Ser
Lys34Ile
Arg4Cys
His43Gly
Glu14Phe

To this

A12V
L54P
M99S
K34I
R4C
H43G
E14F

Is this something that can be done?

neuron
  • 3
    Is it always in this format "3 letters some digits and again 3 letters"? – zx8754 Jul 02 '18 at 20:05
  • 1
    If the format is fixed (per zx's query), you can split the col up and match/merge to update. If it's not fixed, you could use regex: https://stackoverflow.com/q/6954017/ – Frank Jul 02 '18 at 20:15
  • @zx8754 That is correct. It is always 3 letter, some digits, and then 3 letter again – neuron Jul 02 '18 at 20:18
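The regex route mentioned in the comments can be sketched as a loop of literal replacements, one per amino acid code. This is only an illustration under the confirmed format (3 letters, digits, 3 letters); the vector `x` is example input, and `AAmap` is simply the mapping from the question written as a named vector.

```r
# Named vector: names are 3-letter codes, values are 1-letter codes
AAmap <- setNames(
  c("A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"),
  c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile",
    "Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"))

# Example input (assumed format: 3 letters, digits, 3 letters)
x <- c("Ala12Val", "Leu54Pro", "Arg4Cys")

# Replace each 3-letter code in turn; fixed = TRUE treats the code as literal text
for (code in names(AAmap)) {
  x <- gsub(code, AAmap[[code]], x, fixed = TRUE)
}

print(x)
# [1] "A12V" "L54P" "R4C"
```

Because the digits always sit between the two codes, a replaced single letter cannot combine with its neighbours into a new 3-letter code, so the order of the replacements does not matter here.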

3 Answers

2

Make a lookup for the amino acids, then map the first 3 letters, extract the digits, and map the last 3 letters. Then paste everything together.

# lookup map
AAmap <- setNames(c("A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"),
                  c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile","Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"))

# get first 3 map to AA, get digits, get last 3 map to AA
dat$AA_change_short <-
  paste0(AAmap[ substr(dat$AA_change, 1, 3) ],
         gsub("[^\\d]+", "", dat$AA_change, perl = TRUE),
         AAmap[ substr(dat$AA_change, nchar(dat$AA_change) - 2, nchar(dat$AA_change)) ])

dat
#   chr location           effect   impact AA_change AA_change_short
# 1   1    12543 missense_variant MODERATE  Ala12Val            A12V
# 2   1    52367 missense_variant MODERATE  Leu54Pro            L54P
# 3   1   752347 missense_variant MODERATE  Met99Ser            M99S
# 4   1   984645 missense_variant MODERATE  Lys34Ile            K34I
# 5   1   989845 missense_variant MODERATE   Arg4Cys             R4C
# 6   1   999854 missense_variant MODERATE  His43Gly            H43G
# 7   1   999855 missense_variant MODERATE  Glu14Phe            E14F
zx8754
  • Shoot! I was actually wrong about the structure. It is 5 characters, some numbers, and then 3 characters. What you wrote works great! I am just getting NA for the first amino acid because it has a p. in front of the 3 letter abbreviation. – neuron Jul 02 '18 at 20:38
  • 2
    @Brian Then just remove first 2 characters, then follow same above code: `dat$AA_change <- gsub("p.", "", dat$AA_change, fixed = TRUE)` – zx8754 Jul 02 '18 at 20:40
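For data that still carries the `p.` prefix, the prefix removal and the lookup can be wrapped in one helper. The following is a minimal sketch rather than part of the answer above: `shorten_aa` is a hypothetical name, the input vector is made up for illustration, and the code assumes each value is an optional leading `p.` followed by 3 letters, digits, and 3 letters.

```r
# Named vector: names are 3-letter codes, values are 1-letter codes
AAmap <- setNames(
  c("A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T","W","Y","V"),
  c("Ala","Arg","Asn","Asp","Cys","Glu","Gln","Gly","His","Ile",
    "Leu","Lys","Met","Phe","Pro","Ser","Thr","Trp","Tyr","Val"))

# shorten_aa() is a hypothetical helper, not an existing function
shorten_aa <- function(x) {
  x <- sub("^p\\.", "", x)            # anchored: only a leading "p." is removed
  n <- nchar(x)
  first <- AAmap[substr(x, 1, 3)]     # map the first 3 letters
  last  <- AAmap[substr(x, n - 2, n)] # map the last 3 letters
  short <- paste0(first, substr(x, 4, n - 3), last)
  # paste0() would turn a failed lookup into the text "NA", so flag it explicitly
  short[is.na(first) | is.na(last)] <- NA_character_
  short
}

res <- shorten_aa(c("p.Ala12Val", "p.Arg4Cys", "Lys34Ile", "p.Xxx1Val"))
print(res)
# [1] "A12V" "R4C"  "K34I" NA
```

The anchored pattern `^p\\.` only touches the start of the string, whereas a fixed `"p."` replacement would remove that pair of characters wherever it occurs.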
2
# lookup table: V1 = 3-letter code, V3 = 1-letter code (the "==" column is dropped)
dat2 = read.table(text="Ala == A
  Arg == R
  Asn == N
  Asp == D
  Cys == C
  Glu == E
  Gln == Q
  Gly == G
  His == H
  Ile == I
  Leu == L
  Lys == K
  Met == M
  Phe == F
  Pro == P
  Ser == S
  Thr == T
  Trp == W
  Tyr == Y
  Val == V", stringsAsFactors = FALSE)[-2]

# find the non-digit runs, then look each one up with match() so the
# replacements stay in string order (an adist()-based lookup returns hits in
# table order, which gives I34K and G43H for rows 4 and 6)
m = gregexpr("\\D+", dat$AA_change)
dat$AA_change1 = `regmatches<-`(dat$AA_change, m,
                                value = lapply(regmatches(dat$AA_change, m),
                                               function(x) dat2$V3[match(x, dat2$V1)]))

dat
  chr location           effect   impact AA_change AA_change1
1   1    12543 missense_variant MODERATE  Ala12Val       A12V
2   1    52367 missense_variant MODERATE  Leu54Pro       L54P
3   1   752347 missense_variant MODERATE  Met99Ser       M99S
4   1   984645 missense_variant MODERATE  Lys34Ile       K34I
5   1   989845 missense_variant MODERATE   Arg4Cys        R4C
6   1   999854 missense_variant MODERATE  His43Gly       H43G
7   1   999855 missense_variant MODERATE  Glu14Phe       E14F
Onyambu
2

If it's always of the form {acid, numbers, acid} you could split it into three columns and make the substitution with match or a join. With data.table, this looks like...

library(data.table)
setDT(dat)

# put your mapping into a nicer format
abbrDT = fread(header = FALSE,"
  Ala == A
  Arg == R
  Asn == N
  Asp == D
  Cys == C
  Glu == E
  Gln == Q
  Gly == G
  His == H
  Ile == I
  Leu == L
  Lys == K
  Met == M
  Phe == F
  Pro == P
  Ser == S
  Thr == T
  Trp == W
  Tyr == Y
  Val == V")[, .(abbr3 = V1, abbr1 = V3)] 

# split the column
patt = "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"
dat[, c("AA1", "num", "AA2") := tstrsplit(AA_change, patt, perl=TRUE)]

# substitute for each part
dat[abbrDT, on=.(AA1 = abbr3), AA1 := abbr1]
dat[abbrDT, on=.(AA2 = abbr3), AA2 := abbr1]

which gives

   chr location           effect   impact AA_change AA1 num AA2
1:   1    12543 missense_variant MODERATE  Ala12Val   A  12   V
2:   1    52367 missense_variant MODERATE  Leu54Pro   L  54   P
3:   1   752347 missense_variant MODERATE  Met99Ser   M  99   S
4:   1   984645 missense_variant MODERATE  Lys34Ile   K  34   I
5:   1   989845 missense_variant MODERATE   Arg4Cys   R   4   C
6:   1   999854 missense_variant MODERATE  His43Gly   H  43   G
7:   1   999855 missense_variant MODERATE  Glu14Phe   E  14   F

Optionally, combine the columns again and remove unneeded columns:

dat[, AA_change := paste0(AA1, num, AA2)]

dat[, c("AA1", "num", "AA2") := NULL]
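The same split-then-substitute idea can be sketched in base R with `match` instead of a join. This is an illustration only: the input vector `x` and the shortened mapping table `abbr` are made up for the example, and the code assumes every value matches the pattern (a non-matching value would contribute no row to `parts` and misalign the result).

```r
# Example input, assumed to follow <3 letters><digits><3 letters>
x <- c("Ala12Val", "Lys34Ile", "Arg4Cys")

# Mapping as a two-column table (only the codes needed here, for brevity)
abbr <- data.frame(abbr3 = c("Ala", "Val", "Lys", "Ile", "Arg", "Cys"),
                   abbr1 = c("A", "V", "K", "I", "R", "C"),
                   stringsAsFactors = FALSE)

# Split: regexec() returns the full match plus one entry per capture group
m <- regmatches(x, regexec("^([A-Za-z]{3})([0-9]+)([A-Za-z]{3})$", x))
parts <- do.call(rbind, m)   # character matrix; column 1 is the full match

# Substitute each end with match(), keep the number as an integer column
out <- data.frame(AA1 = abbr$abbr1[match(parts[, 2], abbr$abbr3)],
                  num = as.integer(parts[, 3]),
                  AA2 = abbr$abbr1[match(parts[, 4], abbr$abbr3)],
                  stringsAsFactors = FALSE)

# Recombine into the short form
out$short <- paste0(out$AA1, out$num, out$AA2)
print(out)
```

Keeping the three parts as separate columns, as in the data.table version, makes it easy to filter or sort by position before recombining.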
Frank