I have genotype data with 40,000 rows (SNPs) and 500 columns (humans) that looks like this:

AA AG GG GA AA
CC CG CC GC GG
AC CC CA CA CC

This example shows only 3 SNPs and 5 humans.

I need to convert the letters to numbers using the keys below. Note that the three letters A, C and G cannot all occur in one row: a row contains only A and C, or A and G, or C and G.

If A is present in the row, the key is:

AA = 0
AG = 1
GG = 2
AC = 1
CC = 2

If A is not present, the key is:

CC = 0 
CG = 1 
GG = 2

Notice that CC is 2 in one case and 0 in the other.

So the example will look like:

0 1 2 1 0
0 1 0 1 2
1 2 1 1 2

How can I do this in R for all rows and columns?

Thank you!

Zoomman
  • Hi, it would help a lot to have a sample of your data in a format that one could paste directly into R, as suggested here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – snaut Nov 02 '18 at 12:30
  • m <- matrix(c("AA", "AG", "GG", "GA", "AA", "CC", "CG", "CC", "GC", "GG", "AC", "CC", "CA", "CA", "CC"), 3, 5, byrow = TRUE) – Zoomman Nov 02 '18 at 12:35

1 Answer

There are many ways to solve this. I would first create an index vector for the rows that contain A, and then apply the replacements to the two groups of rows separately, using the recode function from the dplyr package.

# Creating the Matrix
X <- matrix(
  c("AA", "AG", "GG", "GA", "AA",
    "CC", "CG", "CC", "GC", "GG",
    "AC", "CC", "CA", "CA", "CC"), byrow=TRUE, nrow=3)

# Logical index: TRUE for each row (SNP) that contains at least one A
index_a <- apply(X, 1, function(i) {
  any(grepl("A", i))
})

# NA matrix for the result
Y <- matrix(NA_integer_, nrow(X), ncol(X))

# First replacement: rows that contain A (GA and CA are the reversed forms of AG and AC)
Y[index_a, ] <- dplyr::recode(
  X[index_a, ],
  AA = 0L,
  AG = 1L,
  GG = 2L,
  AC = 1L,
  CC = 2L,
  GA = 1L,
  CA = 1L
)

# Second replacement: rows without A (GC is the reversed form of CG)
Y[!index_a, ] <- dplyr::recode(
  X[!index_a, ],
  CC = 0L, 
  CG = 1L, 
  GG = 2L,
  GC = 1L
)
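
As a possible alternative that avoids the dplyr dependency, here is a minimal base-R sketch. It relies on an observation about the keys above: in both cases the code equals the number of letters in the genotype that differ from a per-row reference allele (A if the row contains A, otherwise C). It assumes every genotype is exactly two letters and that the "only two alleles per row" rule from the question holds. The helper name encode_genotypes is made up for illustration.

```r
# Example matrix from the question: rows are SNPs, columns are individuals
X <- matrix(
  c("AA", "AG", "GG", "GA", "AA",
    "CC", "CG", "CC", "GC", "GG",
    "AC", "CC", "CA", "CA", "CC"), byrow = TRUE, nrow = 3)

# Hypothetical helper (not from the answer above): counts, for each genotype,
# how many of its two letters differ from the row's reference allele.
encode_genotypes <- function(X) {
  # TRUE for each row that contains at least one A
  has_a <- rowSums(matrix(grepl("A", X, fixed = TRUE), nrow = nrow(X))) > 0
  # Reference allele per row: A if present, otherwise C
  ref <- ifelse(has_a, "A", "C")
  first  <- substr(X, 1, 1)
  second <- substr(X, 2, 2)
  # ref has one entry per row and is recycled down each column,
  # so every cell is compared with the reference of its own row
  Y <- (first != ref) + (second != ref)
  dim(Y) <- dim(X)
  Y
}

Y <- encode_genotypes(X)
Y
```

Because this works on the whole matrix at once, it should scale reasonably to the 40000 x 500 case. Like the recode approach, it decides "A is present" from the observed data, so a SNP where A is a possible allele but no individual carries it would be treated as a row without A.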
snaut