1

I have a dataframe called SNP with 6000 columns and 500 rows. The dataframe looks like the following table:

|sample         |   SNP1   |   SNP2   |  SNP3    |     
|:--------------|---------:|---------:|---------:|
|s1             |   AA     |   TT     |   GG     |   
|s2             |   CC     |   AT     |   AA     |   
|s3             |   AT     |   CC     |   AA     | 

I have to rearrange the format for analysis. What I'd like to do is for each character (I mean "AA", "CC", "AT" for 2nd column, SNP1, for example) in every SNP column, I want to make 2 consecutive rows for each sample and then want to split the characters and assigned them into the newly created two rows for a sample. For your convenience, the below table format I want to make it:

|sample         |   SNP1   |   SNP2   |  SNP3    |     
|:--------------|---------:|---------:|---------:|
|s1             |   A      |   T      |   G      |   
|s1             |   A      |   T      |   G      |   
|s2             |   C      |   A      |   A      | 
|s2             |   C      |   T      |   A      |
|s3             |   A      |   C      |   A      |
|s3             |   T      |   C      |   A      |

It will be highly appreciated if someone can help me to solve.

N.B: I have added a screenshot of datafram in case the format I added is displaced, I am new here.

Cettt
  • 11,460
  • 7
  • 35
  • 58
AFI
  • 13
  • 3

2 Answers2

1

Here is one possible solution using the tidyverse package and regular expressions:

library(tidyverse)

snp <- data.frame(sample = c("s1", "s2", "s3"),
                  SNP1 = c("AA", "CC", "AT"),
                  SNP2 = c("TT", "AT", "CC"),
                  SNP3 = c("GG", "AA", "AA"))


snp %>% mutate_at(-1, ~str_extract(.x, "^.")) %>%
  bind_rows(mutate_at(snp, -1, ~str_extract(.x, ".$"))) %>%
  arrange(sample)

  sample SNP1 SNP2 SNP3
1     s1    A    T    G
2     s1    A    T    G
3     s2    C    A    A
4     s2    C    T    A
5     s3    A    C    A
6     s3    T    C    A

Short explanation: first I extract the first letter (^.) of all columns except the first (hence -1). Then I extract the last letter (.$) of all columns except the first. Finally I paste these two data.frames together and rearrange the result using arrange.

Cettt
  • 11,460
  • 7
  • 35
  • 58
1

Here's a spiffy data.table solution.

You may find this post to be instructive. stack overflow example illustrating tstrsplit

require(data.table)
df = data.table(sample=c("s1","s2","s3"),
   SNP1=c("AA","CC","AT"),
   SNP2=c("TT","AT","CC"),
   SNP3=c("GG","AA","AA"))


# get the names of the relevant columns

x=grep("SNP",names(df),value = TRUE)

# split each column.  The 'by' functionality will 
# manage the stacking of the results.

df[,lapply(.SD,tstrsplit,'',fixed=TRUE),.SDcols=x,by=sample]

output:

 |sample |SNP1 |SNP2 |SNP3 |
 |:------|:----|:----|:----|
 |s1     |A    |T    |G    |
 |s1     |A    |T    |G    |
 |s2     |C    |A    |A    |
 |s2     |C    |T    |A    |
 |s3     |A    |C    |A    |
 |s3     |T    |C    |A    |

If the strings were a fixed length but not necessarily of length 2, this still seems to work. I did NOT test this on NULL strings. caveat emptor

Michael Tuchman
  • 332
  • 3
  • 12