Splitting data of a column and placing them into two consecutive rows for the same observation

Question

I have a dataframe called SNP with 6000 columns and 500 rows. The dataframe looks like the following table:

|sample         |   SNP1   |   SNP2   |  SNP3    |     
|:--------------|---------:|---------:|---------:|
|s1             |   AA     |   TT     |   GG     |   
|s2             |   CC     |   AT     |   AA     |   
|s3             |   AT     |   CC     |   AA     |

I have to rearrange the format for analysis. What I'd like to do is for each character (I mean "AA", "CC", "AT" for 2nd column, SNP1, for example) in every SNP column, I want to make 2 consecutive rows for each sample and then want to split the characters and assigned them into the newly created two rows for a sample. For your convenience, the below table format I want to make it:

|sample         |   SNP1   |   SNP2   |  SNP3    |     
|:--------------|---------:|---------:|---------:|
|s1             |   A      |   T      |   G      |   
|s1             |   A      |   T      |   G      |   
|s2             |   C      |   A      |   A      | 
|s2             |   C      |   T      |   A      |
|s3             |   A      |   C      |   A      |
|s3             |   T      |   C      |   A      |

It will be highly appreciated if someone can help me to solve.

N.B: I have added a screenshot of datafram in case the format I added is displaced, I am new here.

I first tried with separate function of tidyverse package but not worked, so I asked help. — AFI, Oct 17 '19 at 17:39

score 1 · Accepted Answer · answered Oct 17 '19 at 15:26

Here is one possible solution using the tidyverse package and regular expressions:

library(tidyverse)

snp <- data.frame(sample = c("s1", "s2", "s3"),
                  SNP1 = c("AA", "CC", "AT"),
                  SNP2 = c("TT", "AT", "CC"),
                  SNP3 = c("GG", "AA", "AA"))


snp %>% mutate_at(-1, ~str_extract(.x, "^.")) %>%
  bind_rows(mutate_at(snp, -1, ~str_extract(.x, ".$"))) %>%
  arrange(sample)

  sample SNP1 SNP2 SNP3
1     s1    A    T    G
2     s1    A    T    G
3     s2    C    A    A
4     s2    C    T    A
5     s3    A    C    A
6     s3    T    C    A

Short explanation: first I extract the first letter (^.) of all columns except the first (hence -1). Then I extract the last letter (.$) of all columns except the first. Finally I paste these two data.frames together and rearrange the result using arrange.

Michael Tuchman · Answer 2 · 2019-10-17T16:25:43.557

Here's a spiffy data.table solution.

You may find this post to be instructive. stack overflow example illustrating tstrsplit

require(data.table)
df = data.table(sample=c("s1","s2","s3"),
   SNP1=c("AA","CC","AT"),
   SNP2=c("TT","AT","CC"),
   SNP3=c("GG","AA","AA"))


# get the names of the relevant columns

x=grep("SNP",names(df),value = TRUE)

# split each column.  The 'by' functionality will 
# manage the stacking of the results.

df[,lapply(.SD,tstrsplit,'',fixed=TRUE),.SDcols=x,by=sample]

output:

 |sample |SNP1 |SNP2 |SNP3 |
 |:------|:----|:----|:----|
 |s1     |A    |T    |G    |
 |s1     |A    |T    |G    |
 |s2     |C    |A    |A    |
 |s2     |C    |T    |A    |
 |s3     |A    |C    |A    |
 |s3     |T    |C    |A    |

If the strings were a fixed length but not necessarily of length 2, this still seems to work. I did NOT test this on NULL strings. caveat emptor

My original solution melted and recasted the data table. – Michael Tuchman Oct 17 '19 at 16:09 — Michael Tuchman, Oct 17 '19 at 16:09

Splitting data of a column and placing them into two consecutive rows for the same observation

2 Answers2