split dataframe with multiple delimiters in R

Question

df1 <- 
     Gene             GeneLocus 
    CPA1|1357       chr7:130020290-130027948:+     
    GUCY2D|3000     chr17:7905988-7923658:+   
    UBC|7316        chr12:125396194-125399577:-            
    C11orf95|65998  chr11:63527365-63536113:-        
    ANKMY2|57037    chr7:16639413-16685398:-

expected output

df2 <- 
     Gene.1   Gene.2             chr     start     end 
    CPA1      1357               7     130020290 130027948   
    GUCY2D    3000               17      7905988   7923658  
    UBC       7316               12    125396194 125399577          
    C11orf95  65998              11     63527365  63536113     
    ANKMY2    57037               7     16639413  16685398

I tried it this way:

install.packages("splitstackshape")
library(splitstackshape)
df1 <- cSplit(df1,"Gene", sep="|", direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus",sep=":",direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus_2",sep="-",direction="wide", fixed=T)
df1 <- data.frame(df1)
df1$GeneLocus_1 <- gsub("chr","", df1$GeneLocus_1)

I would like to know whether there is a simpler alternative way to do this.

Looks quite straightforward to me. You could also have a look at `separate` from the `tidyr` package: http://stackoverflow.com/questions/27237835/how-to-strsplit-different-number-of-strings-in-certain-column-by-do-function-dp. — Paul Hiemstra, Sep 22 '15 at 13:43
You can use `DT <- cSplit(df1, 1:2, '[+|:-]', fixed=FALSE)[, c(1:2, 5:7), with=FALSE][, GeneLocus_1:=as.numeric(sub('[a-z]+', '', GeneLocus_1))]` and then change the column names. — akrun, Sep 22 '15 at 13:58

score 2 · Accepted Answer · answered Sep 22 '15 at 13:49

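Here you go. You can ignore the warning; it does not affect the output. By default, separate() splits on any run of non-alphanumeric characters, so the trailing strand marker (:+ or :-) is treated as a delimiter and the leftover piece is discarded. As a side effect, the strand information is removed.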

library(tidyr)
library(dplyr)
df1 %>%
  separate(Gene, c("Gene.1", "Gene.2")) %>%
  separate(GeneLocus, c("chr", "start", "end")) %>%
  mutate(chr = sub("chr", "", chr))

Output:

    Gene.1 Gene.2 chr     start       end
1     CPA1   1357   7 130020290 130027948
2   GUCY2D   3000  17   7905988   7923658
3      UBC   7316  12 125396194 125399577
4 C11orf95  65998  11  63527365  63536113
5   ANKMY2  57037   7  16639413  16685398
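
If the warning is unwanted, one possible variation (a sketch, not part of the original answer) is to pass extra = "drop" so the leftover piece is discarded silently, and convert = TRUE so the numeric columns are not left as character. The df1 below is a hand-built copy of the first three rows of the question's data, and the name `out` is only illustrative:

```r
library(tidyr)
library(dplyr)

# Hand-built copy of the first rows of the question's data (an assumption
# about how df1 is stored: two character columns).
df1 <- data.frame(
  Gene = c("CPA1|1357", "GUCY2D|3000", "UBC|7316"),
  GeneLocus = c("chr7:130020290-130027948:+",
                "chr17:7905988-7923658:+",
                "chr12:125396194-125399577:-"),
  stringsAsFactors = FALSE
)

out <- df1 %>%
  # Default separator: any run of non-alphanumeric characters.
  separate(Gene, c("Gene.1", "Gene.2"), convert = TRUE) %>%
  # extra = "drop" discards pieces beyond the three requested, without a warning.
  separate(GeneLocus, c("chr", "start", "end"),
           extra = "drop", convert = TRUE) %>%
  # Strip the leading "chr" prefix from the chromosome column.
  mutate(chr = sub("^chr", "", chr))
```

As in the answer above, the strand is still lost with this approach, because the default separator consumes it.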

Seems a perfect answer for what the OP wanted. But I would probably do `df1 %>% separate(Gene, c("Gene.1","Gene.2")) %>% separate(GeneLocus, c("chr","start","end","strand"), "[|:-]") %>% mutate(chr=sub("chr","",chr))%>%mutate(strand=ifelse(strand=="+","+","-"))` just to keep the strand info (as a genomics student). — Ananta, Sep 22 '15 at 13:59
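
For readers who want to keep the strand without extra packages, here is a minimal base R sketch. It is not from any of the answers; the pattern `locus_pat` and the helper names are illustrative, and it assumes every locus has the form chr<name>:<start>-<end>:<strand>. The df1 is a hand-built copy of the first three rows of the question's data:

```r
# Hand-built copy of the first rows of the question's data.
df1 <- data.frame(
  Gene = c("CPA1|1357", "GUCY2D|3000", "UBC|7316"),
  GeneLocus = c("chr7:130020290-130027948:+",
                "chr17:7905988-7923658:+",
                "chr12:125396194-125399577:-"),
  stringsAsFactors = FALSE
)

# Split Gene at the literal "|" and bind the pieces into a two-column matrix.
gene_parts <- do.call(rbind, strsplit(df1$Gene, "|", fixed = TRUE))

# Capture chromosome, start, end and strand with one regular expression.
# regmatches() + regexec() return the full match followed by each group.
locus_pat <- "^chr([^:]+):([0-9]+)-([0-9]+):([+-])$"
locus_parts <- do.call(
  rbind,
  regmatches(df1$GeneLocus, regexec(locus_pat, df1$GeneLocus))
)

df2 <- data.frame(
  Gene.1 = gene_parts[, 1],
  Gene.2 = as.integer(gene_parts[, 2]),
  chr    = locus_parts[, 2],              # column 1 is the full match
  start  = as.integer(locus_parts[, 3]),
  end    = as.integer(locus_parts[, 4]),
  strand = locus_parts[, 5],
  stringsAsFactors = FALSE
)
```

Rows whose locus does not match the pattern are dropped from `locus_parts`, which would make the column lengths disagree, so it is worth checking `nrow(locus_parts) == nrow(df1)` on real data.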

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2016-03-18T16:37:05.677

I would suggest something like the following approach:

1. Make a single delimiter in your "GeneLocus" column (and strip out the unnecessary parts while you're at it).
2. Split both columns at once. Note that cSplit "balances" the columns being split according to the number of output columns detected. The first column yields only 2 columns when split, but the second yields 4, so the first is padded out to 4 and you need to drop columns 3 and 4 from the result.

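library(splitstackshape)
library(data.table)  # for as.data.table() and :=

# mydf is the data frame from the question (called df1 there).
GLPat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
# Rewrite GeneLocus as "chr|start|end|strand", then split both columns on "|".
# Columns 3 and 4 are the empty padding added to balance "Gene"; drop them.
cSplit(as.data.table(mydf)[, GeneLocus := gsub(
  GLPat, "\\1|\\2|\\3|\\4", GeneLocus)], names(mydf), "|")[
    , (3:4) := NULL][]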
#      Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1:     CPA1   1357           7   130020290   130027948           +
# 2:   GUCY2D   3000          17     7905988     7923658           +
# 3:      UBC   7316          12   125396194   125399577           -
# 4: C11orf95  65998          11    63527365    63536113           -
# 5:   ANKMY2  57037           7    16639413    16685398           -
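
To see what the first step does in isolation, the following sketch applies the same pattern and replacement to two locus strings taken from the question. The variable names `x` and `normalized` are illustrative only:

```r
GLPat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"

x <- c("chr7:130020290-130027948:+", "chr12:125396194-125399577:-")

# Each capture group is written back, joined by the single delimiter "|".
normalized <- gsub(GLPat, "\\1|\\2|\\3|\\4", x)
normalized
# [1] "7|130020290|130027948|+"  "12|125396194|125399577|-"
```

Note that the pattern only matches numeric chromosome names. A string such as "chrX:100-200:+" does not match, so gsub returns it unchanged and it would then not split into the expected four pieces.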

Alternatively, you can try col_flatten from my "SOfun" package, with which you can do:

library(SOfun)

Pat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
Fun <- function(invec) strsplit(gsub(Pat, "\\1|\\2|\\3|\\4", invec), "|", TRUE)

col_flatten(as.data.table(mydf)[, lapply(.SD, Fun)], names(mydf), drop = TRUE)
#      Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1:     CPA1   1357           7   130020290   130027948           +
# 2:   GUCY2D   3000          17     7905988     7923658           +
# 3:      UBC   7316          12   125396194   125399577           -
# 4: C11orf95  65998          11    63527365    63536113           -
# 5:   ANKMY2  57037           7    16639413    16685398           -

SOfun is only on GitHub, so you can install it with:

source("http://news.mrdwab.com/install_github.R")
install_github("mrdwab/SOfun")
