2

I am facing a tricky question and would be glad to have some help.

I have a data frame with an ID name taking different structures. Something like this following :

ID
bbb-5p/mi-98/6134
abb-4p
bbb-5p/mi-98

Every time I have this "/" I would like to duplicate the row. Each row should be duplicated the number of time we find this "/". Then the name of the duplicated row should be the root + the characters right after the "/".

For exemple this :

ID
bbb-5p/mi-98/6134

should give :

ID
bbb-5p
bbb-5p-mi-98
bbb-5p-6134

Also my initial data frame have 5 variables :

  [ID, varA, varB, varC, varD]

And every time I have this "/" I would like to duplicate the entire row. Then I am expecting to have a new data frame with something like

  newID         newvarA  newvarB  newvarC  newvarD   
  bbb-5p        varA(1)  varB(1)  varC(1)  varD(1)
  bbb-5p-mi-98  varA(1)  varB(1)  varC(1)  varD(1)
  bbb-5p-6134   varA(1)  varB(1)  varC(1)  varD(1)
  abb-4p        varA(2)  varB(2)  varC(2)  varD(2)
  bbb-5p        varA(3)  varB(3)  varC(3)  varD(3)
  bbb-5p-mi-98  varA(3)  varB(3)  varC(3)  varD(3)

Any idea? Thank you in advance

Peter

Peter
  • 133
  • 1
  • 11
  • 1
    This is a duplicate question, see http://stackoverflow.com/questions/30818840/in-r-get-multiple-rows-by-splitting-a-column-using-tidyr-and-reshape2 and http://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows – Iaroslav Domin Dec 13 '16 at 14:05
  • @akrun this is not an exact dupe, the question here involves more than just splitting strings and stacking them vertically.. – mtoto Dec 13 '16 at 14:26
  • @mtoto Ohh, okay, reopened it. I thought the two links provided have some links to this one. Sorry – akrun Dec 13 '16 at 14:29

3 Answers3

1

You can accomplish this in base R, using lapply() with a custom function. First, you split your character column on "/", resulting in a list of vectors:

l <- strsplit(df$ID,"/")

Then you apply a user defined function to each element of l using lapply():

l_stacked <- lapply(l, function(x) 
          if(length(x) > 1) { 
          c(x[1], paste0(x[1],"-",x[-1])) } 
          else { x })

The function first checks whether the vector has a length > 1. If so, it concatenates all elements with the first element, separated by "-". If length <= 1, it means the string didn't contain "/", hence it is returned as is. Finally we flatten our output using unlist() to be able to convert to data.frame.

data.frame(ID = unlist(l_stacked))
#            ID
#1       bbb-5p
#2 bbb-5p-mi-98
#3  bbb-5p-6134
#4       abb-4p
#5       bbb-5p
#6 bbb-5p-mi-98
mtoto
  • 23,919
  • 4
  • 58
  • 71
  • Thank you for your answer ! Sounds like this code do the job... However my point is how to link this new ID with the others variables of my data.frame? – Peter Dec 14 '16 at 13:13
  • can you elaborate please? this is not stated in your original question. – mtoto Dec 14 '16 at 13:15
  • Let's say my initial data frame have 5 variables [ID, varA, varB, varC, varD]. What I mean is, every time I have this "/" I would like to duplicate the row. But keep the entire row. Then I am expecting to have a new data frame with something like [newID(1), varA(1), varB(1), varC(1), varD(1)] ..... [newID(n), varA(n), varB(n), varC(n), varD(n)]... Is that more clear? – Peter Dec 14 '16 at 13:33
0

One way to achieve this is the following:

library(dplyr)
library(tidyr)
res <- df %>% mutate(i=row_number(),
                     ID = strsplit(ID,split='/')) %>%
              unnest() %>% 
              group_by(i) %>%
              mutate(ID=ifelse(ID==first(ID),first(ID),paste(first(ID),ID,sep='-'))) %>%
              ungroup() %>% select(-i)
### A tibble: 6 x 1
##            ID
##         <chr>
##1       bbb-5p
##2 bbb-5p-mi-98
##3  bbb-5p-6134
##4       abb-4p
##5       bbb-5p
##6 bbb-5p-mi-98

Notes:

  1. First, create an indexing column i to group by later so that we can group each "root".
  2. Use strsplit to split each row by "|".
  3. tidyr::unnest the result to separate rows.
  4. group_by the created index i and then if the row is the first row, just return the root; otherwise, paste to prepend the root to the row with separator "-".
  5. Finally, ungroup and remove the created index column i.

Data

df <- structure(list(ID = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"
)), .Names = "ID", row.names = c(NA, -3L), class = "data.frame")
                 ID
1 bbb-5p/mi-98/6134
2            abb-4p
3      bbb-5p/mi-98
aichao
  • 7,375
  • 3
  • 16
  • 18
0

Here is one option using data.table. Convert the 'data.frame' to 'data.table' (setDT(df1, ..)) and create a column of rownames, grouped by 'rn', split the 'ID' by /, loop through the sequence of rows, paste the split elements based on the index.

library(splitstackshape)
library(data.table)
setDT(df1, keep.rownames=TRUE)[, unlist(strsplit(ID, "/")), 
         by = rn][, .(ID=sapply(seq_len(.N), function(i) 
             paste(V1[unique(c(1,i))], collapse="-"))) , rn]

Or an option with dplyr/tidyr/tibble. Create the rownames column with tibble::rownames_to_column, separate the rows into long format with separate_rows, grouped by 'rn', we mutate the 'ID' by pasteing the elements based on the condition of length and remove the 'rn' column.

library(dplyr)
library(tidyr)
library(tidyr)
rownames_to_column(df1, var = "rn") %>% 
         separate_rows(ID, sep="/") %>%
         group_by(rn)  %>%
         mutate(ID = if(n()>1) c(ID[1], paste(ID[1], ID[-1], sep="-")) else ID) %>%
         ungroup() %>%
         select(-rn)
 #         ID
 #        <chr>
 #1       bbb-5p
 #2 bbb-5p-mi-98
 #3  bbb-5p-6134
 #4       abb-4p
 #5       bbb-5p 
 #6 bbb-5p-mi-98
akrun
  • 874,273
  • 37
  • 540
  • 662