Split rows in data.frame based on symbol

Question

I would like to split rows that has strings separated by / or ab into new rows. Please see input and output examples below. Maybe we could use strsplit(d$values,split='/', fixed=TRUE), but how do you make them into new rows?

head

> head(d,3)
                    values   ind
54                 miR-197  9846
55                 miR-197  9846
113 miR-221/222/222ab/1928 56204

dput

    structure(list(values = c("miR-197", "miR-197", "miR-221/222/222ab/1928"
), ind = structure(c(4L, 4L, 6L), .Label = c("6482", "4057", 
"60481", "9846", "7414", "56204", "84957", "29924", "6095", "8301", 
"2355", "88455", "23047", "57590", "5829", "162", "3091", "9766", 
"23406", "3646", "22870", "22898", "8775", "8178", "2077", "64115", 
"6158", "5007", "8567", "10019", "26127", "4739", "6678", "27013", 
"6146", "51060", "1997", "847", "4035", "79026", "8192", "5782", 
"1032", "4354", "5791", "2752", "9873", "6386", "5962", "2230", 
"6938", "6727", "7090", "92912", "55784", "409", "23521", "6279", 
"51312", "7357", "2040", "2934", "9219", "2180", "219333", "114908", 
"50807", "90268", "3098", "1974", "56990", "7791", "162989", 
"9159", "7086", "51762", "9318", "23582", "10632", "54815", "1938", 
"3576", "11214", "167227", "156", "2745", "6138", "391", "10933", 
"5501", "3638", "2316", "2869", "10527", "255809"), class = "factor")), row.names = c(54L, 
55L, 113L), class = "data.frame")

expected output

values   ind
miR-197  9846
miR-197  9846
miR-221 56204
222a    56204
222b    56204
1928    56204

oops..I think I marked it as duplicate in a hurry. What happens to this part `222/222ab` ? What is the logic of splitting there? — Ronak Shah, Oct 15 '18 at 09:56
Probably isn't such a tidy function for it, you can use `grep` and copy the rows with "ab", then use `gsub` to change the copy to "b", then change the original to "a", `rbind` then `order`. — Aaron Hayman, Oct 15 '18 at 09:58
The logic is some biological genes that are concatenate as "ab" to save space in the data.frame. They should be 222a and 222b — user2300940, Oct 15 '18 at 10:00
Ok. Reopened the post. `separate_rows(df, values, sep = "\\/|a|b")` is another attempt but it is not what your expected output is. — Ronak Shah, Oct 15 '18 at 10:04

score 2 · Accepted Answer · answered Oct 15 '18 at 10:25

Shamelessly stealing Ronak Shah's initial solution. Then splitting the 'ab'.

This solution is not generally applicable to all letters, so attention and modification would need to be used if trying to scale this to a larger problem with letters beyond "a" and "b".

First we add an index column to help with ordering, which will be removed at the end.

df <- tidyr::separate_rows(df, values, sep = "\\/")
df$index <- seq_len(nrow(df))
df_ab <- df[grep('ab',df$values),]
df$values <- gsub("ab","a",df$values)
df_ab$values <- gsub("ab","b",df_ab$values)
df <- rbind(df,df_ab)
df <- df[order(df$index),]
df$index <- NULL

##     values   ind
## 1  miR-197  9846
## 2  miR-197  9846
## 3  miR-221 56204
## 4      222 56204
## 5     222a 56204
## 51    222b 56204
## 6     1928 56204

Split rows in data.frame based on symbol

1 Answers1