1

I'm working on a solution to clean up some data, but am not 100% sure what the best solution is. I've found a working solution, but would like to know if there was an easier approach (especially in trying to scale it). What I would like to do is separate all the elements in a dataframe (separated by semicolons), apply each of those components to an element of the dataframe, then combine the results into a new dataframe. Example below:

test <- data.frame(class=c("a1", "a2","a3","a4"),
        person=c("p1;p3;p4","p2;p4","p4;p5;p6","p1;p5"),
        stringsAsFactors = F)

test1 <- c()
test2 <- c()

for (i in 1:nrow(test)){
    test1 <-append(test1, strsplit(test[i,2],";")[[1]])
    test2 <- append(test2, rep(test[i,1],length(strsplit(test[i,2],";")[[1]])))
}
thelatemail
  • 91,185
  • 12
  • 128
  • 188
nathanbeagle
  • 47
  • 1
  • 5

3 Answers3

1

This slightly verbose oneliner will do it, if I understand exactly what you're trying to do:

do.call("rbind",apply(test, 1, function(x) expand.grid(x[1], unlist(strsplit(x[2], split=";")))))

   Var1 Var2
1    a1   p1
2    a1   p3
3    a1   p4
4    a2   p2
5    a2   p4
6    a3   p4
7    a3   p5
8    a3   p6
9    a4   p1
10   a4   p5

I am using expand.grid on each row of your original dataframe, including the first variable and the second variable split by ; as the inputs. As the result of apply is a list then I use do.call with rbind to put it back into a dataframe.

jalapic
  • 13,792
  • 8
  • 57
  • 87
1

We can use cSplit from splitstackshape (here i mentioned one package and used one and not included another) to split the 'person' column by the delimiter ; and specify the direction as long to reshape to 'long' format after the split.

library(splitstackshape)
cSplit(test, 'person', ';', 'long')
#    class person
# 1:    a1     p1
# 2:    a1     p3
# 3:    a1     p4
# 4:    a2     p2
# 5:    a2     p4
# 6:    a3     p4
# 7:    a3     p5
# 8:    a3     p6
# 9:    a4     p1
#10:    a4     p5

Or another one-liner from base R (without using any packages)

stack(setNames(strsplit(test$person, ";"), test$class))[2:1]
akrun
  • 874,273
  • 37
  • 540
  • 662
1

With tidyr:

library(dplyr)
library(tidyr)

         # separate person into multiple columns
test %>% separate(person, into = paste0('person', 1:5), fill = 'right') %>% 
    # gather from wide to long
    gather(key = id, value = person, -class, na.rm = TRUE) %>% 
    # clean up extra column
    select(-id)

#    class person
# 1     a1     p1
# 2     a2     p2
# 3     a3     p4
# 4     a4     p1
# 5     a1     p3
# 6     a2     p4
# 7     a3     p5
# 8     a4     p5
# 9     a1     p4
# 11    a3     p6
alistaire
  • 42,459
  • 4
  • 77
  • 117