4

I have a data set of user stats of the following form:

df
   user       date   group
1    X 2017-06-21   S;Y;J
2    Y 2017-06-09 Y;F;P;C
3    R 2017-12-29     K;A
4    Q 2017-08-31     W;I
5    B 2018-01-30   P;M;E

Which can be generated with:

set.seed(10)
n = 5
dates <- seq.Date(as.Date("2017-04-01"), as.Date("2018-05-01"), by=1)
df <- data.frame(user = sample(LETTERS, n, replace=TRUE),
             date = sample(dates, n, replace=TRUE))

df$group <- "A"
for(i in 1:n){
df$group[i] <- paste(sample(LETTERS, sample(1:5, 1, replace=FALSE), 
                       replace=FALSE), collapse=";")
}

I want to split and expand the group column so that it matches the date and user given. For example, User X has interacted with three groups on 2017-06-21, which I'd like to have as three separate entries instead of one. I have code that works for this, but I'm looking for a faster, more R friendly way of replicating this. My current solution is:

# Get the number of groups for each entry
n_groups <- 1 + gsub("[^;]", "", df$group) %>% nchar()
# Get the index for the entries with multiple groups
index <- which(n_groups > 1)
# Get a new vector of dates and users
dates <- integer(sum(n_groups))
class(dates) <- "Date"
users <- vector(mode='character', 
                length = sum(n_groups))

k <- 1
for(i in 1:length(n_groups)){
  for(j in 1:n_groups[i]){
    dates[k] <- df$date[i]
    users[k] <- as.character(df$user[i])
    k <- k + 1
  }
}

df2 <- data.frame(date = dates, user = users,
                  group = unlist(strsplit(df$group, split = ";")))
df2
         date user group
1  2017-06-21    X     S
2  2017-06-21    X     Y
3  2017-06-21    X     J
4  2017-06-09    Y     Y
5  2017-06-09    Y     F
6  2017-06-09    Y     P
7  2017-06-09    Y     C
8  2017-12-29    R     K
9  2017-12-29    R     A
10 2017-08-31    Q     W
11 2017-08-31    Q     I
12 2018-01-30    B     P
13 2018-01-30    B     M
14 2018-01-30    B     E
hubbs5
  • 1,235
  • 1
  • 12
  • 22
  • 2
    An analog of matt-lundberg's base R answer in the linked post is `temp <- strsplit(df$group, split=";"); cbind(df[rep(seq_along(df$user), lengths(temp)),], "newgroup"=unlist(temp))`. This version keeps the original group vector which could be removed with `grep` or statically. – lmo Apr 14 '17 at 11:36

1 Answers1

4
library(dplyr)
library(tidyr)

df2 <- df %>% 
  mutate(group = strsplit(group, split = ";")) %>%
  unnest(group) %>%
  select(date, user, group)
www
  • 38,575
  • 12
  • 48
  • 84
  • 1
    Brilliant! Orders of magnitude faster! – hubbs5 Apr 14 '17 at 11:16
  • 6
    @hubbs5 But, it's not the fastest. I've benchmarked 7 different approaches with sample data of `n = 10000` rows. Fastest was `library(data.table); setDT(df)[, .(group = unlist(strsplit(as.character(group), ";", fixed = TRUE))), by = .(date, user)]` followed by `splitstackshape::cSplit(df, "group", ";", direction = "long")` and lmo's [solution](http://stackoverflow.com/questions/43409756/r-efficiently-separate-groups-in-data-frame#comment73880215_43409756). ycw's solution was 2.5 times slower than the fastest. Unfortunately, I can't post an answer with all details. – Uwe Apr 14 '17 at 23:57
  • 5
    @hubbs5 I've posted [benchmark results](http://stackoverflow.com/a/43431847/3817004) for the dupe target [_Split comma-separated column into separate rows_](http://stackoverflow.com/questions/13773770/split-comma-separated-column-into-separate-rows). For large problem sizes, `data.table` is magnitudes faster than `dplyr`/`tidyr`. – Uwe Apr 15 '17 at 22:34
  • 1
    @Uwe Block Thanks for your thorough benchmark comparison. – www Apr 15 '17 at 22:59