I have a dataframe ("df") of two variables, x and y:
a <- 1:2000
b <- 1:4000
n <- 5000   # sample size (renamed from "c" to avoid masking base::c())
df <- data.frame(x = sample(a, n, replace = TRUE),
                 y = as.character(sample(b, n, replace = TRUE)))
df <- df[order(df$x), ]
head(df, 10)
        x    y
881     1 2919
4425    1 2000
2478    2 3375
4808    2 3928
4871    2 3351
4889    2 1634
1242    3 3957
1378    3 3356
3029    3 2625
3657    4  646
What I'd like to do now is reduce the dataframe to only contain distinct x variables (e.g., one "1", one "2", one "3", etc.) and have a new field that concatenates each distinct x's y values, separated by commas. The end result would look like this:
head(df3)
  x               multi_ys
1 1             2000, 2919
2 2 1634, 3351, 3375, 3928
3 3       2625, 3356, 3957
4 4              1092, 646
5 5   113, 2430, 3187, 932
6 7                   2349
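For reference, the target shape can also be produced in a single call with base R's aggregate(). A minimal sketch on toy data (the multi_ys column is renamed to match the output above; sort() reproduces the ascending order shown there):

```r
# Toy data mirroring the structure above (x numeric, y character)
df <- data.frame(x = c(1, 1, 2, 2, 3),
                 y = as.character(c(2919, 2000, 3375, 1634, 3957)),
                 stringsAsFactors = FALSE)

# aggregate() applies the paste to each x group's vector of y values
df3 <- aggregate(y ~ x, data = df,
                 FUN = function(v) paste(sort(v), collapse = ", "))
names(df3)[2] <- "multi_ys"
```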
I have a working solution, but I think it's more cumbersome than necessary. It's below; for a "df" dataframe of 5,000 rows, it takes about 12 seconds to create "df3".
library(reshape2)

# Duplicate y so it can serve as value.var in the dcast below
df$y2 <- df$y

# New dataframe containing only the first row for each unique x value
df2 <- df[!duplicated(df$x), ]

multi_y <- function(x) {
  c3.i <- df2[x, 1]
  c3.j <- df[df$x == c3.i, ]
  c3.k <- dcast(c3.j, x ~ y, value.var = "y2")
  cols <- colnames(c3.k)
  # If this x has more than one y value, concatenate all columns except the first
  if (ncol(c3.k) > 2) {
    c3.k$cycles <- apply(c3.k[, cols[-1]], 1, paste, collapse = ", ")
  } else {
    c3.k$cycles <- c3.k[, 2]
  }
  c3.l <- cbind(data.frame(c3.k[, 1]), data.frame(c3.k[, ncol(c3.k)]))
  colnames(c3.l) <- c("x", "multi_ys")
  print(c3.l)
}

t <- 1:nrow(df2)
system.time(df3 <- do.call("rbind", lapply(t, multi_y)))
My actual dataframe is over 80,000 rows long and I have to run this type of function 4 different times in my program.
I appreciate any advice you can provide to help me speed up this process.
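One direction I'm considering (a base-R sketch only, not a benchmarked answer): build all the groups at once with split() and paste each group in one vectorized pass, instead of calling dcast() once per distinct x:

```r
set.seed(42)
df <- data.frame(x = sample(1:2000, 5000, replace = TRUE),
                 y = as.character(sample(1:4000, 5000, replace = TRUE)),
                 stringsAsFactors = FALSE)

# split() groups every y by its x value in one pass
ys <- split(df$y, df$x)

# One paste() per group; no dcast() and no per-group data frames
df3 <- data.frame(x = as.integer(names(ys)),
                  multi_ys = vapply(ys,
                                    function(v) paste(sort(v), collapse = ", "),
                                    character(1)),
                  row.names = NULL,
                  stringsAsFactors = FALSE)
```

This avoids the per-iteration reshaping and printing entirely, so it should scale much better to 80,000 rows, though I haven't timed it against the dcast version.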