I have a data frame in a very simple form:

    X Y
    ---
    A 1
    A 2
    B 3
    C 1
    C 3

My end result should be a list like this:

    $`A`
    [1] 1 2

    $`B`
    [1] 3

    $`C`
    [1] 1 3

For this operation I am using the split() function in R:

    k <- split(df$Y, df$X)

This works fine. However, applying this code to a data frame with 22 million rows, 10 million groups in X, and 387,000 distinct values in Y is very time-consuming. I tried RRO 8.0 (Revolution R Open) for its MKL support, but still only one core is used. The machine has 64 GB of RAM, so memory shouldn't be an issue.

Any ideas for a smarter way to compute this?

Daniel Schultz
  • Couldn't these operations be carried out using `data.table` or `dplyr`? – akrun Dec 04 '14 at 15:34
  • I tried using `dplyr` but couldn't figure out a way to do it. In any case wouldn't `dplyr` return a data frame? I think a list structure would be more comfortable for post processing. – Daniel Schultz Dec 04 '14 at 15:37
  • You can return a column as a list in `data.table` and also in `dplyr` (with `do`). – akrun Dec 04 '14 at 15:38
  • I would be happy to use `dplyr`. In an attempt I used the `group_by` and `summarise` functions, but couldn't figure out the best way to do it. – Daniel Schultz Dec 04 '14 at 15:41
  • What are you planning to do with the list? Maybe you can avoid creating it altogether. – talat Dec 04 '14 at 15:42
  • I want to perform a market basket analysis using the `apriori` function from the `arules` package. – Daniel Schultz Dec 04 '14 at 15:50
  • `?apriori` takes as input an "object of class transactions or any data structure which can be coerced into transactions (e.g., a binary matrix or data.frame)." So you could probably work without conversion to list. However, I have no experience with that package. – talat Dec 04 '14 at 15:55
  • Good luck. We all share the same issue with `split` – Rich Scriven Dec 04 '14 at 15:58
  • @beginneR I could actually try, indeed. – Daniel Schultz Dec 05 '14 at 12:14
  • @beginneR I tried passing the data.frame directly on a smaller test data frame, since `apriori` doesn't need a list. However, the results don't match. With the data frame, the `as` command treats each row as a transaction, whereas with the list each group is a transaction. The second is what I want. I would be happy to get rid of the list to save time, but I don't see how. The first tests weren't promising. – Daniel Schultz Dec 05 '14 at 13:30

2 Answers


Try

    library(data.table)
    DT <- as.data.table(df)
    DT1 <- DT[, list(Y = list(Y)), by = X]
    DT1$Y
    #[[1]]
    #[1] 1 2

    #[[2]]
    #[1] 3

    #[[3]]
    #[1] 1 3

Or using dplyr

    library(dplyr)
    df1 <- df %>%
        group_by(X) %>%
        do(Y = c(.$Y))

    df1$Y
    #[[1]]
    #[1] 1 2

    #[[2]]
    #[1] 3

    #[[3]]
    #[1] 1 3

data

    df <- structure(list(X = c("A", "A", "B", "C", "C"), Y = c(1L, 2L,
    3L, 1L, 3L)), .Names = c("X", "Y"), class = "data.frame", row.names = c(NA,
    -5L))
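
If a named list like the `split()` output is needed, the group keys can be attached to the list column afterwards. This is only a minimal sketch on the example data; the name `res` is made up for illustration, and the `setNames` step is an addition that is not part of the code above.

```r
library(data.table)

# Example data, same values as the `df` in this answer
df <- data.frame(X = c("A", "A", "B", "C", "C"),
                 Y = c(1L, 2L, 3L, 1L, 3L),
                 stringsAsFactors = FALSE)

DT  <- as.data.table(df)
DT1 <- DT[, list(Y = list(Y)), by = X]

# Attach the group keys as names so the result can be indexed like split() output
res <- setNames(DT1$Y, DT1$X)
res[["C"]]
```

Note that `data.table` returns groups in order of first appearance, while `split()` orders them by factor level, so on real data the element order may differ even when the contents agree.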
akrun
  • I'm a dplyr fan, but unfortunately I've noticed that `dplyr::do` can be much slower than the rest of dplyr. – talat Dec 04 '14 at 15:53
  • @beginneR Thanks for the comment. I think `dplyr` is not intended (at present) for list operations. – akrun Dec 04 '14 at 15:54
  • I will try using `data.table` then. I haven't worked with the package yet, since I thought `dplyr` and `data.table` were interchangeable performance-wise. I've liked the coding in `dplyr` better so far. – Daniel Schultz Dec 04 '14 at 15:58
  • @DanielSchultz From some of the benchmarks, I think for big datasets, `data.table` is faster. – akrun Dec 04 '14 at 16:03
  • I am trying the `data.table` code now. The `dplyr` code pushed the time down to 3 hours. Let's see what `data.table` can do. – Daniel Schultz Dec 04 '14 at 16:19
  • @DanielSchultz Thanks, that will be some benchmarking – akrun Dec 04 '14 at 16:19
  • @akrun So I tried the `data.table` code. It runs without an error, but no new data frame appears. When I run it on a test data set containing a 200-row sub-sample, everything works as it should. Since I don't get an error message and the code works on a smaller data table, I don't know what to change. – Daniel Schultz Dec 05 '14 at 13:39
  • @DanielSchultz Have you assigned it to a new object just like I showed? I am not sure what is happening there. As data.table is designed for bigger datasets, this looks odd. – akrun Dec 05 '14 at 14:40
  • @akrun I discovered another oddity: my variable in column 2 is a factor, not numeric. I thought that shouldn't be a problem. However, in the sub-sample with 200 rows, all column 2 vectors are the same. There is no variance. I cannot imagine how that happens. The original data table looks fine, though. – Daniel Schultz Dec 05 '14 at 14:53
  • @DanielSchultz When I tested this on the example `df` it appears to be `factor` (after I changed Y to factor). `str(DT1) Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables: $ X: chr "A" "B" "C" $ Y:List of 3 ..$ : Factor w/ 3 levels "1","2","3": 1 2 ..$ : Factor w/ 3 levels "1","2","3": 3 ..$ : Factor w/ 3 levels "1","2","3": 1 3` – akrun Dec 05 '14 at 14:56
  • @DanielSchultz I am using R 3.1.2. and `data.table_1.9.5` – akrun Dec 05 '14 at 14:57
  • @akrun I am using the same versions – Daniel Schultz Dec 05 '14 at 15:04
  • @DanielSchultz So, I guess this type of oddity occurs only on the full dataset, and not on any of the subsets, right? – akrun Dec 05 '14 at 15:05
  • @akrun If I recode Y to factor in the minimal example I can actually recreate the error: `> DT1 X Y 1: A 1,3 2: B 1,3 3: C 1,3` That is clearly not what we intended ;) – Daniel Schultz Dec 05 '14 at 15:06
  • @akrun I don't get any results for the full data set. The oddity occurs in the sub sample and in the minimal example I created for stackoverflow. – Daniel Schultz Dec 05 '14 at 15:08
  • @DanielSchultz I couldn't get that error somehow. I am using the devel version. – akrun Dec 05 '14 at 15:08
  • @DanielSchultz Maybe you can contact the authors directly or post a new question about this oddity so that more people will look into it. – akrun Dec 05 '14 at 15:10
  • @DanielSchultz I would really like to help you with this one, but without reproducing it, I can't see where the problem is. – akrun Dec 05 '14 at 15:12
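
The cause of the repeated values was not established in the comments above. One way to guard against it is to compare the grouped result with `split()` on a small sample before running the full data set. The sketch below also converts the factor to character inside the grouping call, on the unconfirmed assumption that the factor column is what triggers the oddity; the names `grouped`, `reference`, and `all_match` are made up for illustration.

```r
library(data.table)

# Small example with Y stored as a factor, as in the comments above
df <- data.frame(X = c("A", "A", "B", "C", "C"),
                 Y = factor(c(1, 2, 3, 1, 3)),
                 stringsAsFactors = FALSE)

DT <- as.data.table(df)

# as.character() builds a fresh vector for every group
# (assumed workaround, not confirmed as a fix for the reported behaviour)
DT1 <- DT[, list(Y = list(as.character(Y))), by = X]
grouped <- setNames(DT1$Y, DT1$X)

# Reference result from base R on the same data
reference <- split(as.character(df$Y), df$X)

# TRUE when every group holds the same values as the split() result
all_match <- isTRUE(all.equal(grouped[names(reference)], reference))
all_match
```

If `all_match` is `FALSE` on a sample, the grouped list should not be trusted for the full data either.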

I found an elegant solution using similar code from dplyr and/or data.table. I searched for "concatenate groups in R" and found this post:

Efficiently concate character content within one column, by group in R

And actually, it works quite nicely with

    library(data.table)
    library(dplyr)

    # 26 million random letters, with the 26 group labels recycled down the column
    dt = data.table(content = sample(letters, 26e6, TRUE), groups = LETTERS)
    df = as.data.frame(dt)

    system.time(dt[, paste(content, collapse = " "), by = groups])
    #   user  system elapsed
    #   5.37    0.06    5.65

    system.time(df %>% group_by(groups) %>% summarise(paste(content, collapse = " ")))
    #   user  system elapsed
    #   7.10    0.13    7.67

Thanks for all your help
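
The `paste` approach gives one string per group rather than a list. If the list form is still needed afterwards, the strings can be split again. This is a rough sketch on a toy table, assuming the items themselves contain no spaces; the names `collapsed` and `baskets` are made up for illustration.

```r
library(data.table)

# Toy data in the same shape as the benchmark above
dt <- data.table(groups  = c("A", "A", "B", "C", "C"),
                 content = c("1", "2", "3", "1", "3"))

# One space-separated string per group
collapsed <- dt[, .(items = paste(content, collapse = " ")), by = groups]

# Split the strings back into character vectors and name them by group
baskets <- setNames(strsplit(collapsed$items, " ", fixed = TRUE),
                    collapsed$groups)
baskets[["A"]]
```

The items come back as character vectors, so numeric values would need an explicit conversion.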

Daniel Schultz