2

I have a data.table with more than 130 000 rows.

I wanted to group two cols : dates and progress by a variable id and put the values in a vector so I used aggregate().

df_agr <- aggregate(cbind(progress, dates) ~ id, data = df_test, FUN = c)

However it takes around 52 seconds to aggregate the data + I lose the date format from the col dates.

An example of the dataframe :

            id      dates progress
 1:  3505H6856 2003-07-10      yes
 2:  3505H6856 2003-08-21      yes
 3:  3505H6856 2003-09-04      yes
 4:  3505H6856 2003-10-16      yes
 5: 3505H67158 2003-01-14      yes
 6: 3505H67158 2003-02-18      yes
 7: 3505H67862 2003-03-06      yes
 8: 3505H62168 2003-04-24       no
 9: 3505H62168 2003-05-15      yes
10: 3505H65277 2003-02-11      yes

The result I get :

          id   progress       dates
1 3505H62168       1, 2        5, 6
2 3505H65277          2           2
3 3505H67158       2, 2        1, 3
4 3505H67862          2           4
5  3505H6856 2, 2, 2, 2        7, 8, 9, 10

I was quite surprised to see that everything is converting into an integer + each row who seems to contain "independent" vectors are, in fact, vectors from a list :

'data.frame':   5 obs. of  3 variables:
 $ id      : chr  "3505H62168" "3505H65277" "3505H67158" "3505H67862" ...
 $ progress:List of 5
  ..$ 1: int  1 2
  ..$ 2: int 2
  ..$ 3: int  2 2
  ..$ 4: int 2
  ..$ 5: int  2 2 2 2
 $ dates   :List of 5
  ..$ 1: int  5 6
  ..$ 2: int 2
  ..$ 3: int  1 3
  ..$ 4: int 4
  ..$ 5: int  7 8 9 10

I tried to convert back the dates in the right format with :

lapply(df_agr$dates, function(x) as.Date(x, origin="1970-01-01"))

but I got :

$`1`
[1] "1970-01-06" "1970-01-07"

$`2`
[1] "1970-01-03"

$`3`
[1] "1970-01-02" "1970-01-04"

$`4`
[1] "1970-01-05"

$`5`
[1] "1970-01-08" "1970-01-09" "1970-01-10" "1970-01-11"

So it seems the origin is not "1970-01-01" as it's written in the documentation, maybe the lowest date from the data ?

So my question is : how to get the same result I got with aggregate() with data.table while keeping the date format ?

So it means how to group by unique id with data.table. I tried :

setDT(df)[,list(col1 = c(progress), col2 = c(dates)), by = .(unique(id))]

But of course I got the followed error :

Error in [.data.table(df, , list(col1 = c(progress), col2 = c(dates)), : The items in the 'by' or 'keyby' list are length (5). Each must be same length as rows in x or number of rows returned by i (10).

Data :

structure(list(id = c("3505H6856", "3505H6856", "3505H6856", 
"3505H6856", "3505H67158", "3505H67158", "3505H67862", "3505H62168", 
"3505H62168", "3505H65277"), dates = structure(c(12243, 12285, 
12299, 12341, 12066, 12101, 12117, 12166, 12187, 12094), class = "Date"), 
    progress = c("yes", "yes", "yes", "yes", "yes", "yes", "yes", 
    "no", "yes", "yes")), .Names = c("id", "dates", "progress"
), class = c("data.frame"), row.names = c(NA, -10L
))
Omlere
  • 263
  • 1
  • 2
  • 10

2 Answers2

2

You can use paste0 I think as below, you need to change the date to character so that it doesn't coverted to its numeric counterpart, running below query without converting dates to numeric will give you values like, 12166, 12187. In your query you are also using "c" as to combine the objects, however we should use paste to combine, also in data.table when you use .(id) in by it should give you unique values on by items unless your query have something which is not making things unique for example in this case if you avoid the collapse statement you won't get the unique keys on ID, I hope this is helpful. Thanks:

df_agr <- aggregate(cbind(progress, as.character(dates)) ~ id, data = df, FUN = paste0)

> df_agr
          id           progress                                             V2
1 3505H62168            no, yes                         2003-04-24, 2003-05-15
2 3505H65277                yes                                     2003-02-11
3 3505H67158           yes, yes                         2003-01-14, 2003-02-18
4 3505H67862                yes                                     2003-03-06
5  3505H6856 yes, yes, yes, yes 2003-07-10, 2003-08-21, 2003-09-04, 2003-10-16
>

Using data.table:

setDT(df)[,.(paste0(progress,collapse=","), paste0(as.character(dates),collapse=",")), by = .(id)]


           id              V1                                          V2
1:  3505H6856 yes,yes,yes,yes 2003-07-10,2003-08-21,2003-09-04,2003-10-16
2: 3505H67158         yes,yes                       2003-01-14,2003-02-18
3: 3505H67862             yes                                  2003-03-06
4: 3505H62168          no,yes                       2003-04-24,2003-05-15
5: 3505H65277             yes                                  2003-02-11

OR just pointed out by David Arenberg, much easier way in data.table is, Thanks for valuable comments:

setDT(df)[, lapply(.SD, toString), by = id]
PKumar
  • 10,971
  • 6
  • 37
  • 52
  • 1
    Thank you so much. It works perfectly it only takes around 2-3 seconds with `data.table` instead of 52 seconds with `aggregate` – Omlere May 10 '17 at 11:14
  • 1
    @DavidArenburg Thanks for your feedback and comments , added in the solution. – PKumar May 10 '17 at 12:45
0

A dplyr version.

library(dplyr)
df %>% 
   group_by(id) %>%
   summarize (progress = paste(progress, collapse=","),
              dates = paste(dates, collapse=",") )

#          id        progress                                       dates
#        <chr>           <chr>                                       <chr>
# 1 3505H62168          no,yes                       2003-04-24,2003-05-15
# 2 3505H65277             yes                                  2003-02-11
# 3 3505H67158         yes,yes                       2003-01-14,2003-02-18
# 4 3505H67862             yes                                  2003-03-06
# 5  3505H6856 yes,yes,yes,yes 2003-07-10,2003-08-21,2003-09-04,2003-10-16
Andrew Lavers
  • 4,328
  • 1
  • 12
  • 19