I have a data.table with more than 130 000 rows.
I wanted to group two cols : dates
and progress
by a variable id
and put the values in a vector so I used aggregate()
.
df_agr <- aggregate(cbind(progress, dates) ~ id, data = df_test, FUN = c)
However it takes around 52 seconds to aggregate the data + I lose the date format from the col dates
.
An example of the dataframe :
id dates progress
1: 3505H6856 2003-07-10 yes
2: 3505H6856 2003-08-21 yes
3: 3505H6856 2003-09-04 yes
4: 3505H6856 2003-10-16 yes
5: 3505H67158 2003-01-14 yes
6: 3505H67158 2003-02-18 yes
7: 3505H67862 2003-03-06 yes
8: 3505H62168 2003-04-24 no
9: 3505H62168 2003-05-15 yes
10: 3505H65277 2003-02-11 yes
The result I get :
id progress dates
1 3505H62168 1, 2 5, 6
2 3505H65277 2 2
3 3505H67158 2, 2 1, 3
4 3505H67862 2 4
5 3505H6856 2, 2, 2, 2 7, 8, 9, 10
I was quite surprised to see that everything is converting into an integer
+ each row who seems to contain "independent" vectors are, in fact, vectors from a list :
'data.frame': 5 obs. of 3 variables:
$ id : chr "3505H62168" "3505H65277" "3505H67158" "3505H67862" ...
$ progress:List of 5
..$ 1: int 1 2
..$ 2: int 2
..$ 3: int 2 2
..$ 4: int 2
..$ 5: int 2 2 2 2
$ dates :List of 5
..$ 1: int 5 6
..$ 2: int 2
..$ 3: int 1 3
..$ 4: int 4
..$ 5: int 7 8 9 10
I tried to convert back the dates in the right format with :
lapply(df_agr$dates, function(x) as.Date(x, origin="1970-01-01"))
but I got :
$`1`
[1] "1970-01-06" "1970-01-07"
$`2`
[1] "1970-01-03"
$`3`
[1] "1970-01-02" "1970-01-04"
$`4`
[1] "1970-01-05"
$`5`
[1] "1970-01-08" "1970-01-09" "1970-01-10" "1970-01-11"
So it seems the origin is not "1970-01-01"
as it's written in the documentation, maybe the lowest date from the data ?
So my question is : how to get the same result I got with aggregate()
with data.table
while keeping the date format ?
So it means how to group by unique id with data.table. I tried :
setDT(df)[,list(col1 = c(progress), col2 = c(dates)), by = .(unique(id))]
But of course I got the followed error :
Error in
[.data.table
(df, , list(col1 = c(progress), col2 = c(dates)), : The items in the 'by' or 'keyby' list are length (5). Each must be same length as rows in x or number of rows returned by i (10).
Data :
structure(list(id = c("3505H6856", "3505H6856", "3505H6856",
"3505H6856", "3505H67158", "3505H67158", "3505H67862", "3505H62168",
"3505H62168", "3505H65277"), dates = structure(c(12243, 12285,
12299, 12341, 12066, 12101, 12117, 12166, 12187, 12094), class = "Date"),
progress = c("yes", "yes", "yes", "yes", "yes", "yes", "yes",
"no", "yes", "yes")), .Names = c("id", "dates", "progress"
), class = c("data.frame"), row.names = c(NA, -10L
))