Remove duplicated rows by ID in R data.table, but add a new column with the concatenated dates from another column

Question

I have a large data table of patient data. I want to delete rows where "id" is duplicated without losing the information in the "date" column.

id  date
01  2004-07-01
02  NA
03  2013-11-15
03  2005-03-15
04  NA
05  2011-07-01
05  2012-07-01

I could do this one of two ways -

create a column that writes over the date column values to concatenate all the dates for that ID, i.e.:
```
id  date_new
01  2004-07-01
02  NA
03  2013-11-15; 2005-03-15
04  NA
05  2011-07-01; 2012-07-01
```

or

create one new column for each additional date, i.e.:

id  date_new    date_new2
01  2004-07-01  NA
02  NA          NA
03  2013-11-15  2005-03-15
04  NA          NA
05  2011-07-01  2012-07-01

I have tried a few things, but they keep crashing my R session (I get the message R Session Aborted. R encountered a fatal error. The session was terminated.):

setkey(DT, "id")
unique_DT <- subset(unique(DT))

and:

DT[!duplicated(DT[, "id", with = FALSE])]

However, besides crashing R, neither of these solutions does what I want with the dates.

Any ideas? I am new to data table (and R generally) but I have the vague sense that I could solve this with := somehow.

That does work, but how do I make it so that the output is appended to the larger data table rather than as a separate data.table? — epi_n00b, Nov 02 '15 at 19:15
What do you mean by append? Just assigning the result of that line to `dt` isn't enough? — nicola, Nov 02 '15 at 19:17
@nicola, it makes it so that `dt` is a data table with two columns, `id` and `date_new` . In my data table there are many other columns. I could just merge them by `id`, but I was wondering if there was a more efficient way to do it. — epi_n00b, Nov 02 '15 at 19:23
You want to remove rows or not? If so, which value of the other columns you keep? Try for instance `dt[,c(list(date_new=paste(date,collapse="; ")),.SD),by=id]` and see if that is what you are looking for. If you want to keep the first row of the other columns, you may try `dt[,c(list(date_new=paste(date,collapse="; ")),.SD[1,]),by=id]`. — nicola, Nov 02 '15 at 19:26
@nicola, The first solution works, although R doesn't recognize that the new column name was given in the command and it produces instead a column called `V1`. Of course this is not actually an issue because renaming the column is easy, I just thought I'd note it here. Thanks so much. — epi_n00b, Nov 02 '15 at 19:53
Oh, you are right, the `date_new` was in the wrong place: `dt[,c(date_new=paste(date,collapse="; "),.SD),by=id]`. — nicola, Nov 02 '15 at 20:06

score 2 · Accepted Answer · answered Nov 02 '15 at 21:02

2

Try this:

dt[,c(date_new=paste(date,collapse="; "),.SD),by=id]

answered Nov 02 '15 at 21:02

nicola

24,005
3
35
56

s_scolary · Answer 2 · 2015-11-02T20:17:10.967

You can use the aggregate function and it should do what you want. I was having some trouble with the dates switching to factors, but it seems like enclosing the date string with I() keeps it as a character.

id=c(1,2,3,3,4,5,5)
date = c("2004-07-01","NA","2013-11-15","2005-03-15","NA",
         "2011-07-01","2012-07-01")
data=as.data.frame(list(id=id,date=date))

data$date=as.character(data$date)

aggregate(list(date = I(data$date)),by=list(id = data$id),c)

  id                   date
1  1             2004-07-01
2  2                     NA
3  3 2013-11-15, 2005-03-15
4  4                     NA
5  5 2011-07-01, 2012-07-01

edit: used the aggregate function but used paste instead of c. Changing the collapse option to ";" should solve the separator problem

newdata = aggregate(list(date = I(data$date)),
                    by=list(id = data$id),
                    function(x){paste(unique(x),collapse=";")})
newdata


  id                  date
1  1            2004-07-01
2  2                    NA
3  3 2013-11-15;2005-03-15
4  4                    NA
5  5 2011-07-01;2012-07-01

`data.table` is meant to replace functions such as `aggregate` in fact. — David Arenburg, Nov 02 '15 at 19:52
Thank you. This works for the most part, except that the output looks like this when there are multiple dates: `c(NA, "2011-07-01", "2012-07-01")` instead of just `2011-07-01; 2012-07-01` — epi_n00b, Nov 02 '15 at 19:55

Remove duplicated rows by ID in R data.table, but add a new column with the concatenated dates from another column

2 Answers2

Linked