0

I have a large data table of patient data. I want to delete rows where "id" is duplicated without losing the information in the "date" column.

id  date
01  2004-07-01
02  NA
03  2013-11-15
03  2005-03-15
04  NA
05  2011-07-01
05  2012-07-01

I could do this one of two ways -

  1. create a column that writes over the date column values to concatenate all the dates for that ID, i.e.:

    id  date_new
    01  2004-07-01
    02  NA
    03  2013-11-15; 2005-03-15
    04  NA
    05  2011-07-01; 2012-07-01
    

or

  1. create one new column for each additional date, i.e.:

    id  date_new    date_new2
    01  2004-07-01  NA
    02  NA          NA
    03  2013-11-15  2005-03-15
    04  NA          NA
    05  2011-07-01  2012-07-01
    

I have tried a few things, but they keep crashing my R session (I get the message R Session Aborted. R encountered a fatal error. The session was terminated.):

setkey(DT, "id")
unique_DT <- subset(unique(DT))

and:

DT[!duplicated(DT[, "id", with = FALSE])]

However, besides crashing R, neither of these solutions does what I want with the dates.

Any ideas? I am new to data table (and R generally) but I have the vague sense that I could solve this with := somehow.

m0nhawk
  • 22,980
  • 9
  • 45
  • 73
epi_n00b
  • 150
  • 3
  • 15
  • Try `dt[,list(date_new=paste(date,collapse="; ")),by=id]`. – nicola Nov 02 '15 at 19:08
  • That does work, but how do I make it so that the output is appended to the larger data table rather than as a separate data.table? – epi_n00b Nov 02 '15 at 19:15
  • What do you mean by append? Just assigning the result of that line to `dt` isn't enough? – nicola Nov 02 '15 at 19:17
  • @nicola, it makes it so that `dt` is a data table with two columns, `id` and `date_new` . In my data table there are many other columns. I could just merge them by `id`, but I was wondering if there was a more efficient way to do it. – epi_n00b Nov 02 '15 at 19:23
  • You want to remove rows or not? If so, which value of the other columns you keep? Try for instance `dt[,c(list(date_new=paste(date,collapse="; ")),.SD),by=id]` and see if that is what you are looking for. If you want to keep the first row of the other columns, you may try `dt[,c(list(date_new=paste(date,collapse="; ")),.SD[1,]),by=id]`. – nicola Nov 02 '15 at 19:26
  • @nicola, The first solution works, although R doesn't recognize that the new column name was given in the command and it produces instead a column called `V1`. Of course this is not actually an issue because renaming the column is easy, I just thought I'd note it here. Thanks so much. – epi_n00b Nov 02 '15 at 19:53
  • Oh, you are right, the `date_new` was in the wrong place: `dt[,c(date_new=paste(date,collapse="; "),.SD),by=id]`. – nicola Nov 02 '15 at 20:06
  • @nicola Please post an answer. – David Arenburg Nov 02 '15 at 20:36

2 Answers2

2

Try this:

dt[,c(date_new=paste(date,collapse="; "),.SD),by=id]
nicola
  • 24,005
  • 3
  • 35
  • 56
0

You can use the aggregate function and it should do what you want. I was having some trouble with the dates switching to factors, but it seems like enclosing the date string with I() keeps it as a character.

id=c(1,2,3,3,4,5,5)
date = c("2004-07-01","NA","2013-11-15","2005-03-15","NA",
         "2011-07-01","2012-07-01")
data=as.data.frame(list(id=id,date=date))

data$date=as.character(data$date)

aggregate(list(date = I(data$date)),by=list(id = data$id),c)

  id                   date
1  1             2004-07-01
2  2                     NA
3  3 2013-11-15, 2005-03-15
4  4                     NA
5  5 2011-07-01, 2012-07-01

edit: used the aggregate function but used paste instead of c. Changing the collapse option to ";" should solve the separator problem

newdata = aggregate(list(date = I(data$date)),
                    by=list(id = data$id),
                    function(x){paste(unique(x),collapse=";")})
newdata


  id                  date
1  1            2004-07-01
2  2                    NA
3  3 2013-11-15;2005-03-15
4  4                    NA
5  5 2011-07-01;2012-07-01
s_scolary
  • 1,361
  • 10
  • 21
  • 3
    `data.table` is meant to replace functions such as `aggregate` in fact. – David Arenburg Nov 02 '15 at 19:52
  • Thank you. This works for the most part, except that the output looks like this when there are multiple dates: `c(NA, "2011-07-01", "2012-07-01")` instead of just `2011-07-01; 2012-07-01` – epi_n00b Nov 02 '15 at 19:55