1

I have a data.table and need to do some cleaning for multiple values that should be unique -- a simplified example should help:

> DT 
  id type
1: 1    A
2: 2    X
3: 3    X
4: 3    G
5: 3    NA
6: 4    D
7: 5    NA

The issue I'm having is multiple values for "type" for the same ID, or:

> DT[id == 3]
   id type
1: 3    X
2: 3    G
3: 3    NA

The source of the multiple values is irrelevant, but I'd like to clean this in a specific manner: take the last reported value unless it's an NA. So the cleaning should result in a single row per ID, and in the example case would look like:

> DTclean
  id type
1: 1    A
2: 2    X
3: 3    G
4: 4    D
5: 5    NA

My current approach is to sort DT by type, so that all the NA's are first, and then use duplicated -- this is acceptable, but I feel there is a better method, plus though not crucial, this doesn't always take the last reported value -- in the above case it takes X instead of G.

This is my current approach:

> setorder(DT, type)
> DTclean <- DT[!duplicated(id, fromLast = T)]
> DTclean
  id type
1: 1    A
2: 2    X
3: 3    X
4: 4    D
5: 5    NA

Any ideas/help would be much appreciated!

daRknight
  • 253
  • 3
  • 17

2 Answers2

1

A slight variation on your idea should work. The concept is that you want a separate column indicating only whether or not "type" is NA, not the value, and sort by that column and then the original ID:

> DT$typena<-is.na(DT$type)
> setorderv(DT,c('typena','id'),order=c(-1,1))
> DT[!duplicated(id,fromLast=T)]
   id type typena
1:  5   NA   TRUE
2:  1    A  FALSE
3:  2    X  FALSE
4:  3    G  FALSE
5:  4    D  FALSE

You can then remove the extra column again if you don't want it in the way by assigning NULL to it, or do it all in one step:

DT[!duplicated(id,fromLast=T),c('id','type'),with=F]

Here is a dput of the sample data above:

DT <- structure(list(id = c(1L, 2L, 3L, 3L, 3L, 4L, 5L), type = c("A", 
      "X", "X", "G", NA, "D", NA)), .Names = c("id", "type"), 
      row.names =  c(NA, -7L), class = c("data.table", "data.frame"))
daRknight
  • 253
  • 3
  • 17
1

I got a slightly different approach, which also yields the desired output. You just need two additional columns, but it should be straightforward. The idea is based on the Row_Number() functions available in T-SQL.

  library(data.table)
  id <- as.integer(c(1,2,3,3,3,4,5))
  type <- as.character(c("A", "X", "X", "G", NA, "D", NA))

  DT <- data.table(id,type)

  DT[, Index := 1:.N]
  DT[,idRank:=rank(Index), by = c("id" )][idRank == 1, .(id, type)]
hannes101
  • 2,410
  • 1
  • 17
  • 40
  • This does indeed work, and I like the use of `rank` -- voted the other answer as it only required a single additional column. Thanks for your input, Hannes, I appreciate the different approaches available – daRknight Nov 01 '16 at 21:43