0

I want to subset multiple data tables to records that correspond to unique IDs. Similar threads are here and here; these solutions work if I program each data table separately, but I would like to run the code on all of the data tables I have.

My data looks something like this:

dt1 <- data.table(var1 = c("A", "B", "C"), var2 = c("...", "...", "..."), id = c("101", "105", "105"))
dt2 <- data.table(var1 = c("D", "E", "F"), var2 = c("...", "...", "..."), var3 = c("...", "...", "..."), id = c("110", "110", "139"))
dt3 <- data.table(var1 = c("B", "G", "H"), id = c("105", "113", "113"))

> dt1
   var1 var2  id
1:    A  ... 101
2:    B  ... 105
3:    C  ... 105
> dt2
   var1 var2 var3  id
1:    D  ...  ... 110
2:    E  ...  ... 110
3:    F  ...  ... 139
> dt3
   var1  id
1:    B 105
2:    G 113
3:    H 113

and I would like to remove all duplicate IDs so that each data table would eventually be this:

> dt1
   var1 var2  id
1:    A  ... 101
> dt2
   var1 var2 var3  id
3:    F  ...  ... 139
> dt3
   var1  id
1:    B 105

I can successfully subset each data table separately by using dt1 <- dt1[, if(.N==1) .SD, by = id] and so on, but instead I would like to update each table at once.

I've tried lapply(list(dt1, dt2, dt3), function(x) x[, if(.N==1) .SD, by = id]) but this does not update my tables in my environment. If I save the result in a list, for example using sub <- lapply(list(dt1, dt2, dt3), function(x) x[, if(.N==1) .SD, by = id]) I can save the data tables as dt1 <- as.data.table(sub[1]) etc., but then I am back to manually coding for each data table, which I am trying to avoid. I also tried lapply(names(sub), function(x) {assign(x, value=sub[[x]], envir=globalenv())}) which just returned list() in my console and also didn't update anything as far as I could tell. Using list2env(sub, envir=.GlobalEnv) as suggested here resulted in an error: "names(x) must be a character vector of the same length as x".

Any suggestions on how to remove the duplicate ID records within my original data tables simultaneously?

ktf
  • 49
  • 5

2 Answers2

1

This might not be the best way to go about it, though it solves your problem. Sometimes we can't avoid the for loop.

y=c("dt1","dt2","dt3")
for(i in 1:3){
  assign(y[i],(get(y[i])[, if(.N==1) .SD, by = id]))
 }

Now try and call the dt1, and the rest

Onyambu
  • 67,392
  • 3
  • 24
  • 53
1

Here is another option with duplicated. We use mget to get the values of the objects in a list, then using duplicated remove the rows where 'id's are duplicates. It is better to keep it in a list, but if it is needed, then update the objects in the global environment with list2env

list2env(lapply(mget(paste0("dt", 1:3)), 
    function(x) x[!(duplicated(id)|duplicated(id, fromLast=TRUE))]), .GlobalEnv)
dt1
#   var1 var2  id
#1:    A  ... 101

dt2
#   var1 var2 var3  id
#1:    F  ...  ... 139

dt3
#   var1  id
#1:    B 105
akrun
  • 874,273
  • 37
  • 540
  • 662