Removing all duplicates in data.table, adding column with list of identifiers

Question

Building on my previous question, I have the following data.table:

> dt = data.table(V1 = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4),
                  V2 = c(4, 4, 4, 5, 5, 6, 6, 7, 7, 4, 5),
                  ID = c(1, 2, 3, 1, 2, 2, 2, 1, 3, 1, 1))
    V1 V2 ID
 1:  1  4  1
 2:  1  4  2
 3:  1  4  3
 4:  2  5  1
 5:  2  5  2
 6:  2  6  2
 7:  2  6  2
 8:  2  7  1
 9:  2  7  3
10:  3  4  1
11:  4  5  1

with many duplicate rows (when looking at V1 and V2).

I want to remove all the "entirely duplicate" rows (those where V2 is always the same for a given V1), including the ones with a unique V1 (since all one V2 is equal to itself).

I was given many solutions in the previous question linked above, including @akrun's (modified for this new table):

> newDT = unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1], by = c("V1", "V2"))
   V1 V2 ID
1:  2  5  1
2:  2  6  2
3:  2  7  3

However, I actually need to store a list of IDs which had a given result. I can collect the list of IDs for a given pair of (V1, V2) trivially with

> unique(dt[V1 == 2 & V2 == 5, ID])
[1] 1 2

(unique is necessary because a given (V1,V2) pair may also have duplicate ID, see dt rows 6:7) and add it with

newDT[, ID := .(.(unique(dt[V1 == 2 & V2 == 5, ID])))]

but I can't figure out how to "iterate" over each (V1,V2) pair.

In summary, I need the output to be

> newDT
   V1 V2      ID
1:  2  5  (1, 2)
2:  2  6     (2)
3:  2  7  (1, 3)

score 5 · Accepted Answer · answered May 24 '19 at 15:03

5

After subsetting the rows of data, grouped by 'V1', 'V2', create a list of unique 'ID'

dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1][, .(ID = list(unique(ID))), .(V1, V2)]
#   V1 V2  ID
#1:  2  5 1,2
#2:  2  6   2
#3:  2  7 1,3

answered May 24 '19 at 15:03

akrun

874,273
37
540
662

score 0 · Answer 2 · answered May 27 '19 at 09:17

dt[, .(ID = .(unique(ID))), by = .(V1, V2)
   ][dt[, .(V2 = if (uniqueN(V2) > 1) unique(V2) else numeric(0)), by = V1], on = .(V1, V2)]

Broken down into steps:

# Extract our pairs of interests first
our_pairs <- dt[, .(V2 = if (uniqueN(V2) > 1) unique(V2) else numeric(0)), by = V1]
our_pairs
#    V1 V2
# 1:  2  5
# 2:  2  6
# 3:  2  7

# Aggregate all ID's for each pair on original data
ids_forpairs <- dt[, .(ID = .(unique(ID))), by = .(V1, V2)]
ids_forpairs
#    V1 V2    ID
# 1:  1  4 1,2,3
# 2:  2  5   1,2
# 3:  2  6     2
# 4:  2  7   1,3
# 5:  3  4     1
# 6:  4  5     1

# Then filter by our pairs of interest
ids_forpairs[our_pairs, on = .(V1, V2)]
#    V1 V2  ID
# 1:  2  5 1,2
# 2:  2  6   2
# 3:  2  7 1,3

Removing all duplicates in data.table, adding column with list of identifiers

2 Answers2