I have the following data.table
:
library(data.table)
dt = data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
c(4, 4, 4, 5, 5, 6, 7, 4, 5))
V1 V2
1: 1 4
2: 1 4
3: 1 4
4: 2 5
5: 2 5
6: 2 6
7: 2 7
8: 3 4
9: 4 5
I want to study the different values of V2
for a given V1
. However, if all values of V2
for a given V1
are the same, that doesn't interest me, so I want to remove such rows.
Looking at the example above, the first three rows are perfectly identical (V1=1
, V2=4
), so I wish to remove them.
However, the next four rows include two identical rows and others with a different V2
. In this case, I want to show the three possible values of V2
given V1 = 2
: (2, 5)
, (2, 6)
and (2, 7)
.
The last two rows have unique V1
: that falls under the category of "all rows are perfectly identical", and so should be removed as well.
The best I could think of is shown in this answer:
dt[!duplicated(dt) & !duplicated(dt, fromLast = TRUE), ]
V1 V2
1: 2 6
2: 2 7
3: 3 4
4: 4 5
Which obviously isn't satisfactory: it removes the (2,5)
pair, since it is duplicated, and it keeps the (3,4)
and (4,5)
pairs since they're unique and therefore not flagged by either duplicated()
pass.
The other option would be simply calling
unique(dt)
V1 V2
1: 1 4
2: 2 5
3: 2 6
4: 2 7
5: 3 4
6: 4 5
But it keeps the (1,4)
, (3,4)
, (4,5)
pairs I want removed.
In the end, the result I'm looking for is:
V1 V2
1: 2 5
2: 2 6
3: 2 7
Though any other format is also acceptable, such as:
V1 V2.1 V2.2 V2.3
1: 2 5 6 7
(which shows the possible values of V2
for each "interesting" V1
)
I can't figure out how to differentiate the (1,4)
case (all rows are the same) from the (2,5)
case (there are some duplicates, but there are other rows with the same V1
, so we must remove the duplicate (2,5)
but leave one copy).
As for the unique rows, I've written a very ugly call, but it only works if there's only one unique row. If there's two, such as the example above, it fails.