I have the following data.table:

library(data.table)
dt = data.table(c(1, 1, 1, 2, 2, 2, 2, 3, 4),
                c(4, 4, 4, 5, 5, 6, 7, 4, 5))
dt
   V1 V2
1:  1  4
2:  1  4
3:  1  4
4:  2  5
5:  2  5
6:  2  6
7:  2  7
8:  3  4
9:  4  5

I want to study the different values of V2 for a given V1. However, if all values of V2 for a given V1 are the same, that doesn't interest me, so I want to remove such rows.

Looking at the example above, the first three rows are perfectly identical (V1=1, V2=4), so I wish to remove them.

However, the next four rows include two identical rows (V1=2, V2=5) and two others with different values of V2. In this case, I want to keep the three distinct pairs for V1 = 2: (2, 5), (2, 6) and (2, 7).

The last two rows have unique V1 values, so each has only one possible V2: that falls under the category of "all values of V2 are the same", and so they should be removed as well.

The best I could think of is shown in this answer:

dt[!duplicated(dt) & !duplicated(dt, fromLast = TRUE), ]
   V1 V2
1:  2  6
2:  2  7
3:  3  4
4:  4  5

This obviously isn't satisfactory: it removes the (2,5) pair entirely, since it is duplicated, and it keeps the (3,4) and (4,5) pairs, since they're unique and therefore not flagged by either duplicated() pass.
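
For concreteness, the two flag vectors on this data are:

duplicated(dt)                   # flags rows 2, 3 and 5 (the later copies)
# [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE
duplicated(dt, fromLast = TRUE)  # flags rows 1, 2 and 4 (the earlier copies)
# [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

Only rows where both vectors are FALSE are kept, which is exactly rows 6 through 9 above.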

The other option would be to simply call

unique(dt)
   V1 V2
1:  1  4
2:  2  5
3:  2  6
4:  2  7
5:  3  4
6:  4  5

But it keeps the (1,4), (3,4), (4,5) pairs I want removed.

In the end, the result I'm looking for is:

   V1 V2
1:  2  5
2:  2  6
3:  2  7

Though any other format is also acceptable, such as:

   V1 V2.1 V2.2 V2.3
1:  2    5    6    7

(which shows the possible values of V2 for each "interesting" V1)

I can't figure out how to differentiate the (1,4) case (all rows are the same) from the (2,5) case (there are some duplicates, but there are other rows with the same V1, so we must remove the duplicate (2,5) but leave one copy).

As for the unique rows, I've written a very ugly call, but it only works if there's only one unique row. If there are two, as in the example above, it fails.

Wasabi

4 Answers


One option would be to group by 'V1', get the row indices of the groups where the number of unique elements is greater than 1, and then take the unique rows:

unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
#   V1 V2
#1:  2  5
#2:  2  6
#3:  2  7
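
To see the intermediate step: the inner call builds an index table whose i1 column holds the row numbers of groups with more than one unique V2 (rows 4 to 7 here); extracting $i1, subsetting dt with it, and wrapping with unique then drops the duplicated (2, 5) row:

dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]
#    V1 i1
# 1:  2  4
# 2:  2  5
# 3:  2  6
# 4:  2  7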

Or, as @r2evans mentioned:

unique(dt[, .SD[(uniqueN(V2) > 1)], by = "V1"])

NOTE: The OP's dataset is a data.table, and data.table methods are the natural way of doing this.
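
If the wide layout mentioned in the question is also of interest, a sketch with dcast() and rowid() (the prefix argument only mimics the V2.1/V2.2/V2.3 names shown in the question):

res <- unique(dt[dt[, .(i1 = .I[uniqueN(V2) > 1]), V1]$i1])
dcast(res, V1 ~ rowid(V1, prefix = "V2."), value.var = "V2")
#    V1 V2.1 V2.2 V2.3
# 1:  2    5    6    7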


If we need a tidyverse option, one comparable to the data.table approach above is:

library(dplyr)
dt %>%
   group_by(V1) %>% 
   filter(n_distinct(V2) > 1) %>% 
   distinct()
akrun
  • I was just about to post `unique(dt[,.SD[ (length(unique(V2)) > 1), ], by = "V1" ])` ... and now just learned about `uniqueN`, thanks :-) – r2evans May 23 '19 at 21:20
  • Perhaps clearer? `unique(dt[, .SD[ (uniqueN(V2) > 1), ], by = "V1"])` – r2evans May 23 '19 at 21:21
  • @r2evans That is clean, but I think the `.I` approach would be faster, based on my previous benchmark experiments – akrun May 23 '19 at 21:22
  • Good to know, thanks. My experience with `data.table` doesn't yet hold an intuitive sense of comparative performance. ~10% improvement (median) is pretty good, wonder how it scales to "real" (not-small) datasets. – r2evans May 23 '19 at 21:23
  • @r2evans There's a benchmark on a relatively large table in [this question](https://stackoverflow.com/questions/50093919/dplyrslice-in-data-table). Another option (not that more are needed) is `unique(dt[, if(uniqueN(V2) > 1) .SD, by = "V1"])` – IceCreamToucan May 23 '19 at 21:29
  • This works like a charm, of course. I'm having trouble understanding how it works, though... How does `.I[uniqueN(V2) > 1]` work? `uniqueN(V2) > 1` returns `TRUE` (since there's more than one unique value), but when wrapped by `.I[]` (aka `seq_along`), it becomes `1, 2, ... nrows`. Is this a different `>` operator I'm unaware of? – Wasabi May 23 '19 at 21:43
  • @Wasabi What it does is: `uniqueN(V2) > 1` gets a logical vector of length 1 for each group of 'V1'. Wrapping it with `.I` gives the row indices of all the rows for that V1 (and no rows when the condition is FALSE). We then extract that row index column with `$i1`, use it in `i` for subsetting the rows, and wrap the result with `unique` – akrun May 23 '19 at 21:45

Another dplyr possibility:

dt %>%
 group_by(V1) %>%
 filter(n_distinct(V2) != 1 & !duplicated(V2))

     V1    V2
  <dbl> <dbl>
1     2     5
2     2     6
3     2     7

Or:

dt %>%
 group_by(V1) %>%
 filter(n_distinct(V2) != 1) %>%
 group_by(V1, V2) %>%
 slice(1)
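
A closely related formulation, using dplyr's distinct() instead of the second group_by/slice step (grouping variables are retained by distinct() automatically):

dt %>%
 group_by(V1) %>%
 filter(n_distinct(V2) != 1) %>%
 distinct(V2)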
tmfmnk

In your case, with base R:

dt[ave(dt$V2, dt$V1, FUN = function(x) length(unique(x))) > 1 & !duplicated(dt)]
   V1 V2
1:  2  5
2:  2  6
3:  2  7
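
To unpack the ave() step: it returns, for every row, the number of distinct V2 values within that row's V1 group, so comparing against 1 keeps only the rows of groups that actually vary:

ave(dt$V2, dt$V1, FUN = function(x) length(unique(x)))
# [1] 1 1 1 3 3 3 3 1 1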
BENY
  • A small change to make it clearer could be `dt[with(dt,ave(V2,V1,FUN=function(x) length(unique(x)))>1&!duplicated(dt))]`. – tmfmnk May 23 '19 at 21:35
  • What I want to know: with all of that base-R code-golf going on, you keep one space in the middle? ;-) – r2evans May 23 '19 at 21:54
  • They have to save the extra whitespace for all of that mandatory indentation? (Full disclosure: I use python too, just not as proficient/vocal.) – r2evans May 23 '19 at 21:59
  • @WeNYoBen, I'd be really impressed with a pandas version of this :-P – r2evans May 23 '19 at 21:59
  • @r2evans keeping the space is a good coding habit which I should improve :-) `df[df.groupby('V1').V2.transform('nunique').gt(1)].drop_duplicates()` – BENY May 23 '19 at 22:21

Using an if statement allows for more concision and is, arguably, more data.table'ly:

dt[, if (uniqueN(V2) > 1) unique(V2), by = V1]

#    V1 V1
# 1:  2  5
# 2:  2  6
# 3:  2  7

But couldn't get the column names right...
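(The unnamed j result gets data.table's default column name V1, which clashes with the by column; naming the expression explicitly, as in the first alternative below, fixes that.)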

A bit less concise solutions:

dt[, .(V2 = if (uniqueN(V2) > 1) unique(V2) else numeric(0)), by = V1]

dt[, .SD[if (uniqueN(V2) > 1) !duplicated(V2)], by = V1]

#    V1 V2
# 1:  2  5
# 2:  2  6
# 3:  2  7
s_baldur