0

I have a data.table that contains several columns that I need to test for equality.

library(data.table)
dt <- data.table(a = 1:5, 
                 b = c("cat1", "cat1", "cat2", "cat4", "cat1"), 
                 c = c("cat2", "cat1", "cat2", "cat4", "cat3"), 
                 d = c("cat3", "cat1", "cat4", "cat4", "cat5"))
> dt
   a    b    c    d
1: 1 cat1 cat2 cat3
2: 2 cat1 cat1 cat1
3: 3 cat2 cat2 cat4
4: 4 cat4 cat4 cat4
5: 5 cat1 cat3 cat5

I want to keep rows where columns b, c, and d do not have the same values. (I want to examine rows where the cats aren't the same across all columns for that row.) In my actual problem, I have many more columns but they have similar names. This seems like it should be quite easy, but I'm coming up short.

This doesn't work.

> dt[length(unique(c(b, c, d))) > 1]
   a    b    c    d
1: 1 cat1 cat2 cat3
2: 2 cat1 cat1 cat1
3: 3 cat2 cat2 cat4
4: 4 cat4 cat4 cat4
5: 5 cat1 cat3 cat5

I thought maybe I need to make a column first and then do the filter, but making that column counting the unique number of values in each row doesn't work. It does it based on the entire set of columns.

dt[, unique_cats := length(unique(c(b, c, d)))]
dt
   a    b    c    d unique_cats
1: 1 cat1 cat2 cat3           5
2: 2 cat1 cat1 cat1           5
3: 3 cat2 cat2 cat4           5
4: 4 cat4 cat4 cat4           5
5: 5 cat1 cat3 cat5           5

Using info from this question, I have also tried to include an .I argument in a by clause but that doesn't work either.

> dt[, 
+    unique_cats := length(unique(c(b, c, d))),
+    by = .I]
> dt
   a    b    c    d unique_cats
1: 1 cat1 cat2 cat3           5
2: 2 cat1 cat1 cat1           5
3: 3 cat2 cat2 cat4           5
4: 4 cat4 cat4 cat4           5
5: 5 cat1 cat3 cat5           5

What am I doing wrong here?

Nick Criswell
  • 1,733
  • 2
  • 16
  • 32
  • 1
    `dt[!((b==c)|(b==d)|(c==d))]` ? – Cath Jul 02 '19 at 14:07
  • 1
    Or `dt[sapply(apply(dt, 1, unique), length) > 2, ]` – warnbergg Jul 02 '19 at 14:09
  • 1
    I guess what you wanted to write in your ssecond attempt was more: `dt[, unique_cats := length(unique(unlist(.SD))), by = 1:nrow(dt), .SDcols=c("b", "c", "d")]` – Cath Jul 02 '19 at 14:11
  • ...or `dt[, unique_cats := length(unique(c(b, c, d))), by = seq_len(nrow(dt))]` – AntoniosK Jul 02 '19 at 14:13
  • 1
    Thank you, all. I think @Cath's second suggestion is what I want. I should have noted in my question that I have more than three columns to compare but that all have similar names so I can regex my way into getting their names but using an OR clause would be difficult to keep track of. I'll make that clarification now. @Cath, please feel free to post as a solution, and I would be extremely grateful for some explanation on how `.SD` works in this context. – Nick Criswell Jul 02 '19 at 14:21
  • 1
    `.SD` represents the columns you name in `.SDcols`, it's your data.table but just the subset of columns. – Cath Jul 02 '19 at 14:26

0 Answers0