R - Remove all rows that do not match certain criteria

Question

Below is what my data looks like. My goal is to find all rows that have matching x, z, and m, but different y. And I need to keep both, or all, of the ones that have the differing y.

How can I do that?

x <- c("A","B","C","B","D","E","E")
y <- c(0,10,10,10,10,12,0)
z <- c("A1","B1","B1","B1","B1","C1","C1")
m <- c(rep("2017-12-28",7))

df <- data.frame(x,y,z,m)
df

# Below is the goal
df[6:7,]

What are you comparing to? What is a value of a row the same as or different from? — camille, Nov 03 '21 at 00:08
Rows 6 and 7 shown in the goal are matching x, z, and m, but different y. So it is everything matching, but different y, as long as all different y rows are kept. — Indescribled, Nov 03 '21 at 00:12
You still have not defined whatever you are calling a `match` — GuedesBF, Nov 03 '21 at 00:13
@GuedesBF - I reckon it's `x/z/m` are used to define a group by their values, and then checking if there are multiple unique `y` inside that `x/z/m` group. — thelatemail, Nov 03 '21 at 00:18
matching = identical. Row 1 can go because there is only one instance of that specific x, z, and m on that date, so there cannot be different y values. Row 2 can go because while x, z, and m are identical to row 4, y is the same. Row 4 goes for the same reason. Does that help? — Indescribled, Nov 03 '21 at 00:18
Pretty sure this is a duplicate if something like `df[ave(df[["y"]], df[c("x","z","m")], FUN=\(x) length(unique(x)) ) > 1,]` does what you want. Maybe https://stackoverflow.com/questions/50953598/r-group-by-count-distinct-values-grouping-by-another-column essentially or https://stackoverflow.com/questions/12840294/counting-unique-distinct-values-by-group-in-a-data-frame? — thelatemail, Nov 03 '21 at 00:20
That code gave me an error, but I can try something like `df %>% distinct() %>% count(x) %>% filter(n > 1)` But that does not actually get rid of the rows and just tells me one of the three variables to filter on to remove unwanted rows. So I have to save the output x values, repeat the code for z, and m, then filter and it seems convoluted. But it might work — Indescribled, Nov 03 '21 at 00:31
The second question I linked would suggest - `df %>% group_by(x,z,m) %>% filter(n_distinct(y) > 1)` — thelatemail, Nov 03 '21 at 00:46

r2evans · Accepted Answer · 2021-11-03T00:53:27.563

base R

df[ave(df$y, df[,c("x","z","m")], FUN = function(y) length(unique(y))) > 1,]
#   x  y  z          m
# 6 E 12 C1 2017-12-28
# 7 E  0 C1 2017-12-28

Note that due to the way ave coerces its return value to the same class as the first argument, if y is something other than numeric or integer, this may not work perfectly as desired.

Also, for code-golf or readability, you can replace the FUN= argument with one of FUN=dplyr::n_distinct or data.table::uniqueN, if you prefer an ave solution and yet have one of those packages loaded.

dplyr

library(dplyr)
df %>%
  group_by(x, z, m) %>%
  filter(n_distinct(y) > 1) %>%
  ungroup()
# # A tibble: 2 x 4
#   x         y z     m         
#   <chr> <dbl> <chr> <chr>     
# 1 E        12 C1    2017-12-28
# 2 E         0 C1    2017-12-28

data.table

library(data.table)
as.data.table(df)[, .SD[uniqueN(y) > 1,], by = .(x, z, m)]
#         x      z          m     y
#    <char> <char>     <char> <num>
# 1:      E     C1 2017-12-28    12
# 2:      E     C1 2017-12-28     0

Data

df <- structure(list(x = c("A", "B", "C", "B", "D", "E", "E"), y = c(0, 10, 10, 10, 10, 12, 0), z = c("A1", "B1", "B1", "B1", "B1", "C1", "C1"), m = c("2017-12-28", "2017-12-28", "2017-12-28", "2017-12-28", "2017-12-28", "2017-12-28", "2017-12-28")), class = "data.frame", row.names = c(NA, -7L))

Not sure the joining back is necessary, you could just `select` if the order is important - `df %>% group_by(x,z,m) %>% filter(n_distinct(y) > 1) %>% select(names(df))` — thelatemail, Nov 03 '21 at 00:47
I was in the process of testing something like that when you commented, thanks for the final nudge in the right direction :-) — r2evans, Nov 03 '21 at 00:48

R - Remove all rows that do not match certain criteria

1 Answers1

base R

dplyr

data.table