Remove any data frame rows containing a value in one column which has multiple matches in another column

Question

Let's say I have this (simplified) data frame:

C1 <- c('a','a','b','b','c','c')
C2 <- c(10,10,20,21,30,30)
C3 <- c(1.1,2.2,3.3,4.4,5.5,6.6)
df <- data.frame(C1,C2,C3)

C1	C2	C3
a	10	1.1
a	10	2.2
b	20	3.3
b	21	4.4
c	30	5.5
c	30	6.6

What I'm trying to do is to delete any rows containing a C1 value which has more than one match in the C2 column. In this case I would like to delete the entire rows containing 'b' in the C1 column (because 'b' has two matches - both 20 and 21 - in column C2).

This should result with this df:

C1	C2	C3
a	10	1.1
a	10	2.2
c	30	5.5
c	30	6.6

Any help would be really appreciated!

Thanks,

Yuval

It seems as if you've removed the *only* rows that did not match anything, but I'm probably missing something. You say you want to remove where something in `C1` matches something in `C2`, yet none of `c('a','b','c')` match any of `c(10,10,20,21,30,30)`. — r2evans, Apr 08 '21 at 18:21
Maybe "matches" was not a great expression. The point is that 'a' in C1 has only one "pair" in C2 (which is 10) while 'b' has more than one. — Yuval Harris, Apr 09 '21 at 09:14

score 1 · Answer 1 · answered Apr 08 '21 at 18:25

1

Subset is one way to do this:

C1 <- c('a','a','b','b','c','c')
C2 <- c(10,10,20,21,30,30)
C3 <- c(1.1,2.2,3.3,4.4,5.5,6.6)
df <- data.frame(C1,C2,C3)

df_table <- table(df$C2)
df_subset <- subset(df, C2 %in% names(df.table[df.table > 1]))
df_subset

answered Apr 08 '21 at 18:25

windyvation

497
3
13

score 1 · Accepted Answer · answered Apr 08 '21 at 18:56

1

dplyr is another way to do this. Use group_by to process each C1 group separately, then filter each group, keeping only groups with a single value of C2

library(dplyr)

C1 <- c('a','a','b','b','c','c')
C2 <- c(10,10,20,21,30,30)
C3 <- c(1.1,2.2,3.3,4.4,5.5,6.6)
df <- data.frame(C1,C2,C3)

df <- df %>%
    group_by(C1) %>%
    filter(length(unique(C2)) == 1) %>%
    ungroup()

print(df)

Output

# A tibble: 4 x 3
  C1       C2    C3
  <chr> <dbl> <dbl>
1 a        10   1.1
2 a        10   2.2
3 c        30   5.5
4 c        30   6.6

answered Apr 08 '21 at 18:56

Damian

1,385
10
10

Thank you so much! Was wondering if another help is possible regarding this: Any suggestions what should I do if I would want only to delete the unmatching row of 'b'? In this case, to keep only the row of 'b'-20-3.3 and delete the other row of 'b'-21-4.4 – Yuval Harris Apr 11 '21 at 13:32
I'm not sure what the exact logic should be, If you can describe it more it should possible to write a logical statement to get the desired subset. Example 1: `length(unique(C2)) == 1` subsets for groups with only one C2 value. Example 2: `length(unique(C2)) != 1 & C2 == min(C2)` would subset for rows with the lowest C2 value, but only for groups with multiple C2 values. You may have to get creative. You might want to open another question so you can provide a complete example that people can assist with or comment on. – Damian Apr 12 '21 at 16:04

Remove any data frame rows containing a value in one column which has multiple matches in another column

2 Answers2