I am trying to filter a grouped data frame where:
- Every duplicate in one column ("data_val") including the first instance of the duplicate is dropped.
- The row values in the columns corresponding to the duplicate columns are deleted as well (even if they are not duplicates).
EDIT: I was originally grouping by "sources" AND "db_source" but have found that I shouldn't be.
So, given:
example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 8),
data_val = rep(c(1, 1, 2, 3, 4, 4, 5, 6), 3),
db_source = rep(c("DB1", "DB2"), 12))
sources data_val db_source
1 S1 1 DB1
2 S1 1 DB2
3 S1 2 DB1
4 S1 3 DB2
5 S1 4 DB1
6 S1 4 DB2
7 S1 5 DB1
8 S1 6 DB2
9 S2 1 DB1
10 S2 1 DB2
11 S2 2 DB1
12 S2 3 DB2
13 S2 4 DB1
14 S2 4 DB2
15 S2 5 DB1
16 S2 6 DB2
17 S3 1 DB1
18 S3 1 DB2
19 S3 2 DB1
20 S3 3 DB2
21 S3 4 DB1
22 S3 4 DB2
23 S3 5 DB1
24 S3 6 DB2
I want the following:
sources data_val db_source
3 S1 2 DB1
4 S1 3 DB2
7 S1 5 DB1
8 S1 6 DB2
11 S2 2 DB1
12 S2 3 DB2
15 S2 5 DB1
16 S2 6 DB2
19 S3 2 DB1
20 S3 3 DB2
23 S3 5 DB1
24 S3 6 DB2
I tried using functions like `duplicated()` and `distinct()` in my pipe, but they will return the following:
sources data_val db_source
1 S1 1 DB1
2 S1 2 DB1
3 S1 3 DB2
4 S1 4 DB1
5 S1 5 DB1
6 S1 6 DB2
7 S2 1 DB1
8 S2 2 DB1
9 S2 3 DB2
10 S2 4 DB1
11 S2 5 DB1
12 S2 6 DB2
13 S3 1 DB1
14 S3 2 DB1
15 S3 3 DB2
16 S3 4 DB1
17 S3 5 DB1
18 S3 6 DB2
I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance that would be great.
Thank you to anyone who can help.