
I am trying to filter a grouped data frame where:

  • Every duplicate in one column ("data_val"), including the first instance of the duplicate, is dropped.
  • The values in the other columns for those rows are dropped as well (even if those values are not themselves duplicates).

EDIT: I was originally grouping by "sources" AND "db_source" but have found that I shouldn't be.

So, given:

example_data <- data.frame(sources = rep(c("S1", "S2", "S3"), each = 8),
                           data_val = rep(c(1, 1, 2, 3, 4, 4, 5, 6), 3),
                           db_source = rep(c("DB1", "DB2"), 12))

       sources data_val db_source
1       S1        1       DB1
2       S1        1       DB2
3       S1        2       DB1
4       S1        3       DB2
5       S1        4       DB1
6       S1        4       DB2
7       S1        5       DB1
8       S1        6       DB2
9       S2        1       DB1
10      S2        1       DB2
11      S2        2       DB1
12      S2        3       DB2
13      S2        4       DB1
14      S2        4       DB2
15      S2        5       DB1
16      S2        6       DB2
17      S3        1       DB1
18      S3        1       DB2
19      S3        2       DB1
20      S3        3       DB2
21      S3        4       DB1
22      S3        4       DB2
23      S3        5       DB1
24      S3        6       DB2

I want the following:

       sources data_val db_source
3       S1        2       DB1
4       S1        3       DB2
7       S1        5       DB1
8       S1        6       DB2
11      S2        2       DB1
12      S2        3       DB2
15      S2        5       DB1
16      S2        6       DB2
19      S3        2       DB1
20      S3        3       DB2
23      S3        5       DB1
24      S3        6       DB2

I tried using functions like `duplicated()` and `distinct()` in my pipe, but they return the following:

       sources data_val db_source   
 1      S1        1       DB1      
 2      S1        2       DB1      
 3      S1        3       DB2      
 4      S1        4       DB1      
 5      S1        5       DB1      
 6      S1        6       DB2      
 7      S2        1       DB1      
 8      S2        2       DB1      
 9      S2        3       DB2      
10      S2        4       DB1      
11      S2        5       DB1      
12      S2        6       DB2      
13      S3        1       DB1      
14      S3        2       DB1      
15      S3        3       DB2      
16      S3        4       DB1      
17      S3        5       DB1      
18      S3        6       DB2

I understand the listed functions return the above because they check the vectors sequentially, but if there is a way to drop the first instance as well, that would be great.
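For reference, the asymmetry is visible on a plain vector: `duplicated()` flags only the second and later occurrences, so `!duplicated(x)` always keeps the first instance. Flagging every occurrence takes a forward and a backward pass (a minimal sketch):

```r
x <- c(1, 1, 2, 3, 4, 4, 5, 6)

duplicated(x)
# flags positions 2 and 6 only (the repeats), never the first instance

# a forward pass plus a backward pass flags all occurrences of a duplicate:
x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]
# 2 3 5 6
```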

Thank you to anyone who can help.

  • According to your grouping of `source` and `db_source`, there are no duplicates in this data. Perhaps you shouldn't be grouping on `db_source`? – r2evans May 08 '22 at 18:11
  • Ahh, oops. Didn't set up the initial object output correctly. Basically, "source" relates to specific locations and "db_source" relates to different versions of a database that were being created and edited incongruently resulting in discrepancies between databases (using Access). I am trying to find everything in one database not in the other, and vice versa. I made a join combining the data, and I was trying to delete every row where both databases contained the same "data_val". – CodeToDrum May 08 '22 at 20:55
  • Just added an edit to the initial problem statement. Hope it makes things a little more clear. Sorry about that! – CodeToDrum May 08 '22 at 20:58
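Tangentially, the database-discrepancy goal described in the comments (rows present in one database version but not the other) can also be expressed directly with anti-joins, before any duplicate filtering; `db1` and `db2` here are hypothetical stand-ins for the two Access extracts:

```r
library(dplyr)

# hypothetical extracts of the two database versions
db1 <- data.frame(sources = c("S1", "S1"), data_val = c(1, 2))
db2 <- data.frame(sources = c("S1", "S1"), data_val = c(1, 3))

# rows of db1 with no match in db2, and vice versa
only_in_db1 <- anti_join(db1, db2, by = c("sources", "data_val"))
only_in_db2 <- anti_join(db2, db1, by = c("sources", "data_val"))
```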

2 Answers


dplyr

library(dplyr)
example_data %>%
  group_by(sources, data_val) %>%
  filter(n() < 2) %>%
  ungroup()
# # A tibble: 12 x 3
#    sources data_val db_source
#    <chr>      <dbl> <chr>    
#  1 S1             2 DB1      
#  2 S1             3 DB2      
#  3 S1             5 DB1      
#  4 S1             6 DB2      
#  5 S2             2 DB1      
#  6 S2             3 DB2      
#  7 S2             5 DB1      
#  8 S2             6 DB2      
#  9 S3             2 DB1      
# 10 S3             3 DB2      
# 11 S3             5 DB1      
# 12 S3             6 DB2      
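If you prefer the `duplicated()` vocabulary from the question, the same 12 rows come out of a two-pass `duplicated()` inside `filter()` (a sketch, equivalent to the `n() < 2` approach above):

```r
library(dplyr)

example_data %>%
  group_by(sources) %>%
  # flag both the repeats and the first instance, then drop all of them
  filter(!(duplicated(data_val) | duplicated(data_val, fromLast = TRUE))) %>%
  ungroup()
```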

base R

This works:

example_data[ave(seq_len(nrow(example_data)), example_data[c("sources","data_val")], FUN=length) < 2,]
#    sources data_val db_source
# 3       S1        2       DB1
# 4       S1        3       DB2
# 7       S1        5       DB1
# 8       S1        6       DB2
# 11      S2        2       DB1
# 12      S2        3       DB2
# 15      S2        5       DB1
# 16      S2        6       DB2
# 19      S3        2       DB1
# 20      S3        3       DB2
# 23      S3        5       DB1
# 24      S3        6       DB2
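A base-R variant of the same filter without `ave()`, building a composite key so duplicates are judged per `sources`/`data_val` pair (a sketch; same 12-row result):

```r
# one factor level per sources/data_val combination
key <- interaction(example_data$sources, example_data$data_val)

# drop every occurrence of a duplicated key, including the first
example_data[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]
```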
  • Hey there. Thank you for the initial solutions. My apologies in advance for not providing a better example, but I just made some edits to further elucidate my problem. Those solutions worked for the initial structure, but I hope my updated edit makes more sense. Basically, where I find duplicates in the "data_val" column, I want to drop every observation including the initial value. – CodeToDrum May 08 '22 at 22:20
  • See my edits, perhaps it helps? – r2evans May 08 '22 at 22:58
  • There is another post that answers this question (which they closed this post for): https://stackoverflow.com/questions/13763216/how-can-i-remove-all-duplicates-so-that-none-are-left-in-a-data-frame. At least, one of the lower rated replies follows your dplyr logic, which I preferred more than the top rated solution on that post. I think your base r solution makes more sense for my purposes than theirs, but I guess its a matter of style at that point. Thank you for the help! – CodeToDrum May 08 '22 at 23:09
  • Yup, I saw the closure (not surprising, I just hadn't looked yet, this is not an uncommon question ... the biggest reason I didn't look earlier was because the logic was a bit confusing, and providing the interim answers seemed a good step to help figure that out. Despite the closure, I edited in the changes in case they are helpful. Having it closed doesn't stop such actions like "answering" or "editing", it just means this question won't percolate to the top of Stack searches. Either way, I hope it helps. – r2evans May 08 '22 at 23:18
  • Yeah, have been trying to sort the logic out for a few days, but you 100% helped. I appreciate it. Have one "to-do" at work I can cross off quickly tomorrow morning! Cheers. – CodeToDrum May 08 '22 at 23:32

Another option would be to use the `lag()` function:

library(dplyr)

example_data %>%
  group_by(sources) %>% 
  filter(data_val != lag(data_val)) %>% 
  ungroup()


  sources data_val db_source
  <chr>      <dbl> <chr>    
1 S1             2 DB1      
2 S1             3 DB2      
3 S2             2 DB1      
4 S2             3 DB2  