mark duplicate values (not only the duplicates)

Question

I am stuck with the following problem. I have two dataframes (df1, df2) I'd like to left_join(df1, df, by=c("a", "b", "c")) using three variables a, b and c. Then, I noticed, that the number of rows in the joined dataframe increased. Therefore I checked, whether there were any duplicate entries:

duplicated(paste(df1$a, df1$b, df1$c)
duplicated(paste(df2$a, df2$b, df2$c)

In both dataframes I found several duplicate entries. Now here comes my question: How can I exclude those duplicates before I join the two dataframes? My problem is that duplicated() only marks the duplicated values (i.e. the second appearance). I would like to exclude the first appearance too. I hope you get the point.

Thank you for help!

So this `df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]`? — Sotos, Apr 13 '18 at 09:08
Possible duplicate of [Remove rows where all values of a column are identical, based on another column](https://stackoverflow.com/questions/49408420/remove-rows-where-all-values-of-a-column-are-identical-based-on-another-column) and https://stackoverflow.com/questions/38232998/how-to-remove-rows-that-have-only-1-combination-for-a-given-id?noredirect=1&lq=1 — jogo, Apr 13 '18 at 09:14

zoowalk · Answer 1 · 2018-04-13T10:54:14.463

0

difficult to provide a specifc answer without example, but one way might be to create a count variable for each combination (>1 when duplicate) and then filter these observations.

library(tidyverse)
x <- LETTERS[c(1,1:5)]
y <- LETTERS[c(1,1:5)]
df <- data.frame(x,y)

df <- df %>% 
  group_by(x,y) %>% 
  mutate(freq=n()) %>% 
  filter(!freq>1)

edited Apr 13 '18 at 10:54

answered Apr 13 '18 at 09:04

zoowalk

2,018
20
33

1

or just `df %>% group_by(x, y) %>% filter(n() == 1)` – Axeman Apr 13 '18 at 09:20

mark duplicate values (not only the duplicates)

1 Answers1