1

I am trying to remove duplicates in a data frame of real estate sales data. Sometimes, one deed will contain many parcels as part of the sale. Because the data goes by parcel, one sale shows up as several rows, each different parcels but with the same deed number and sale price - we do not want these in our dataset.

I want to remove them entirely, without retaining a single row. Using duplicated() or unique() will remove the duplicates, but will retain one row of the duplicated values such that the new data frame contains unique values for that column. This will not work for us - we want to get rid of all duplicated rows entirely. So, how can I identify that deed_number and sale price are duplicated and get rid of all of these rows without retaining any?

1 Answers1

2

You could do a count by your grouping variables and only keep the records with a count equal to one:

library(tidyverse)

df <- tibble(
  deed = c(1, 1, 1, 2, 3, 3, 4, 5),
  price = c(10, 10, 10, 20, 30, 30, 40, 50)
)
df
#> # A tibble: 8 x 2
#>    deed price
#>   <dbl> <dbl>
#> 1     1    10
#> 2     1    10
#> 3     1    10
#> 4     2    20
#> 5     3    30
#> 6     3    30
#> 7     4    40
#> 8     5    50

df %>%
  count(deed, price) %>%
  filter(n == 1)
#> # A tibble: 3 x 3
#>    deed price     n
#>   <dbl> <dbl> <int>
#> 1     2    20     1
#> 2     4    40     1
#> 3     5    50     1

Created on 2019-03-20 by the reprex package (v0.2.1)

JasonAizkalns
  • 20,243
  • 8
  • 57
  • 116