
I am removing duplicates from a scraped dataset. I want to create a dummy variable indicating whether an observation has at least one other observation that is identical to it on all variables, or on all variables but one.

Here's an example of my dataset:

Postcode nrooms price sqm
76 1 259 30
75 5 380 120
75 5 400 120
75 2 450 80
76 1 259 30

Here's the dummy I want:

Postcode nrooms price sqm dummy
76 1 259 30 1
75 5 380 120 1
75 5 400 120 1
75 2 450 80 0
76 1 259 30 1

Here the first and last rows have the same values for all characteristics, while the second and third rows have the same values for all characteristics but one (the price).

Could someone help me with this?

Thanks!

Daniela

1 Answer


You can do this with two apply calls and the duplicated function (see this previous SO answer). The inner apply loops over all combinations of columns of size ncol - 1 and flags duplicated rows using duplicated. A row that is duplicated on all columns is also duplicated on every subset of ncol - 1 columns, so checking only the combinations of size ncol - 1 covers both of your cases. The outer apply then loops over the rows of that result to find out whether each row has a duplicate for any of the column combinations.

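# Inner apply: for each combination of ncol - 1 columns, flag every row
# that has a duplicate (fromLast = TRUE also marks the first occurrence).
# Outer apply: a row gets 1 if it is flagged for any combination.
apply(
  apply(combn(ncol(dat), ncol(dat) - 1),
        2,
        FUN = function(cc)
          duplicated(dat[,cc]) | duplicated(dat[,cc], fromLast = TRUE)),
  1,
  max)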

# [1] 1 1 1 0 1
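
The snippet above assumes a data frame called dat already exists. Here is a minimal self-contained sketch that rebuilds the example from the question and stores the result as a new column. The column names come from the question; the helper names cols and flags are just illustrative, and the numeric column types are an assumption about how the scraped data was read in.

```r
# Example data reconstructed from the question (column types are assumed)
dat <- data.frame(
  Postcode = c(76, 75, 75, 75, 76),
  nrooms   = c(1, 5, 5, 2, 1),
  price    = c(259, 380, 400, 450, 259),
  sqm      = c(30, 120, 120, 80, 30)
)

# Fix the columns to compare up front, so that adding `dummy` later
# does not change which columns are checked if the code is re-run
cols <- names(dat)

# One column of logical flags per "leave one column out" combination
flags <- apply(
  combn(length(cols), length(cols) - 1),
  2,
  function(cc) {
    sub <- dat[, cols[cc], drop = FALSE]
    duplicated(sub) | duplicated(sub, fromLast = TRUE)
  }
)

# A row gets 1 if it is flagged under at least one combination
dat$dummy <- as.integer(apply(flags, 1, any))
dat$dummy
```

Using any and as.integer instead of max gives the same 0/1 values, but makes the "flagged for at least one combination" intent a bit more explicit.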

As always with a loop inside a loop, it can be helpful to step through each part. Inspect the output of combn(ncol(dat), ncol(dat) - 1) first, and then the output of the inner apply.
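
A rough sketch of that step-through, again rebuilding the question's example as dat (the column types are an assumption). The names combos and inner are only there for illustration.

```r
# Example data reconstructed from the question (column types are assumed)
dat <- data.frame(
  Postcode = c(76, 75, 75, 75, 76),
  nrooms   = c(1, 5, 5, 2, 1),
  price    = c(259, 380, 400, 450, 259),
  sqm      = c(30, 120, 120, 80, 30)
)

# Step 1: each column of `combos` is one set of column indices to compare,
# i.e. all four columns with one of them left out
combos <- combn(ncol(dat), ncol(dat) - 1)
combos
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    2
# [2,]    2    2    3    3
# [3,]    3    4    4    4

# Step 2: one logical column per combination, one row per observation.
# The second combination (1, 2, 4) leaves out price, so it is the one
# that flags rows 2 and 3 in addition to rows 1 and 5.
inner <- apply(
  combos,
  2,
  function(cc) duplicated(dat[, cc]) | duplicated(dat[, cc], fromLast = TRUE)
)
inner

# Step 3: collapse across combinations, one value per observation
apply(inner, 1, max)
```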

bouncyball