How to create a dummy variable based on other columns' values in R?

Question

I am removing duplicates from a scraped dataset. I want to create a dummy variable indicating whether I have two or more observations that are identical in all characteristics, or in all characteristics but one.

Here's an example of my dataset:

Postcode	nrooms	price	sqm
76	1	259	30
75	5	380	120
75	5	400	120
75	2	450	80
76	1	259	30

Here's the dummy I want:

Postcode	nrooms	price	sqm	dummy
76	1	259	30	1
75	5	380	120	1
75	5	400	120	1
75	2	450	80	0
76	1	259	30	1

The first and last rows have the same values for all characteristics; the second and third have the same values for all characteristics but one (the price).

Could someone help me with this?

Thanks!

Is this grouped by `postcode`? – Rui Barradas, Feb 25 '21 at 16:38

bouncyball · Answer 1 · 2021-02-25T16:44:56.397

This uses two apply calls and the duplicated function (see this previous SO answer). We loop over all combinations of columns of size ncol - 1, looking for duplicates with duplicated. Since you're looking for rows that match on all columns or on all but one, checking combinations of size ncol - 1 is enough: rows that are identical on every column are also identical on every subset of columns. Then we loop over the result of that operation to find out whether each row has a duplicate for any of the column combinations.

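# Inner apply: for each set of ncol - 1 columns, flag rows duplicated on that set.
# Outer apply: a row gets 1 if it is flagged for at least one column set.
apply(
  apply(combn(ncol(dat), ncol(dat) - 1),
        2,
        FUN = function(cc)
          duplicated(dat[, cc]) | duplicated(dat[, cc], fromLast = TRUE)),
  1,
  max)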

# [1] 1 1 1 0 1
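
For a reproducible version, here is a minimal sketch that rebuilds the example data from the question and stores the result as a column. The data frame construction and the helper name flag_near_duplicates are assumptions for illustration, not part of the original answer. Using drop = FALSE and any() are small substitutions that should give the same result as the call above, and the sketch assumes at least two columns and two rows.

```r
# Example data copied from the question.
dat <- data.frame(
  Postcode = c(76, 75, 75, 75, 76),
  nrooms   = c(1, 5, 5, 2, 1),
  price    = c(259, 380, 400, 450, 259),
  sqm      = c(30, 120, 120, 80, 30)
)

# Hypothetical helper: returns 1 if a row matches another row on all
# columns or on all columns but one, and 0 otherwise.
flag_near_duplicates <- function(df) {
  # One column of indices per way of leaving out a single column.
  col_sets <- combn(ncol(df), ncol(df) - 1)
  # Logical matrix: rows of df by column sets.
  flagged <- apply(col_sets, 2, function(cc) {
    sub <- df[, cc, drop = FALSE]
    duplicated(sub) | duplicated(sub, fromLast = TRUE)
  })
  # A row is flagged if it is duplicated on at least one column set.
  as.integer(apply(flagged, 1, any))
}

dat$dummy <- flag_near_duplicates(dat)
```

With the example data, dat$dummy should come out as 1 1 1 0 1, matching the table in the question.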

As always with a loop inside a loop, it can be helpful to step through each part. Inspecting the output of combn(ncol(dat), ncol(dat) - 1), and then of the inner apply, shows how the final result is built up.
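
The step-by-step inspection could look like the sketch below. It assumes dat holds the four columns from the question (rebuilt here so the snippet runs on its own). The names col_sets and flagged are illustrative.

```r
# Example data copied from the question.
dat <- data.frame(
  Postcode = c(76, 75, 75, 75, 76),
  nrooms   = c(1, 5, 5, 2, 1),
  price    = c(259, 380, 400, 450, 259),
  sqm      = c(30, 120, 120, 80, 30)
)

# Step 1: every way of picking 3 of the 4 columns, one choice per matrix column.
col_sets <- combn(ncol(dat), ncol(dat) - 1)
col_sets
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    1    2
# [2,]    2    2    3    3
# [3,]    3    4    4    4

# Step 2: the inner apply gives one logical column per column set,
# with one entry per row of dat.
flagged <- apply(col_sets, 2, function(cc)
  duplicated(dat[, cc]) | duplicated(dat[, cc], fromLast = TRUE))
flagged[, 2]  # column set 2 (Postcode, nrooms, sqm) leaves out price
# [1]  TRUE  TRUE  TRUE FALSE  TRUE

# Step 3: the outer apply collapses each row to 1 if any column set flagged it.
apply(flagged, 1, max)
# [1] 1 1 1 0 1
```

Only the column set that leaves out price flags rows 2 and 3, which is why they end up with a 1 even though their prices differ.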
