1

I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.

set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
      "1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
      "1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))

test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)

'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,

head(test)
  UNIT     DATE OUT1      JUNK1      JUNK2
1    1 1/1/2010  300 -1.2070657 -0.9111954
2    1 1/1/2010  400  0.2774292 -0.8371717
3    1 1/1/2010  400  1.0844412  2.4158352
4    1 1/2/2012  400 -2.3456977  0.1340882
5    2 1/2/2009  600  0.4291247 -0.4906859
6    2 1/2/2004  700  0.5060559 -0.4405479    

Observations 1 and 4 are not a duplicate in the dataset. Observations 2 and 3 are duplicates. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:

subset(test, !duplicated(c(UNIT,DATE,OUT1)))

Which unfortunately does not do the trick:

      UNIT     DATE OUT1       JUNK1      JUNK2
1        1 1/1/2010  300 -1.20706575 -0.9111954
5        2 1/2/2009  600  0.42912469 -0.4906859
8        3 1/2/2005  800 -0.54663186 -0.6937202
11       4 1/1/2011  900 -0.47719270 -1.0236557
14       5 1/1/2008  100  0.06445882  1.1022975
15       6 1/1/2012  100  0.95949406 -0.4755931

Although it does ignore the irrelevant variables (JUNK1, JUNK2) , the technique is too greedy. The new dataset should contain three observations on unit one because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?

hubert_farnsworth
  • 797
  • 2
  • 9
  • 21

3 Answers3

10

You can pass a data.frame to duplicated

In your case, you want to pass the first 3 columns of test

 test2 <- test[!duplicated(test[,1:3]),]

If you are using big data, and want to embrace data.tables, then you can set the key to be the first three columns (which you want to remove the duplicates from) and then use unique

library(data.table)
DT <- data.table(test)
# set the key
setkey(DT, UNIT,DATE,OUT1)
DTU <- unique(DT)

For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table

Community
  • 1
  • 1
mnel
  • 113,303
  • 27
  • 265
  • 254
  • 4
    Since there is a `bigdata` tag it might be worth mentioning this post on removing duplicates from a `data.table` - http://stackoverflow.com/questions/11792527/filtering-out-duplicated-non-unique-rows-in-data-table – thelatemail Mar 19 '13 at 01:07
0

Thanks! Looks like we can do:

test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]

and it delivers the goods as well. So, we can just use the column names rather than 1:3 and the order doesn't matter

hubert_farnsworth
  • 797
  • 2
  • 9
  • 21
  • This is no different from my answer, apart from noting how `[` works, which is irrelevant to the `duplicated` component – mnel Mar 19 '13 at 03:51
0

You can use distinct() from the dplyr package:

library(dplyr)
test %>%
  distinct(UNIT, DATE, OUT1)

Or without the %>% pipe:

distinct(test, UNIT, DATE, OUT1)
Sam Firke
  • 21,571
  • 9
  • 87
  • 105