removing duplicate units from data frame

Question

I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.

set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
      "1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
      "1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))

test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)

'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,

head(test)
  UNIT     DATE OUT1      JUNK1      JUNK2
1    1 1/1/2010  300 -1.2070657 -0.9111954
2    1 1/1/2010  400  0.2774292 -0.8371717
3    1 1/1/2010  400  1.0844412  2.4158352
4    1 1/2/2012  400 -2.3456977  0.1340882
5    2 1/2/2009  600  0.4291247 -0.4906859
6    2 1/2/2004  700  0.5060559 -0.4405479

Observations 1 and 4 are not a duplicate in the dataset. Observations 2 and 3 are duplicates. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:

subset(test, !duplicated(c(UNIT,DATE,OUT1)))

Which unfortunately does not do the trick:

      UNIT     DATE OUT1       JUNK1      JUNK2
1        1 1/1/2010  300 -1.20706575 -0.9111954
5        2 1/2/2009  600  0.42912469 -0.4906859
8        3 1/2/2005  800 -0.54663186 -0.6937202
11       4 1/1/2011  900 -0.47719270 -1.0236557
14       5 1/1/2008  100  0.06445882  1.1022975
15       6 1/1/2012  100  0.95949406 -0.4755931

Although it does ignore the irrelevant variables (JUNK1, JUNK2) , the technique is too greedy. The new dataset should contain three observations on unit one because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?

score 10 · Answer 1 · edited May 23 '17 at 10:28

10

You can pass a data.frame to duplicated

In your case, you want to pass the first 3 columns of test

 test2 <- test[!duplicated(test[,1:3]),]

If you are using big data, and want to embrace data.tables, then you can set the key to be the first three columns (which you want to remove the duplicates from) and then use unique

library(data.table)
DT <- data.table(test)
# set the key
setkey(DT, UNIT,DATE,OUT1)
DTU <- unique(DT)

For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table

edited May 23 '17 at 10:28

Community

1
1

answered Mar 19 '13 at 00:03

mnel

113,303
27
265
254

4

Since there is a `bigdata` tag it might be worth mentioning this post on removing duplicates from a `data.table` - http://stackoverflow.com/questions/11792527/filtering-out-duplicated-non-unique-rows-in-data-table – thelatemail Mar 19 '13 at 01:07

score 0 · Answer 2 · answered Mar 19 '13 at 01:35

0

Thanks! Looks like we can do:

test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]

and it delivers the goods as well. So, we can just use the column names rather than 1:3 and the order doesn't matter

answered Mar 19 '13 at 01:35

hubert_farnsworth

797
2
9
21

This is no different from my answer, apart from noting how `[` works, which is irrelevant to the `duplicated` component – mnel Mar 19 '13 at 03:51

score 0 · Answer 3 · answered May 29 '15 at 01:18

0

You can use distinct() from the dplyr package:

library(dplyr)
test %>%
  distinct(UNIT, DATE, OUT1)

Or without the %>% pipe:

distinct(test, UNIT, DATE, OUT1)

answered May 29 '15 at 01:18

Sam Firke

21,571
9
87
105

removing duplicate units from data frame

3 Answers3

Linked

Related