R: Checking if a set of variables forms a unique index

Question

I have a large data frame and I want to check whether a set of (factor) variables uniquely identifies each row of the data.

My current strategy is to aggregate by the variables that I think are the index variables:

dfAgg = aggregate(dfTemp$var1, by = list(dfTemp$var1, dfTemp$var2, dfTemp$var3), FUN = length)
stopifnot(sum(dfAgg$x > 1) == 0)

But this strategy takes forever. A more efficient method would be appreciated.

Thanks.

How "large" is large, number of rows and columns? – zx8754 Apr 03 '14 at 10:30

Arun · Accepted Answer · 2014-04-03T12:32:49.570

The data.table package provides very fast duplicated and unique methods for data.tables. Both take a by= argument where you name the columns on which the duplicated/unique results should be computed.

Here's an example on a large data.table:

require(data.table)
set.seed(45L)
## use setDT(dat) if your data is a data.frame, 
## to convert it to a data.table by reference
dat <- data.table(var1=sample(100, 1e7, TRUE), 
                 var2=sample(letters, 1e7, TRUE), 
                 var3=sample(as.numeric(sample(c(-100:100, NA), 1e7,TRUE))))

system.time(any(duplicated(dat)))
#  user  system elapsed
# 1.632   0.007   1.671

For comparison, the same check takes 25 seconds using anyDuplicated.data.frame.

# if you want to calculate based on just var1 and var2
system.time(any(duplicated(dat, by=c("var1", "var2"))))
#  user  system elapsed
# 0.492   0.001   0.495

For comparison, the same check takes 7.4 seconds using anyDuplicated.data.frame.
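To turn this into the yes/no check the question asks for, something like the following sketch may help. The toy data and the names key_cols, is_unique and dups are made up for illustration. It assumes data.table is installed and uses anyDuplicated with by=, which returns 0 when no row is duplicated on those columns.

```r
library(data.table)

# Toy data (hypothetical): var1 + var2 do NOT identify rows uniquely,
# because the combination (2, "a") appears twice.
df <- data.frame(var1 = c(1, 1, 2, 2),
                 var2 = c("a", "b", "a", "a"),
                 var3 = c(10, 20, 30, 40))
setDT(df)  # convert to data.table by reference

key_cols <- c("var1", "var2")

# TRUE only if no combination of the key columns repeats
is_unique <- anyDuplicated(df, by = key_cols) == 0L

# List the offending combinations and how often each occurs
dups <- df[, .N, by = key_cols][N > 1L]
print(is_unique)
print(dups)
```

If is_unique is FALSE, dups shows which key combinations repeat, which is usually the next thing you want to know.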

Adding the `data.table` tag paid off. ;) Much faster, thanks. — tchakravarty, Apr 03 '14 at 12:05

score 2 · Answer 2 · answered Apr 03 '14 at 10:31

Perhaps anyDuplicated:

anyDuplicated(dfTemp[, c("var1", "var2", "var3")])

or using dplyr:

dfTemp %>% select(var1, var2, var3) %>% anyDuplicated()

This is still going to be wasteful though because anyDuplicated will first paste the columns into a character vector.
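A small base R illustration, using a made-up data frame rather than the asker's data: anyDuplicated returns 0 when no row is repeated and otherwise the index of the first repeated row, so comparing the result with 0 gives the uniqueness test.

```r
# Toy data (hypothetical), small enough to check by eye
dfTemp <- data.frame(var1 = c("a", "a", "b"),
                     var2 = c("x", "y", "x"),
                     var3 = c("p", "p", "p"))

# var1 + var2: every combination is distinct, so the result is 0
idx_ok <- anyDuplicated(dfTemp[, c("var1", "var2")])

# var1 + var3: row 2 repeats row 1, so the result is 2
idx_dup <- anyDuplicated(dfTemp[, c("var1", "var3")])

# Fail loudly if the intended index is not unique
stopifnot(idx_ok == 0L)
print(idx_dup)
```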

score 0 · Answer 3 · answered Apr 03 '14 at 10:27

How about:

length(unique(paste(dfTemp$var1, dfTemp$var2, dfTemp$var3)))==nrow(dfTemp)

This pastes the variables into one string per row, takes the unique strings, and compares the length of that vector with the number of rows in your data frame.
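One caveat worth checking: with the default separator (a space), two different rows can paste to the same string if the values themselves contain spaces. The sketch below uses made-up data to show this, together with one possible workaround: a separator that is assumed not to occur in the data ("\r" here is only an illustrative choice).

```r
# Toy data (hypothetical): two distinct rows that collide when pasted with " "
dfTemp <- data.frame(var1 = c("a b", "a"),
                     var2 = c("c", "b c"),
                     stringsAsFactors = FALSE)

# Both rows become "a b c", so this wrongly reports a duplicate
naive <- length(unique(paste(dfTemp$var1, dfTemp$var2))) == nrow(dfTemp)

# A separator assumed absent from the data keeps the rows apart
safer <- length(unique(paste(dfTemp$var1, dfTemp$var2, sep = "\r"))) == nrow(dfTemp)

print(c(naive = naive, safer = safer))
```

Comparing rows directly, for example with anyDuplicated on the selected columns, avoids having to pick a separator at all.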

zx8754
