Original question: My job requires me to frequently profile large data sets - cardinality, relationships, uniqueness, and so on. The intent is to use R for the profiling and to produce a report in R Markdown.
My problem: (1) the data is loaded into R data frames; (2) how do I identify what would be called a composite primary key in database lingo?
For example, I have one table with 75,000 records. I use rapply to get the unique count for each variable. Unless one of the variables has a count of 75,000, there is no single primary key - that is, no one variable uniquely identifies each observation.
The objective then becomes finding a combination of variables (columns) that uniquely identifies each observation (row). This could be two, three, or four columns out of a 160-column data frame/table. And, of course, there could always be duplicate rows, in which case no combination of columns uniquely identifies each observation.
I have a working for loop (ugly - a rough sketch of it appears further down), but I suspect there is a more elegant, more efficient way of doing this.
How do I find which variables constitute the composite primary key?
Modified question:
############### Data1 - 2 columns - one PK
data1 <- data.frame(rep_len(1, length.out = 10))
data1$PK <- rep_len(1:10, length.out = 10)
names(data1) <- c('DupData', 'PK')
rownames(data1) <- NULL
rapply(data1,function(x)length(unique(x)), how = 'unlist')
DupData PK
1 10
length(unique(data1$PK))
[1] 10
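To make the check I keep repeating explicit: a set of columns is a candidate key when the number of unique rows over those columns equals the total row count. A tiny helper capturing that (the name is just my own):
# TRUE when the given columns uniquely identify every row of df
is_key <- function(df, cols) {
  nrow(unique(df[, cols, drop = FALSE])) == nrow(df)
}
is_key(data1, 'PK')        # TRUE  - PK alone identifies every row
is_key(data1, 'DupData')   # FALSE - only one distinct value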
Next is a data frame with 3 columns, where two columns are required to uniquely identify an observation:
############### Data2 - 3 columns - Two column composite PK
data2 <- data1
data2$PK <- rep_len(1:2, length.out = 10)
data2$PK2 <- rep_len(2:6, length.out = 10)
rapply(data2,function(x)length(unique(x)), how = 'unlist')
DupData PK PK2
1 2 5
length(unique(data2$DupData))
[1] 1
length(unique(data2$PK))
[1] 2
length(unique(data2$PK2))
[1] 5
nrow(unique(data2[,c(1,2)], nmax = 3))
[1] 2
nrow(unique(data2[,c(1,3)], nmax = 3))
[1] 5
nrow(unique(data2[,c(2,3)], nmax = 3))
[1] 10
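Rather than typing out each pair by hand, combn can enumerate the column pairs and run the same uniqueness check on each (just a sketch of the idea):
# every 2-column combination of data2, tested for uniqueness
pairs <- combn(names(data2), 2, simplify = FALSE)
names(pairs) <- sapply(pairs, paste, collapse = ' + ')
sapply(pairs, function(cols) nrow(unique(data2[, cols])) == nrow(data2))
# DupData + PK  DupData + PK2       PK + PK2
#        FALSE          FALSE           TRUE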
Lastly, there is a data frame with 4 columns/variables, where three columns are required to uniquely identify an observation:
############### Data3 - 4 columns - Three column composite PK
data3 <- data1
data3$PK <- c(0,0,0,0,0,0,0,0,1,1)
data3$PK2 <- c(0,0,1,1,1,2,2,2,0,0)
data3$PK3 <- c(1,2,0,1,2,0,1,2,0,1)
rapply(data3,function(x)length(unique(x)), how = 'unlist')
DupData PK PK2 PK3
1 2 3 3
length(unique(data3$DupData))
[1] 1
length(unique(data3$PK))
[1] 2
length(unique(data3$PK2))
[1] 3
length(unique(data3$PK3))
[1] 3
nrow(unique(data3[,c(1,2)], nmax = 4))
[1] 2
nrow(unique(data3[,c(1,3)], nmax = 4))
[1] 3
nrow(unique(data3[,c(1,4)], nmax = 4))
[1] 3
nrow(unique(data3[,c(1:4)], nmax = 4))
[1] 10
nrow(unique(data3[,c(2,3)], nmax = 4))
[1] 4
nrow(unique(data3[,c(2,4)], nmax = 4))
[1] 5
nrow(unique(data3[,c(3,4)], nmax = 4))
[1] 9
nrow(unique(data3[,c(2:4)], nmax = 4))
[1] 10
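For reference, my current working (but ugly) loop is roughly along these lines - it checks every combination of a given size, growing the size until a key turns up:
n_rows <- nrow(data3)
keys <- list()
for (k in 2:ncol(data3)) {           # single columns already checked with rapply above
  for (cols in combn(names(data3), k, simplify = FALSE)) {
    if (nrow(unique(data3[, cols, drop = FALSE])) == n_rows) {
      keys[[length(keys) + 1]] <- cols
    }
  }
  if (length(keys) > 0) break        # stop at the smallest key size found
}
keys                                 # list(c('PK', 'PK2', 'PK3')) for data3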
The question: is there a simple, elegant way of determining which combination of columns uniquely identifies each record, without writing an endless for loop?
If there is not, what is the best way to write the for loop in R so that it reports every combination of columns whose combined unique row count equals the row count of the entire data frame?
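Something along the lines of the sketch below is what I was imagining, but I do not know whether there is a better or built-in way (find_keys is a name I made up; with 160 columns I would cap the combination size at 3 or 4, per the sizes mentioned above):
# report every column combination, up to max_size columns, whose unique
# row count equals the row count of the whole data frame (sketch only)
find_keys <- function(df, max_size = ncol(df)) {
  all_cols   <- names(df)
  candidates <- unlist(lapply(seq_len(max_size),
                              function(k) combn(all_cols, k, simplify = FALSE)),
                       recursive = FALSE)
  Filter(function(cols) nrow(unique(df[, cols, drop = FALSE])) == nrow(df),
         candidates)
}
find_keys(data3)                 # c('PK','PK2','PK3') and the full 4-column set
find_keys(data3, max_size = 3)   # limit the search to combinations of up to 3 columns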
Hopefully, this is clearer than mud, and this is a simple problem for someone.
Thanks for your help!