6

Original Question: My job requires me to frequently profile large data sets (cardinality, relationships, uniqueness, and so on). The intent is to use R to profile the data and create a report in R Markdown.

My problem is: (1) the data is loaded into R data frames; (2) how do I identify what would be called a composite primary key in database lingo?

For example, I have one table with 75,000 records. I use `rapply` to get the unique counts for each variable. However, unless one of the variables has a count of 75,000, there is no single primary key. In other words, there is no one variable that can be used to uniquely identify each observation.

The objective then becomes finding a combination of variables (columns) that uniquely identifies each observation (row). This could be two, three, or four variables out of a 160-variable data frame/table. And, of course, there could always be duplicate rows, in which case no combination of columns uniquely identifies each observation.

I have successfully written a `for` loop (ugly), but I suspect there is a more elegant, efficient way of doing this.

How do I find which variables constitute the COMPOSITE PRIMARY KEY?

Modified question:

    ############### Data1 - 2 columns - one PK
    data1 <- data.frame(rep_len(1, length.out = 10))
    data1$PK <- rep_len(1:10, length.out = 10)

    names(data1) <- c('DupData', 'PK')
    rownames(data1) <- NULL

    rapply(data1, function(x) length(unique(x)), how = 'unlist')
    DupData      PK 
          1      10 

    length(unique(data1$PK))
    [1] 10
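Generalizing the check above: any column whose distinct-value count equals the row count is a single-column primary key candidate. A minimal sketch (rebuilding `data1` so it runs standalone):

```r
data1 <- data.frame(
  DupData = rep_len(1, length.out = 10),
  PK      = rep_len(1:10, length.out = 10)
)

# Columns whose distinct-value count equals the row count
single_pks <- names(which(
  sapply(data1, function(x) length(unique(x))) == nrow(data1)
))
single_pks
# [1] "PK"
```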

Next is a data frame with 3 columns, but two columns are required to make a unique observation:

    ############### Data2 - 3 columns - Two column composite PK
    data2 <- data1
    data2$PK <- rep_len(1:2, length.out = 10)
    data2$PK2 <- rep_len(2:6, length.out = 10)
    rapply(data2, function(x) length(unique(x)), how = 'unlist')
    DupData      PK     PK2 
          1       2       5 

    length(unique(data2$DupData))
    [1] 1
    length(unique(data2$PK))
    [1] 2
    length(unique(data2$PK2))
    [1] 5
    nrow(unique(data2[, c(1, 2)], nmax = 3))
    [1] 2
    nrow(unique(data2[, c(1, 3)], nmax = 3))
    [1] 5
    nrow(unique(data2[, c(2, 3)], nmax = 3))
    [1] 10
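Instead of spelling out each pair of columns by hand, `combn` can enumerate the column pairs and `apply` can count the unique rows for each. A small sketch (rebuilding `data2` so it is self-contained):

```r
data2 <- data.frame(
  DupData = rep_len(1, length.out = 10),
  PK      = rep_len(1:2, length.out = 10),
  PK2     = rep_len(2:6, length.out = 10)
)

# Unique-row count for every two-column combination
pairs  <- combn(names(data2), 2)
counts <- apply(pairs, 2, function(cols) nrow(unique(data2[, cols])))
setNames(counts, apply(pairs, 2, paste, collapse = "+"))
# The counts are 2, 5, and 10 for DupData+PK, DupData+PK2, and PK+PK2
```

Only `PK+PK2` reaches the row count of 10, matching the hand-written checks above.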

Lastly, there is one data frame with 4 columns/variables, and it requires three columns to make a unique observation:

    ############### Data3 - 4 columns - Three column composite PK
    data3 <- data1
    data3$PK <- c(0,0,0,0,0,0,0,0,1,1)
    data3$PK2 <- c(0,0,1,1,1,2,2,2,0,0)
    data3$PK3 <- c(1,2,0,1,2,0,1,2,0,1)
    rapply(data3, function(x) length(unique(x)), how = 'unlist')
    DupData      PK     PK2     PK3 
          1       2       3       3 

    length(unique(data3$DupData))
    [1] 1
    length(unique(data3$PK))
    [1] 2
    length(unique(data3$PK2))
    [1] 3
    length(unique(data3$PK3))
    [1] 3
    nrow(unique(data3[, c(1, 2)], nmax = 4))
    [1] 2
    nrow(unique(data3[, c(1, 3)], nmax = 4))
    [1] 3
    nrow(unique(data3[, c(1, 4)], nmax = 4))
    [1] 3
    nrow(unique(data3[, c(1:4)], nmax = 4))
    [1] 10
    nrow(unique(data3[, c(2, 3)], nmax = 4))
    [1] 4
    nrow(unique(data3[, c(2, 4)], nmax = 4))
    [1] 5
    nrow(unique(data3[, c(3, 4)], nmax = 4))
    [1] 9
    nrow(unique(data3[, c(2:4)], nmax = 4))
    [1] 10

The question is: is there a way of determining which columns, combined, constitute a unique instance of a record in a simple, elegant way, without writing an endless for loop?

If there is not, what is the best way to write the loop in R so that it reports every combination of columns whose combined unique-row count equals the row count of the entire data frame?
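One way to avoid hand-writing the nested loops is to let `combn` generate the column combinations of increasing size and stop at the first one with no duplicated rows. This is only a sketch (the helper name `find_composite_pk` is mine, not from any package), and on a 160-column table you would want to cap `max_cols` at 3 or 4, since the number of combinations grows combinatorially:

```r
find_composite_pk <- function(df, max_cols = ncol(df)) {
  for (k in seq_len(max_cols)) {
    combos <- combn(names(df), k, simplify = FALSE)
    for (cols in combos) {
      # anyDuplicated() short-circuits on the first duplicated row,
      # so it is typically faster than nrow(unique(...)) == nrow(df)
      if (anyDuplicated(df[, cols, drop = FALSE]) == 0L) {
        return(cols)
      }
    }
  }
  NULL  # no combination of columns uniquely identifies the rows
}

data3 <- data.frame(
  DupData = rep_len(1, length.out = 10),
  PK  = c(0,0,0,0,0,0,0,0,1,1),
  PK2 = c(0,0,1,1,1,2,2,2,0,0),
  PK3 = c(1,2,0,1,2,0,1,2,0,1)
)
find_composite_pk(data3)
# [1] "PK"  "PK2" "PK3"
```

To list every qualifying combination rather than the first minimal one, accumulate the matches instead of returning early.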

Hopefully, this is clearer than mud, and this is a simple problem for someone.

Thanks for your help!

Shanemeister
  • Welcome to SO. Please read up on [ask] and how to create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and edit your question accordingly. – Heroka Sep 30 '15 at 11:51
  • It's not very clear what you are asking. There can likely be many combinations of columns that could be composite keys. Which combination would you choose in such a case? You can test whether no composite key exists through `nrow(unique(df)) == nrow(df)`. If the command gives `FALSE`, there is no key. – nicola Sep 30 '15 at 11:54
  • Hopefully the addition of sample data makes it clearer what I am looking for. I understand there can be many combinations that uniquely identify a row, but in most instances of a tidy data set, you would have a set number of columns that identify the row/observation. For example, a PERSON table could have one surrogate key, but to identify where that person lived, you would likely have a location table mapped to the person (two keys), and if you wanted history (when they moved into, out and back to that address) you might also add a date field (three keys). Hope I am being clear. – Shanemeister Oct 21 '15 at 21:03
  • Did you ever come up with a convenient solution to this? – Jean V. Adams Dec 01 '16 at 00:44
  • I have the same question. Did you ever come up with a convenient solution to this? – Geet Jun 16 '17 at 18:46
  • No, I have not found a convenient solution. I believe the answer is that the number of possible column combinations soon becomes too large -- 2^n − 1 for n columns. – Shanemeister Oct 09 '19 at 19:35

2 Answers

0

Unfortunately, no. There are several ways to identify the primary key, whether a single or composite key. However, if there are 10 columns, in theory all 10 could be required to make a unique key. That means checking each single column, then every pair (column 1 with column 2, column 1 with column 3, and so on), then every triple, up through all 10 columns together. With n columns there are 2^n − 1 non-empty column subsets to test, so for 10 columns that is 1,023 combinations, and the count doubles with every added column. In reality, 2-4 columns is usually the maximum size of a composite key, but even capped at that, the number of checks to verify can be large. Just my opinion, but it boils down to the modeler knowing the data and verifying assumptions. If you find a better answer, please let us know.
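The size of the search space is easy to tally with `choose()`; this is only an illustration of the growth, not a key-finding routine:

```r
n <- 10
# Number of k-column combinations to test, for k = 1..n
sapply(1:n, function(k) choose(n, k))
# [1]  10  45 120 210 252 210 120  45  10   1
sum(choose(n, 1:n))  # total non-empty subsets: 2^10 - 1 = 1023
# Capping the search at 4 columns keeps the work manageable:
sum(choose(n, 1:4))  # 385
```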

Shanemeister
-1

Typically, you should be TOLD which key is primary by the people who built the DB or the schema. It is not a data-driven exercise.

If you are the one who is essentially starting to build out the schema, then the only approach is trial and error. I am from the health care industry; for example, a claim number is often the top suspect for a primary key, but what if you have three similar claim numbers in the file?

It is often a domain thing. For example, if the match rate between two tables should be >85%, and there is only one key on both tables (their names can be very different) that gives you that, then that is the one.

A further complication is whether the match is one-to-one, one-to-many, or many-to-many.

opt135
  • Please don't take this the wrong way, but the question I think implies that I am doing the database design work, or in some cases reverse engineering a poorly designed database and applying modifications to the schema. As for the 85% common on joining two tables, that is absolutely incorrect. A foreign key can have 1 match, or a 100% match, especially if it is an identifying foreign key relationship. – Shanemeister Apr 07 '18 at 16:01