0

I'm trying to identify 'unique' and 'near unique' cases or records from a dataset for a disclosure control project. Particularly combinations on certain variables that only appear once, twice etc.

The records appear in:

table(Age,Sex,Ethnicity)

I am most interested in the elements (which are TRUE) of:

table(Age,Sex,Ethnicity)==1 
table(Age,Sex,Ethnicity)==2 

I know there are roughly 150 cases I am looking for from:

sum(table(Age,Sex,Ethnicity)==1)

there is also an identifier in the dataset which be a nice output or a number from 1:length(Age)*length(Sex)*length(Ethnicity) would be just as good. I was hoping it return a list something like:

[1] 103 207 218....   
[41] * * *
[81] * * *

where 'identifier' = 103, 207 and 218 for the first 3 of 150 cases where:

table(Age,Sex,Ethnicity)==1

I was naively hoping that something like:

data$identifier[table(age,sex,mar,emp,edu) == 1]    
names(table(Age,Sex,Ethnicity) 

would work but no such luck. I've also looked into unique() but that returns every combination (that occurs once or more). Any help or input would be much appreciated.

Added reproducible example (hopefully) Example

set.seed(1234)
a <- 1+rpois(100,1)
b <- 1+rpois(100,1)
c <- 1+rpois(100,1)
a[a >= 5] <- 4
b[b >= 5] <- 4
c[c >= 5] <- 4
eg <- cbind(1:100,a,b,c)
(sum(table(a,b,c)==1))

should have 12 'unique' combinations, which I would like to identify using the first column of eg (or the identifier from the dataset)

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Iain
  • 25
  • 5
  • 2
    Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – akrun Jul 13 '15 at 17:21
  • In your "reproducible example", I am not sure how you got 12 "unique combination". Try this hack: 'length(unique(paste0(a,b,c)))' gives 37 combinations. Do 'sort(table(paste0(a,b,c)))' will show you the number of time it appear starting from least to most common. – fishtank Jul 13 '15 at 18:09

1 Answers1

0

I think the easiest way to do this is using the data.table package:

library(data.table)
eg.dt <- as.data.table(eg)
eg.dt[, list(N=.N), by=.(a,b,c)][N==1]

How this works: eg.dt[, list(N=.N), by=.(a,b,c)] counts up the number of occurences of each (a,b,c) combination. The [N==1] filters out those that occur precisely once.

Or if you want to stick with dataframes (not data.table) try plyr:

library(plyr)
eg <- data.frame(eg)
subset(ddply(eg, .(a, b, c), nrow), V1 == 1)

This works in the same way: the ddply(eg, .(a, b, c), nrow) makes a dataframe with a column "V1" being the number of times that combination occurs; then you just subset it for combinations that occur once only.

I think there might be a way to do it with your table(a,b,c) but I can't think of one that isn't convoluted.

mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
  • How about `eg <- cbind(1:100,a,b,c); rownames(eg)<-paste0(a,b,c)` then `eg[names(which(table(rownames(eg))==1)),]` will show you the row index and the combination that appears uniquely. – fishtank Jul 14 '15 at 17:02
  • @user2540309 because this fails if you have e.g. `(a,b,c) = (12, 3, 4)` and `(a,b,c) = (1, 23, 4)`, so you have to mess around with pasting witih a separator that will not appear in the a, b, or c (which is what `match` does I think). But it's not particularly elegant - all that pasting and conversion to and from string simply to do a lookup that could be achieved quicker with plyr/data.table/etc – mathematical.coffee Jul 14 '15 at 23:16
  • @user2540309 Thanks both of you for your help. I ended up using [names(which...] and paste0() which was quite similar to what I had in mind. Only one of my variables is double digits (and only up to 15 ) so separators were not an issue – Iain Jul 28 '15 at 15:48