
I often conduct research on human participants. For various reasons my preliminary identifier is sometimes a composite of information that reduces anonymity in the data (e.g., I might concatenate a string that includes the date and time of completion, the IP address, and some information supplied by the participant).

Thus, if the data is to be shared in some form, a cleansed ID stripped of such information needs to be created from the preliminary ID. A simple approach in R is just to assign consecutive numbers (e.g., `df$id <- seq(nrow(df))`, where `df` is the data.frame). However, if more data is collected in the initial phase of research, or the rows are resorted, this causes problems: the cleansed ID assigned to a given participant may change each time the raw dataset is updated. This in turn can break subsequent analyses on the cleansed dataset that, for example, filter cases based on the cleansed ID.
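To illustrate the instability with some hypothetical data (the `raw_id` values here are made up):

```r
# Hypothetical example: sequential IDs are not stable across updates.
df <- data.frame(raw_id = c("alice_2013", "bob_2013", "carol_2013"),
                 stringsAsFactors = FALSE)
df$id <- seq(nrow(df))     # alice gets 1, bob gets 2, carol gets 3

# If the raw data is later resorted (or new rows arrive), the same
# participant receives a different cleansed ID.
df2 <- df[order(df$raw_id, decreasing = TRUE), ]
df2$id <- seq(nrow(df2))   # carol now gets 1 and alice gets 3
```

Any downstream analysis that filtered on `id == 1` would now silently refer to a different participant.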

Thus, I thought about creating a hash using the digest function in the digest package.

df$id <- sapply(df$raw_id, digest)

This would seem to provide a reliable mapping from raw identifier to cleansed identifier, while making it impossible to recover the raw identifier from the cleansed identifier alone.

However, given that I am new to both the digest function and hashing in general, I wanted to ask:

  • Is digest suitable for stripping IDs of identifying information?
  • Are there any issues in using digest for this purpose?
Jeromy Anglim
  • The problem with a hash is that it's not collision-proof... It all depends on the size of your dataset and the damage which will be caused if you do get a conflict. – Simon MᶜKenzie Apr 08 '13 at 05:02
  • I take your point. I first tested it on an actual sample of 400 ID strings and had no collisions. I then ran `sum(duplicated(sapply(rnorm(100000), digest)))` and obtained no duplicates. Thus, I'm thinking for my applications (i.e., academic research; fairly small samples in the hundreds or perhaps thousands i.e., not large automated organisational databases) collisions are sufficiently unlikely to not be an issue. But correct me if I'm wrong. – Jeromy Anglim Apr 08 '13 at 05:47
  • From a 5-minute look (and very basic knowledge of hashes), you could make it more collision-resistant by using SHA-1, which has a longer output: `sapply(df$raw_id, digest, algo="sha1")`. You can also insert a check after hashing with `with(df, length(unique(id)) == length(unique(raw_id)))`. – Blue Magister Apr 08 '13 at 06:08
  • Please keep in mind that hash functions do not encrypt data. Rather, they just substitute one set of bits for a mathematically related other set. So if one knows the components that are hashed and the function used, they can re-create the hash. If I know that someone participated in your study, and I know what personal info you hashed (say, "firstname.lastname"), then I can recreate the hash and see what data you obtained on that person. – BenBarnes Apr 08 '13 at 06:12
  • Thanks Ben. That's a good point. I imagine for most of my applications the format would not be known by others, and specific data, like the time of starting the study down to the second, would not be known. That said, I wonder what would be a simple way of encrypting the hash in R? I guess I could just add a password constant to the raw ID that I keep secret. – Jeromy Anglim Apr 08 '13 at 06:37
  • @BenBarnes If OP is using date/time down to the second, the hash input would be much harder to guess. Additionally, OP could add in some random characters to the hash input to make it near-unguessable. – Blue Magister Apr 08 '13 at 06:40
  • Wow, according to [Wikipedia](https://en.wikipedia.org/wiki/SHA-1), SHA-1 theoretically has hash collisions but none have been found. So you should be fine. [Related question](http://stackoverflow.com/questions/5806308/how-do-i-encrypt-data-in-r) about asymmetric encryption suggests that `digest` is viable for your purposes. – Blue Magister Apr 08 '13 at 06:42
  • @BlueMagister, good suggestion with the random characters. That's a [salt](http://en.wikipedia.org/wiki/Salt_%28cryptography%29). – BenBarnes Apr 08 '13 at 06:57

1 Answer


I have learnt many helpful things from the comments above. This answer aims to distill these comments.

There are two issues with hashing for the purpose of anonymising research participant identifiers:

  • Duplicate IDs: This seems to be a theoretical rather than a practical issue (especially if you use sha1), but I'm happy to be corrected on this.
  • Lack of anonymity: If you know the hashing algorithm, the ID format, and the exact information making up the ID, then you can work out which participant matches a given hash. In many cases the format is not shared, participant information is not known, or the ID uses information that is virtually unknowable, so this is not really an issue. Nonetheless, adding some secret password text (i.e., a salt) to the ID is a simple way of preventing it from becoming one.
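To make the second point concrete, here is a small sketch (the participant "jane.doe" and the salt "secret-salt" are hypothetical placeholders) showing that an unsalted hash of guessable input can simply be recomputed, whereas a salted hash cannot be matched without the secret:

```r
library(digest)

# Suppose an unsalted hash of a guessable raw ID is published.
published_id <- digest("jane.doe", algo = "sha1")

# An attacker who guesses the format can recreate it exactly,
# re-identifying the participant.
guess <- digest("jane.doe", algo = "sha1")
identical(guess, published_id)   # TRUE: participant re-identified

# With a secret salt appended, the attacker's guess no longer matches.
salted_id <- digest(paste("jane.doe", "secret-salt"), algo = "sha1")
identical(guess, salted_id)      # FALSE
```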

Thus, to summarise the recommendations I've gathered:

library(digest)

# Convert raw IDs into anonymous IDs: append a secret salt, then sha1-hash.
hashed_id <- function(x, salt) {
    y <- paste(x, salt)
    y <- sapply(y, function(X) digest(X, algo="sha1"))
    as.character(y)  # drop the names that sapply attaches
}

mydata$id <- hashed_id(mydata$raw_id, "somesalt1234")
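As a sanity check (suggested in the comments above), you can verify after hashing that no collisions occurred in your dataset. The sketch below uses hypothetical data and repeats the `hashed_id()` definition so it is self-contained:

```r
library(digest)

# Same helper as above, repeated so this snippet runs on its own.
hashed_id <- function(x, salt) {
    y <- paste(x, salt)
    y <- sapply(y, function(X) digest(X, algo="sha1"))
    as.character(y)
}

# Hypothetical raw IDs; "somesalt1234" is a placeholder, not a recommendation.
mydata <- data.frame(raw_id = c("a_2013", "b_2013", "c_2013"),
                     stringsAsFactors = FALSE)
mydata$id <- hashed_id(mydata$raw_id, "somesalt1234")

# Every distinct raw ID should map to a distinct hashed ID.
with(mydata, length(unique(id)) == length(unique(raw_id)))  # TRUE
```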
Jeromy Anglim
  • Yes, that's a good answer. My only note is that the length of `password` (roughly a cryptographic "salt") is basically irrelevant, as long as it's not super-short, or otherwise guessable without reference to the source code. – Harlan Apr 08 '13 at 11:29
  • @Harlan Okay, converted it to a function and removed that point. – Jeromy Anglim Nov 04 '17 at 05:35