I fear that code that runs today could fail in the future. I've seen this with tidyverse functions that ran well but after some time returned an error because they had become defunct. As a reproducible example, try this piece of code from "How to make a great R reproducible example", which ironically is not reproducible anymore (compare the values of age and x to the original post):
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n,
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
  id       date group age   type           x
1  1 2020-12-26     A  29 type 1  0.63286260
2  2 2020-12-27     B  30 type 2  0.40426832
3  3 2020-12-28     A  21 type 3 -0.10612452
4  4 2020-12-29     B  28 type 4  1.51152200
5  5 2020-12-30     A  26 type 5 -0.09465904
6  6 2020-12-31     B  24 type 6  2.01842371
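A minimal sketch (not from the original post) of how one could check that seeded sampling is repeatable within a single R installation, and record the settings that a seed depends on. The variable names are illustrative. One known source of differences across versions: the default sample.kind changed from "Rounding" to "Rejection" in R 3.6.0, so seeded sample() output can differ between R versions before and after that release. Whether that explains the difference to the original post is not something this sketch can show.

```r
# Seeded draws are repeatable within one R installation and RNG configuration.
set.seed(42)
first_draw <- sample(18:30, 6, replace = TRUE)

set.seed(42)
second_draw <- sample(18:30, 6, replace = TRUE)

# TRUE as long as nothing about the RNG changed between the two draws.
same_draw <- identical(first_draw, second_draw)

# Record the RNG settings and R version next to any seeded result;
# together with the seed they determine what is drawn.
rng_settings <- RNGkind()         # character vector of the RNG kinds in use
r_version    <- R.version.string

print(same_draw)
print(rng_settings)
print(r_version)
```

Storing rng_settings and r_version together with the output makes it possible to tell later whether a changed result comes from a changed environment.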
Question
Is it only after updates that the very same code can return a different output? In other words: packages and R itself usually do not update automatically, so does that mean I can rerun a function for an "eternity" as long as I do not update anything manually? Are there any exceptions?
Why I ask
I encrypt sensitive data for my company using the bcrypt package in R. We need to encrypt data and delete the original data. Once this is done there is no way back, i.e. I really have to trust the code. I use no packages other than bcrypt, shiny and shinydashboard.
Edit
My question assumes that the code is run on the same system, without changes to global settings (edit after a comment from @qdread) and without changes to the R version.
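The assumption of an unchanged environment can be made explicit in code. The following is only a sketch: check_version() is a hypothetical helper, not part of any package, and the expected versions are placeholders for values one would record when the data are first processed.

```r
# Hypothetical helper: stop if a recorded version differs from the current one.
check_version <- function(current, expected, what) {
  current  <- as.character(current)
  expected <- as.character(expected)
  if (!identical(current, expected)) {
    stop(sprintf("%s version changed: expected %s, found %s",
                 what, expected, current))
  }
  invisible(TRUE)
}

# Placeholder: in practice this would be a fixed string saved at first use,
# not the value read from the running session.
expected_r_version <- as.character(getRversion())

check_version(getRversion(), expected_r_version, "R")

# The same check for a package (expected_bcrypt_version is an assumed,
# previously recorded value):
# check_version(packageVersion("bcrypt"), expected_bcrypt_version, "bcrypt")
```

Running such a check at the start of a script turns a silent environment change into an immediate error.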
What I do in detail: I work with patient data. First, I choose a random ID consisting of letters and numbers for every patient, e.g. A72CV for Max Cooper 1987-05-03. In the next step I use bcrypt to create a salt for every patient, and then I create hashed/encrypted versions of the IDs using the salts (salt + ID = encrypted ID). So every patient has name + birthdate, a random letters/numbers ID, a salt (generated using salt <- bcrypt::gensalt(log_rounds = 12)) and the encrypted ID (generated using id_encrypted <- bcrypt::hashpw(id, salt = salt)).

I save the data in three separate files: (i) patient data, i.e. name, birthdate and encrypted ID, (ii) IDs and salts and (iii) the actual database with IDs and a number of variables of interest, e.g. smoker, weight, ... This approach is recommended by some institutions in the context where I work and is called pseudonymisation (a reversible encryption). It ensures that even if there is a data leak, there is no obvious connection between the identifying variables name + birthdate and the variables of interest (smoker, ...).

I made a Shiny app that allows my co-workers to (1) provide an ID and look up name + birthdate, (2) provide name + birthdate and look up the ID and (3) generate an ID for a new patient. This all works because the same ID with the same salt results in the same encrypted (hashed) ID, at least for now. But if in the future, for some reason, the same input (e.g. ID) does not return the same output (e.g. name + birthdate), I am totally screwed. On the other hand, it is not a big problem if the generation of the random IDs changes over time, because each ID is created and saved just once, i.e. this process does not have to be reproducible.

The described encryption method will be applied to a few databases that took my institution many years to collect. If we cannot recreate the data, all is lost. That is why code stability is so important to me. I will install the Shiny app on the Windows computers of my colleagues. They will just hit Run App inside R and then use one of the options described above (1 to 3).
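The property the app relies on, that the same ID with the same salt gives the same hash, can be checked directly. This is a minimal sketch assuming the bcrypt package is installed. The ID is the illustrative one from the text, and a freshly generated salt stands in for a salt read from the stored file.

```r
library(bcrypt)

# Illustrative ID from the text; a real run would read IDs and salts from file (ii).
id   <- "A72CV"
salt <- gensalt(log_rounds = 12)   # generated once per patient, then stored

id_encrypted <- hashpw(id, salt = salt)

# Re-hashing the same ID with the stored salt should give the identical hash.
rehashed <- hashpw(id, salt = salt)
stopifnot(identical(id_encrypted, rehashed))

# checkpw() verifies an ID against a stored hash; the salt is embedded in the hash.
stopifnot(checkpw(id, id_encrypted))
```

Keeping one known (ID, salt, hash) triple and running a check like this when the app starts would flag a change in hashing behaviour before any lookup depends on it.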