Reproducibility: Failing to rerun code over time

Question

I fear that code which runs today could fail in the future. I've seen this with tidyverse functions that ran fine but later returned an error because they had become defunct. As a reproducible example, try this piece of code from How to make a great R reproducible example, which ironically is no longer reproducible (compare the values of age and x to the original post):

set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
  id       date group age   type           x
1  1 2020-12-26     A  29 type 1  0.63286260
2  2 2020-12-27     B  30 type 2  0.40426832
3  3 2020-12-28     A  21 type 3 -0.10612452
4  4 2020-12-29     B  28 type 4  1.51152200
5  5 2020-12-30     A  26 type 5 -0.09465904
6  6 2020-12-31     B  24 type 6  2.01842371

Question

Is it only after updates that the very same code can return a different output? In other words: packages and R itself usually do not update automatically, so does that mean I can rerun a function for an "eternity" as long as I do not update anything manually? Are there any exceptions?

Why I ask

I encrypt sensitive data for my company using the bcrypt package in R. We need to encrypt the data and delete the original. Once this is done there is no way back, i.e. I really have to trust the code. I use no packages other than bcrypt, shiny and shinydashboard.

Edit

My question assumes that the code is run on the same system, without changes to global settings (edit after comment from @qdread) and with no changes to the R version.
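As a minimal sketch of how these assumptions could be checked rather than trusted, the snippet below records the R version, the RNG settings and selected package versions, then compares them on a later run. The helper name `env_snapshot` and the temporary file path are hypothetical; for the app, the package list would presumably be `c("bcrypt", "shiny", "shinydashboard")`, and base packages are used here only so the example runs anywhere.

```r
# Hypothetical helper: capture the parts of the environment assumed to stay fixed.
env_snapshot <- function(pkgs = character()) {
  list(
    r_version = R.version.string,   # e.g. the full "R version x.y.z ..." string
    rng_kind  = RNGkind(),          # generator, normal method, sample method
    packages  = vapply(pkgs,
                       function(p) as.character(utils::packageVersion(p)),
                       character(1))
  )
}

# One-time setup: store the snapshot (placeholder path for illustration).
snap_file <- tempfile(fileext = ".rds")
snap <- env_snapshot(c("stats", "utils"))
saveRDS(snap, snap_file)

# On a later run: compare the current environment with the stored snapshot.
unchanged <- identical(readRDS(snap_file), env_snapshot(c("stats", "utils")))
if (!unchanged) warning("Environment differs from the stored snapshot.")
```

A check like this only detects a changed environment; it does not prevent the change.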

What I do in detail: I work with patient data. First, I choose a random ID consisting of letters and numbers for every patient, e.g. A72CV for Max Cooper 1987-05-03. Next, I use bcrypt to create a salt for every patient and then create a hashed/encrypted version of each ID using its salt (salt + ID = encrypted ID). So every patient has a name + birthdate, a random letters/numbers ID, a salt (generated using salt <- bcrypt::gensalt(log_rounds = 12)) and the encrypted ID (generated using id_encrypted <- bcrypt::hashpw(id, salt = salt)).

I save the data in three separate files: (i) patient data, i.e. name, birthdate and encrypted ID, (ii) IDs and salts and (iii) the actual database with IDs and a number of variables of interest, e.g. smoker, weight, ... This approach is recommended by some institutions in the context where I work and is called pseudonymisation (a reversible encryption). It ensures that even if there is a data leak, there is no obvious connection between the identifying variables (name + birthdate) and the variables of interest (smoker, ...).

I made a Shiny app that allows my co-workers to (1) provide an ID and look up name + birthdate, (2) provide name + birthdate and look up the ID and (3) generate an ID for a new patient. This all works because the same ID with the same salt results in the same encrypted (hashed) ID, at least for now. But if at some point the same input (e.g. an ID) no longer returns the same output (e.g. name + birthdate), I am totally screwed. On the other hand, it is not a big problem if the generation of the random IDs changes over time, because each ID is created and saved just once, i.e. this process does not have to be reproducible.

The described encryption method will be applied to a few databases that took my institution many years to collect. If we cannot recreate the data, all is lost. That is why code stability is so important to me.

I will install the Shiny app on my colleagues' Windows computers. They will just hit Run App inside R and then use one of the options described above (1 to 3).

[renv](https://rstudio.github.io/renv/articles/renv.html) may be helpful. — zephryl, Feb 15 '22 at 14:17
It seems like there are two different questions here. One is about the reproducibility of random number seed and one is about functions in tidyverse packages being not backwards compatible. Those are both legitimate questions but probably have different answers. The random seed might give different results based on your OS and what type of RNG you have set in your global options. The tidyverse issue is a whole different problem. — qdread, Feb 15 '22 at 14:18
@zephryl, `renv` will not fix things here, as this is dependent on the R version, not the packages used. — r2evans, Feb 15 '22 at 14:19
@Ben, I just ran this using `rocker/r-ver:3.5.2` and the OP's output of `age` is fixed but not `x` (as I said in my answer). I don't know ... I tried `RNGversion("1.7.0")` and got the same `x`, then `RNGversion("1.6.5")` and got a completely different `x`, so something else happened. — r2evans, Feb 15 '22 at 14:45
@Ben, but I think your suggestion to use a docker image might be one way forward for locking randomness and such into a constrained production environment such as this. — r2evans, Feb 15 '22 at 14:46

score 4 · Answer 1 · answered Feb 15 '22 at 14:42

(Partial answer.)

The default behavior of sample changed in R-3.6.0. Notably, NEWS-3 states the following under SIGNIFICANT USER-VISIBLE CHANGES for R-3.6.0:

The default method for generating from a discrete uniform distribution (used in sample(), for instance) has been changed. This addresses the fact, pointed out by Ottoboni and Stark, that the previous method made sample() noticeably non-uniform on large populations. See PR#17494 for a discussion. The previous method can be requested using RNGkind() or RNGversion() if necessary for reproduction of old results. Thanks to Duncan Murdoch for contributing the patch and Gabe Becker for further assistance.

We can recover the original age values by setting sample.kind = "Rounding":

RNGkind(sample.kind = "Rounding")
# Warning in RNGkind(sample.kind = "Rounding") :
#   non-uniform 'Rounding' sampler used

set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
#   id       date group age   type           x
# 1  1 2020-12-26     A  29 type 1  0.63286260
# 2  2 2020-12-27     B  30 type 2  0.40426832
# 3  3 2020-12-28     A  21 type 3 -0.10612452
# 4  4 2020-12-29     B  28 type 4  1.51152200
# 5  5 2020-12-30     A  26 type 5 -0.09465904
# 6  6 2020-12-31     B  24 type 6  2.01842371

As for the changed rnorm output, it was noted in the same link that

Note: The output of set.seed() differs between R >3.6.0 and previous versions. Specify which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can use the RNGversion()-function before set.seed() (e.g.: RNGversion("3.5.2")).
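A minimal sketch of that advice, assuming a session on R 3.6.0 or later that otherwise uses the default RNG settings: call `RNGversion()` before `set.seed()`, draw the values, then switch the sampler back so that later code is not affected. The variable names are illustrative only.

```r
# Request the pre-3.6.0 RNG defaults; R warns that the old sampler is
# non-uniform, so the warning is suppressed here on purpose.
suppressWarnings(RNGversion("3.5.2"))
set.seed(42)
old_age <- sample(18:30, 6, replace = TRUE)

# Restore the current default sampler ("Rejection") for the rest of the session.
RNGkind(sample.kind = "Rejection")
set.seed(42)
new_age <- sample(18:30, 6, replace = TRUE)

old_age  # 29 30 21 28 26 24, the age values shown in the output above
```

Only the discrete sampler is affected by this switch; as noted, the x column is a separate matter.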

Unfortunately, I cannot reproduce the link's version of the x-column.

How to deal with it in production? It is always sketchy to rely on random numbers in unit tests, for two main reasons: you cannot always assume that unseeded random values will hit the corner cases you want; and seeded random values are subject to "bug-fixes" or improvements to the PRNG process, as you're seeing here.
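For the hashing workflow in the question, one possible safeguard is a known-answer check: store one dummy ID, its salt and its hash at setup time, then recompute the hash every time the app starts and refuse to continue on a mismatch. This is a sketch under assumptions, not a tested production design. `verify_known_answer` and the dummy ID "DUMMY1" are hypothetical; only `bcrypt::gensalt()` and `bcrypt::hashpw()` come from the question.

```r
# Hypothetical helper (not part of bcrypt): recompute a stored reference hash
# and stop loudly if the result no longer matches.
verify_known_answer <- function(id, salt, expected, hash_fun) {
  actual <- hash_fun(id, salt)
  if (!identical(actual, expected)) {
    stop("Known-answer check failed: hashing no longer reproduces the stored value.")
  }
  invisible(TRUE)
}

# Example with bcrypt, run only if the package is installed.
if (requireNamespace("bcrypt", quietly = TRUE)) {
  bcrypt_hash <- function(id, salt) bcrypt::hashpw(id, salt = salt)

  # One-time setup: create a dummy reference triple and store it with the app.
  ref_salt <- bcrypt::gensalt(log_rounds = 12)
  ref_hash <- bcrypt_hash("DUMMY1", ref_salt)

  # At every app start: the stored triple must still reproduce.
  verify_known_answer("DUMMY1", ref_salt, ref_hash, bcrypt_hash)
}
```

Passing the hash function as an argument keeps the check independent of any one package, so the same helper can be exercised in a test with a stand-in function.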

Thanks! In both examples R-3.6.0 is mentioned. So one would get different outputs for the same code because one has updated R, right? The other way around: as long as one makes no updates or changes to global settings, everything stays the same - can we say that? — LulY, Feb 15 '22 at 14:51
For `sample.int`, I could reproduce the other data very simply; I went all the way back to before R-1.7.0 and could not reproduce the `x` column, which means something else is happening that I did not discover. As for your statement, you should add in *"nor changes to the R version"* to be truly safe from changes to behavior in random numbers. I understand the frustrating nature of that. Your needs are a little stricter than most, though. — r2evans, Feb 15 '22 at 15:00
"Your needs are a little stricter than most, though." Yes, if the code fails to reproduce the encrypted data I am screwed.. If I add "nor changes to the R version", is it then true: As long one does no updates nor changes to global setting all stays the same - can we say that? — LulY, Feb 15 '22 at 15:01
I think so, yes: unchanged R-version, unchanged global settings ... should always yield the same stochastic results (on the premise that good encryption uses randomness). — r2evans, Feb 15 '22 at 15:19
This need to produce the same encrypted results ... is that for unit-testing, or is there another component to your process that needs perfect reproducibility in production? — r2evans, Feb 15 '22 at 16:01
Interesting data flow. I understand the need to protect PII while preserving traceability. — r2evans, Feb 15 '22 at 18:48
Yes, what I do is pseudonymisation, i.e. a reversible encryption (as opposed to anonymisation, where no one is supposed to be able to recreate the data) — LulY, Feb 15 '22 at 18:54
