
I know the names of four individuals, and an interval within which each was born (given by the birth_low and birth_high columns):

> library(lubridate)  # provides dmy()
> df <- data.frame(id = 1:4, name = c("john", "john", "leo", "anna"), birth_low = dmy(c("01/01/1978", "01/01/1978", "01/03/1979", "01/03/1979")), birth_high = dmy(c("31/12/1978", "31/12/1978", "30/03/1979", "01/04/1979")))
> df
 id name  birth_low  birth_high
 1  john 01/01/1978  31/12/1978
 2  john 01/01/1978  31/12/1978
 3  leo  01/03/1979  30/03/1979
 4  anna 01/03/1979  01/04/1979

I need to write reproducible code that assigns a random date of birth (DoB) to each record. Other considerations require me to use a loop for this:

> for (n in 1:nrow(df)) {
   set.seed(n)
   date <- runif(1,df$birth_low[n], df$birth_high[n])
   date <- ceiling(date) # round up float number
   date <- dmy("01/01/1970") + date 
   date <- format(date, "%d/%m/%Y")
   df$DoB[n] <- date
  }
> df$DoB
 [1] "07/04/1978" "09/03/1978" "05/03/1979" "19/03/1979"

An obvious issue with the code above is that it uses n to set the seed for every iteration. I will constantly be inputting new values, and if another person in df[1,] had the same values for birth_low and birth_high, then the same "random" date would be produced ("07/04/1978").
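
The collision described above can be reproduced in a few lines. This is only a minimal sketch in base R: it uses `as.Date()` instead of `dmy()` so that it runs without lubridate, and `draw_dob` is a hypothetical helper introduced for illustration, not part of the original code.

```r
# Hypothetical helper: reseed, then draw one date in [low, high]
draw_dob <- function(seed, low, high) {
  set.seed(seed)
  days <- ceiling(runif(1, as.numeric(low), as.numeric(high)))
  as.Date(days, origin = "1970-01-01")
}

low  <- as.Date("01/01/1978", format = "%d/%m/%Y")
high <- as.Date("31/12/1978", format = "%d/%m/%Y")

# Two different people who both sit in row 1 of their data frames
a <- draw_dob(1, low, high)
b <- draw_dob(1, low, high)
identical(a, b)  # TRUE: same seed and same interval give the same date
```

Because the seed is reset immediately before each draw, the "random" date is fully determined by the row number and the interval.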

I thought of determining the seed through the length of the name or a combination of letters, but these alternatives yield a similar problem (e.g. every "john" in the first row will have the same seed). So the problem really is how to set the seed within a loop in a way that is independent from the data, yet still reproducible.

Any ideas?

InspectorSands
  • Well what about the `id` column? As long as you always append new data, that will work fine? – Stephen Henderson Feb 09 '16 at 23:01
  • If you set the seed once outside of the loop, the results of the loop would still be reproducible, wouldn't they? Or do you need to be able to independently check a given row of your data afterwards? If the latter, perhaps you can combine the row number and name and use something like what [was suggested here](http://stackoverflow.com/a/10913336/1270695) to generate a seed using an alphanumeric input. – A5C1D2H2I1M1N2O1R2T1 Feb 09 '16 at 23:05
  • @StephenHenderson The data come from household questionnaires, so the `id` column is really the line number within each questionnaire. I see your point, but I don't think it will be technically convenient for me to append all the new records to a single data frame. I'll be doing record linkage to try and match people reported in different questionnaires, so it can get a bit messy. – InspectorSands Feb 09 '16 at 23:09
  • @AnandaMahto, the problem with setting the seed outside of the loop is that it would deliver the same `DoB` for every case where `birth_low` and `birth_high` have the same values. As for combining the row number and name, the (slightly less critical) issue is that all people called `"john"` registered in row 1 will have the same seed. And in the (unlikely, but possible) case that two johns had the same values for `birth_low` and `birth_high`, then the same "random" date would be produced. – InspectorSands Feb 09 '16 at 23:18
  • @D.Alburez, I think you're mistaken about what happens when setting the seed outside of the loop. As for my other suggestion, I'm presuming there are more than two columns that can be used to make a potentially unique alphanumeric identifier. – A5C1D2H2I1M1N2O1R2T1 Feb 09 '16 at 23:20
  • @AnandaMahto, my objection to setting the seed outside of the loop is this: imagine a df2 with the same `birth_low` and `birth_high` values, but different names (i.e. different people reported the same intervals in the same order). Admittedly, this is quite unlikely, but I think this would produce the same `DoB` values (since the seed is the same) as in the original df. Please correct me if I'm mistaken. I guess a solution would then be to change the seed outside the loop every time data is entered. I think I can do it using unique questionnaire identifiers. – InspectorSands Feb 09 '16 at 23:59
  • Ananda Mahto is right, just set it once outside the loop. The `min` and `max` of `runif` don't matter in terms of the randomness; it just uses them to scale the same sequence of random numbers determined by `set.seed`. Calling `runif(1, 0, 10)` and then `runif(1, 0, 10)` again produces different numbers. That's the point, really. – alistaire Feb 10 '16 at 01:55
  • @alistaire, thanks for your comment. You are of course right. I can see why Ananda's answer is less redundant than what I had in mind. – InspectorSands Feb 11 '16 at 01:06
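
The suggestion in the comments (seed once, outside the loop) might look roughly like the sketch below. It is an illustration under assumptions, not a definitive implementation: it uses base R `as.Date()` rather than `dmy()`, the wrapper `assign_dob` is hypothetical, and the seed value `2016` is arbitrary.

```r
df <- data.frame(
  id = 1:4,
  name = c("john", "john", "leo", "anna"),
  birth_low  = as.Date(c("1978-01-01", "1978-01-01", "1979-03-01", "1979-03-01")),
  birth_high = as.Date(c("1978-12-31", "1978-12-31", "1979-03-30", "1979-04-01"))
)

# Hypothetical wrapper: seed once, then draw one date per row in a loop
assign_dob <- function(df, seed) {
  set.seed(seed)  # set once, outside the loop
  dob <- character(nrow(df))
  for (n in seq_len(nrow(df))) {
    low  <- as.numeric(df$birth_low[n])
    high <- as.numeric(df$birth_high[n])
    days <- ceiling(runif(1, low, high))  # next number in the stream, scaled
    dob[n] <- format(as.Date(days, origin = "1970-01-01"), "%d/%m/%Y")
  }
  dob
}

run1 <- assign_dob(df, seed = 2016)
run2 <- assign_dob(df, seed = 2016)
identical(run1, run2)  # TRUE: the whole sequence is reproducible
```

Rows that share an interval consume successive numbers from the same stream, so they will generally receive different dates. The trade-off is that results depend on row order: a single row cannot be re-derived on its own without replaying the draws before it. Passing a different `seed` per data frame (for example, one derived from a unique questionnaire identifier, as mentioned in the thread) would keep two data frames with identical intervals from producing identical dates.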
