
For a coding challenge on a learning platform, I was asked to compute sampling errors for 100 different sample sizes. My approach does not generate the same values as the provided solution, and I do not understand why: to me, the two seem to be doing the same thing. Or am I missing something? I am a coding beginner, so that is entirely possible!

Here is the setup for the challenge:

set.seed(4)
parameter <- mean(houses$SalePrice)  # parameter value = 180796.1
sample_sizes <- seq(from = 5, by = 29, length.out = 100)
library(purrr)

Here is my approach:

sample_means <- map_dbl(sample_sizes, function(x) mean(sample(houses$SalePrice, size=x)))
sampling_errors_a <- parameter - sample_means

Here is the provided solution:

sampling_errors <- map_dbl(sample_sizes, function(x) parameter - mean(sample(houses$SalePrice, size=x)))

When I run identical(sampling_errors_a, sampling_errors), R keeps returning FALSE. I looked at the values of both vectors and, in fact, they are totally different.

I would love to understand why the 2 approaches do not arrive at the same solution. If somebody had a moment to spare to explain, I would very much appreciate it. Thank you to all of you in advance!

Andrea
  • you need to set seeds when doing random number generation – rawr Mar 31 '23 at 20:07
  • Most likely this is a case of floating-point precision limits. Take a careful read of `?identical`. But it would be helpful if you provided the source of `houses` and of `map_dbl` – Carl Witthoft Mar 31 '23 at 20:46
  • @CarlWitthoft: the clue that this is *not* a FAQ 7.31 issue is that the OP says the results are "totally different" – Ben Bolker Mar 31 '23 at 21:19
  • @BenBolker I'd agree except that one never knows what a new poster means by "totally" :-) – Carl Witthoft Mar 31 '23 at 21:29
  • @BenBolker: guys, you are right - "totally different" is not precise. I've only been coding for a few months, so I lack the experience which is reflected in my poor statement. Nonetheless, I will try to do better in the future and I appreciate that you all were so helpful - thank you! – Andrea May 15 '23 at 16:24

1 Answer


tl;dr you have to reset the seed by running set.seed(4) again before you evaluate a different approach. (You don't need to run set.seed() immediately before the new code, but you need to make sure that the code between set.seed() and your new approach doesn't do anything that calls the pseudo-random number generator ...)
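To see why the seed matters, here is a minimal sketch in base R with a made-up vector `x` (a hypothetical stand-in, not your data). Every call to sample() advances the generator's state, so a second call continues from where the first one stopped instead of replaying it:

```r
x <- 1:100   # hypothetical data, stands in for houses$SalePrice

set.seed(4)
first  <- sample(x, size = 5)
second <- sample(x, size = 5)   # RNG state has moved on, so this is a new draw

set.seed(4)                     # rewind the generator to the same starting state
replay <- sample(x, size = 5)

identical(first, replay)        # TRUE: same seed, same sequence of RNG calls
identical(first, second)        # compares two different draws
```

This is what happened in your session: the seed was set once, your approach consumed 100 draws, and the provided solution then started from the state your code had left behind.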

In this particular case, that will give you identical results. However, in general you should get in the habit of using all.equal() instead of identical() to compare floating-point results, as very subtle differences in computation can lead to small differences in results.
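A small illustration of the difference between the two comparisons, using the classic 0.1 + 0.2 example rather than anything from your data. Note that all.equal() returns either TRUE or a character description of the difference, so wrap it in isTRUE() when you need a plain logical:

```r
a <- 0.1 + 0.2
b <- 0.3

identical(a, b)          # FALSE: the two doubles differ in their last bits
isTRUE(all.equal(a, b))  # TRUE: equal within a small numerical tolerance
```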

I've shortened some of the variable names etc. and used a randomly generated example instead of your house-price variable.

set.seed(101)
vals <- rnorm(3000)  ## large enough for the biggest sample size (2876)

set.seed(4)
p0 <- mean(vals)

## same sizes as in the question: 5, 34, 63, ..., 2876
s <- seq(from = 5, by = 29, length.out = 100)

m <- purrr::map_dbl(s, ~ mean(sample(vals, size = .)))
result_1 <- p0 - m

set.seed(4)  ## THIS IS THE KEY STEP
result_2 <- purrr::map_dbl(s, ~ p0 - mean(sample(vals, size = .)))

identical(result_1, result_2) ## TRUE
all.equal(result_1, result_2) ## TRUE
Ben Bolker
  • Well put .... I would recommend even simpler approach: `set.seed(number)` then `foo <- sample(houses$SalePrice, size=x)` , followed by sticking `foo` into the subsequent analysis equations. – Carl Witthoft Mar 31 '23 at 21:31
  • Thank you so much to all of you for taking the time to review and answer my question, I really appreciate it! Of course, I need to make sure that both approaches use the same set.seed() specifications - this should have occurred to me! It totally resolved the issue. Moreover, I am relieved that my approach was not entirely wrong, though not as efficient as the solution. Again, thank you all so much! – Andrea Apr 02 '23 at 20:14
  • IMO there's barely any difference between your solution and the recommended one – Ben Bolker Apr 02 '23 at 22:45
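
The suggestion in the comments above (draw the samples once, then reuse them) could be sketched roughly as follows. This is only an illustration: `vals` is a randomly generated stand-in for houses$SalePrice, the names `draws`, `result_a` and `result_b` are made up, and base R's lapply()/vapply() are used in place of purrr so that the snippet has no dependencies.

```r
set.seed(101)
vals <- rnorm(3000)   # hypothetical stand-in for houses$SalePrice
p0   <- mean(vals)
s    <- seq(from = 5, by = 29, length.out = 100)   # sizes 5, 34, ..., 2876

set.seed(4)
# The only place the random number generator is used
draws <- lapply(s, function(n) sample(vals, size = n))

# Both formulations now operate on the same stored samples
result_a <- p0 - vapply(draws, mean, numeric(1))
result_b <- vapply(draws, function(d) p0 - mean(d), numeric(1))

identical(result_a, result_b)  # TRUE
```

Because the random draws are stored, the comparison no longer depends on remembering to reset the seed between the two approaches.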