2

Basically the idea is for the values n=10,20,30,...100 to take the mean of 10,000 random samples, saving the 10,000 means for later usage.

In a language I'm more accustomed to, I would create a hashmap using each n as a key, and a list of means as the value.

In javascript for example:

var mydata
var map = {}

for (int i = 10; i <= 100; i += 10 ) {
  map[i] = [] // create list
  for (int j = 0; j < 10000; j++) {
    map[i][j] = mean(sample(mydata, i))
  }
}

Now I attempted to do this in R (this is my first time using it), and I ended up with:

hashmap  <- new.env()
sunspots <- read.table("sunspots.txt")

for (i in seq(10, 100, by=10)) {
  hashmap[[i]] <- c()
  for (j in 1:10000) {
    hashmap[[i]][j] <- mean(sample(sunspots$x, i))
  }
}

However this throws an error:

wrong args for environment subassignment

Even if it didn't throw this error, I'm not entirely sure if I'm approaching it the right way.

Could someone help me understand the proper way to go about this?

m0meni
  • 16,006
  • 16
  • 82
  • 141
  • hmmm...I figure I could actually use a matrix to achieve this, but I'm curious to know the actual solution using a hashmap. – m0meni Feb 17 '16 at 01:07

2 Answers2

2

The issue here is that i is a numeric, and environments must be keyed by character strings. Thus your immediate problem can be solved with a simple as.character() coercion on the i variable when it's used to index hashmap.

I would also recommend you refactor the inner loop into a vectorized function call, such as replicate(). Here's how I would do this:

set.seed(1L);
test.data <- 1:200;
N <- 3L;
e <- new.env();
for (i in seq(10L,100L,10L)) e[[as.character(i)]] <- replicate(N,mean(sample(test.data,i)));

Result:

ls(e);
##  [1] "10"  "100" "20"  "30"  "40"  "50"  "60"  "70"  "80"  "90"
for (i in seq(10L,100L,10L)) print(e[[as.character(i)]]);
## [1] 108.3 109.4  82.4
## [1] 108.50  93.65 106.20
## [1] 103.3333  96.0000 101.2333
## [1] 98.075 95.250 83.275
## [1] 106.68  97.48 107.34
## [1]  97.48333 105.95000  98.76667
## [1] 101.8857 102.4857 114.6000
## [1]  99.5875 107.0875  96.0750
## [1]  92.9000 103.0889 100.7889
## [1]  91.19  99.80 101.57

You can change N to 10000 and test.data to sunspots for your real data.


Also, here's an alternative that produces a matrix output, built around the convenient feature of sapply() that it returns a matrix for multi-element return values from FUN():

set.seed(1L);
sapply(seq(10L,100L,10L),function(i) replicate(N,mean(sample(test.data,i))));
##       [,1]   [,2]     [,3]   [,4]   [,5]      [,6]     [,7]     [,8]     [,9]  [,10]
## [1,] 108.3 108.50 103.3333 98.075 106.68  97.48333 101.8857  99.5875  92.9000  91.19
## [2,] 109.4  93.65  96.0000 95.250  97.48 105.95000 102.4857 107.0875 103.0889  99.80
## [3,]  82.4 106.20 101.2333 83.275 107.34  98.76667 114.6000  96.0750 100.7889 101.57
bgoldst
  • 34,190
  • 6
  • 38
  • 64
  • I'm a newcomer so please bear with me here. What's the significance of `L`, what does `set.seed` do, and what's the advantage of `replicate` over a loop...performance? – m0meni Feb 17 '16 at 01:30
  • 1
    The `L` suffix changes the internal type of the numeric literal from double to integer. The benefit of using `replicate()` over a loop is partially performance, and partially concision. – bgoldst Feb 17 '16 at 01:33
  • Ah ok thanks. What about `set.seed`? I read that it sets the seed for the RNG, but isn't it bad to have a constant seed? – m0meni Feb 17 '16 at 01:36
  • 1
    `set.seed()` sets the PRNG seed. It can be useful for rerunning tests, since the PRNG returns the same stream of random numbers when initialized with the same seed. It's recommended to call `set.seed()` at the start of any [MCVE](http://stackoverflow.com/help/mcve)/[MWE](https://en.wikipedia.org/wiki/Minimal_Working_Example)/[SSCCE](http://sscce.org/)/whatever, so everyone who runs the code gets the exact same result. Correct, often you don't want a constant seed for security reasons, but security is not an issue here. – bgoldst Feb 17 '16 at 01:37
  • Hey one final question. Is this the right way to go about solving this problem? I've never written R before so I'm curious how a more experienced person like you would approach it. – m0meni Feb 17 '16 at 01:47
  • 1
    You're doing everything right. Just be careful about performance; R is notorious for making it too easy to write poorly-performing code, especially if lots of looping is involved. – bgoldst Feb 17 '16 at 01:59
1

Wouldn't this be the same, but simpler and more readable?

set.seed(123)
N = 10000
sunspots <- rnorm(N, 10, 2)

sim <- lapply(seq(10, 100, by=10), function(i){
  sapply(1:N, function(j){
    mean(sample(sunspots, i))
   })
})

lapply(sim, head)

It would make sense, as replicate is just an sapply call.

> replicate
function (n, expr, simplify = "array") 
sapply(integer(n), eval.parent(substitute(function(...) expr)), 
    simplify = simplify)
<bytecode: 0x19b0b7108>
<environment: namespace:base>

EDIT

As mentioned in the comments.

simulation <- function(data, i){
  sapply(1:N, function(j) mean(sample(data, i)))
}

sim <- lapply(seq(10, 100, by=10), function(i) simulation(sunspots, i))

# This would give the same output. 
do.call(cbind, lapply(sim, head))

# You could potentially use sapply on the first level also. 
sim <- sapply(seq(10, 100, by=10), function(i) simulation(sunspots, i))

str(sim)
marbel
  • 7,560
  • 6
  • 49
  • 68
  • Why `lapply` instead of the for loop? Also what ends up inside of `sim`? – m0meni Feb 17 '16 at 02:33
  • Here is an [answer](http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar) to that. I normally choose lapply as it makes the code cleaner and there is no need to pre-define an object. You could potentially create a function with the anonymous function. `sim` would be a list, where each element of the list is a `vector`, just type `str(sim)` in the console. – marbel Feb 17 '16 at 02:44
  • Okay so `lapply` is basically the equivalent of `fmap` in a functional programming language right? My final question is how does `lapply(sim, head)` work? I don't understand the intuition behind it. I ran it in the REPL and it returns a matrix, but the rows 1-10 only have 6 values in them. – m0meni Feb 17 '16 at 02:54
  • `lapply` would return a list, so you coud apply functions to that list, for example `head`, `mean`, `quantile`. – marbel Feb 17 '16 at 03:06