How to do large combinations with condition in R efficiently?

Question

A survey shows an average score of 4.2 out of 5, with a sample size of 14. How do I create a data frame of the combinations of individual scores that produce an average of 4.2?

I tried this, but the result got too big:

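library(tidyverse)

n <- 14
avg <- 4.2
# every possible assignment of scores 1..5 to n respondents (5^n rows)
df <- expand.grid(rep(list(1:5), n))
df <- df %>%
  rowwise() %>%
  mutate(avge = mean(c_across(everything()))) %>%  # row-wise mean of all score columns
  filter(avge >= 4)                                # same column name as in mutate()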

The aim is this: given only the limited information above, I want to know the distribution of combinations of individual scores, see which combination is most likely to occur, and see how many low and high scores are needed to reach that average.

Thanks!

What are you aiming to achieve with the line `expand.grid(rep(list(c(1:5)),n))`? It looks like you're looking to create every possible combination of scores 1:5 for 14 individuals, which needs 5^14 = ~6 billion possibilities, each with 14 columns. Enumerating them all may push the limits of your RAM. — Jon Spring, Jul 30 '21 at 18:05
"may push the limits" --> for 14 samples, I think that'll take around 350 GB of RAM. That will take a formidable computer. — Jon Spring, Jul 30 '21 at 18:12
@JonSpring, I think that's exactly the problem here ... I'm inferring that they are generalizing the `n` so that they may choose 4 or 14 or 48 or something else, in which case ... we're off to the races wrt memory usage. — r2evans, Jul 30 '21 at 18:12
Thanks for the comments. I'll edit my post to include the aim, perhaps there is another better way to solve this. — KKW, Jul 30 '21 at 20:22
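
The size estimate in the comments can be checked with a little arithmetic. This is only a rough sketch: it assumes each score is stored as a 4-byte integer (which is what `1:5` gives in R) and ignores any overhead from the data frame itself or from intermediate copies.

```r
n <- 14

# number of rows expand.grid() would have to create: 5 choices for each of n columns
n_rows <- 5^n

# rough storage: rows x columns x 4 bytes per integer (assumption: no overhead)
bytes <- n_rows * n * 4
gb <- bytes / 1e9

n_rows  # about 6.1 billion
gb      # roughly 340 GB
```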

r2evans · Accepted Answer · 2021-07-30T18:24:26.047

If you can tolerate doing this randomly, then:

set.seed(42) # only so that you get the same results I show here
n <- 14
iter <- 1000000
scores <- integer(0)
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  if (mean(tmp) > 4) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}
mean(scores)
# [1] 4.142857
scores
#  [1] 5 3 5 5 5 3 3 5 5 2 5 5 4 3

Notes:

The reason I use iter in there is to preclude the possibility of an "infinite" loop. While here it reacts rather quickly and is highly unlikely to go that far, if you change the conditions then it is possible your conditions could be infeasible or just highly improbable. If you don't need this, then remove iter and use instead while (TRUE) ...; you can always interrupt R with Escape (or whichever mechanism your IDE provides).
The reason I prefill scores with an empty vector and use tmp is so that you won't accidentally assume that scores having values means you have your average. That is, if the constraints are too tight, then you should find nothing, and therefore scores should not have values.

FYI: if you're looking for an average of 4.2, two things to note:

Change the conditional to what you need, such as looking for 4.2 ... but ...
Testing for floating-point equality is going to bite you hard (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). I suggest looking within a tolerance, perhaps:

tol <- 0.02
# ...
  if (abs(mean(tmp) - 4.2) < tol) {
    scores <- tmp
    break
  }
# ...

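where tol is some meaningful number. Unfortunately, using this seed (and my iter limit), no sample of 14 votes (of 1 to 5) produces a mean within tol = 0.01 of 4.2. None can: the mean of 14 integer votes is always a multiple of 1/14, and the closest achievable value, 59/14 ≈ 4.2143, is about 0.0143 away from 4.2: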

set.seed(42)
n <- 14
iter <- 100000
scores <- integer(0)
tol <- 0.01
while (iter > 0) {
  tmp <- sample(1:5, size = n, replace = TRUE)
  # if (mean(tmp) > 4) {
  if (abs(mean(tmp) - 4.2) < tol) {
    scores <- tmp
    break
  }
  iter <- iter - 1
}

iter
# [1] 0     # <-- this means the loop exited on the iteration-limit, not something found
scores
# integer(0)
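
Why the loop comes back empty can be checked directly by listing every mean that 14 integer votes can produce. This is a small sketch; the names `sums`, `means`, and `best` are only illustrative.

```r
n <- 14
target <- 4.2

# 14 votes of 1..5 can total anything from 14 to 70, so these are all achievable means
sums <- n:(5 * n)
means <- sums / n

# the achievable mean closest to the target, and how far away it is
best <- which.min(abs(means - target))
sums[best]
means[best]
gap <- abs(means[best] - target)
gap

# a tolerance smaller than this gap can never be satisfied
gap < 0.01
gap < 0.02
```

Any tol below the printed gap makes the condition infeasible, no matter the seed or the iteration limit.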

If you instead set tol = 0.02, then you will find something:

tol <- 0.02
# ...
scores
#  [1] 4 4 4 4 4 5 4 5 5 5 3 4 3 5
mean(scores)
# [1] 4.214286

Thanks. I think using random sampling might be a good idea. I want a distribution of combinations, though. The basic idea is to see which combination is most likely to occur. — KKW, Jul 30 '21 at 20:53
Wrap this process in `replicate(10, zz)`, where `zz` is the `while` loop above and 10 is the number of replications (ergo 10 possible combinations), and you are likely to get a distribution of possible respondents. This will return a `matrix`. — r2evans, Jul 30 '21 at 20:58
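
The `replicate()` suggestion might look something like the sketch below. It assumes the loop is wrapped in a helper function; `draw_scores` and its arguments are hypothetical names chosen for illustration, not something from the answer.

```r
# Hypothetical helper: draw random votes until the mean is within tol of target.
# Returns a vector of NAs if nothing is found within max_iter tries.
draw_scores <- function(n = 14, target = 4.2, tol = 0.02, max_iter = 1e5) {
  for (i in seq_len(max_iter)) {
    tmp <- sample(1:5, size = n, replace = TRUE)
    if (abs(mean(tmp) - target) < tol) return(tmp)
  }
  rep(NA_integer_, n)
}

set.seed(42)  # only for reproducibility
draws <- replicate(10, draw_scores())  # one column per accepted sample

dim(draws)

# how many 1s, 2s, ..., 5s each accepted sample contains (one column per sample)
score_counts <- apply(draws, 2, function(x) table(factor(x, levels = 1:5)))
score_counts
```

Increasing the number of replications gives a better picture of which mixes of low and high scores show up most often, at the cost of more sampling time.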

score 1 · Answer 2 · answered Jul 30 '21 at 23:23

You can try the code below:

n <- 14
avg <- 4.2
repeat{
  x <- sample(1:5, n, replace = TRUE)
  if (sum(x) == round(avg * n)) break
}

and you will see something like this (no seed is set, so your draw may differ):

> x
 [1] 5 5 5 5 5 5 4 5 5 4 1 5 1 4
> mean(x)
[1] 4.214286
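
If the whole distribution is wanted rather than a single draw, one possible alternative to sampling is to enumerate how many respondents gave each score instead of enumerating every respondent. This is a sketch of a different technique from the loop above, under two assumptions: the total is fixed at 59 (that is, `round(avg * n)`), and every score from 1 to 5 is equally likely for each respondent, as `sample(1:5, ...)` also assumes. The column names `k1` to `k5`, `orderings`, and `prob` are illustrative.

```r
n <- 14
target_sum <- 59  # round(4.2 * 14)

# k1..k4 = number of respondents giving scores 1..4; k5 is whatever remains
counts <- expand.grid(k1 = 0:n, k2 = 0:n, k3 = 0:n, k4 = 0:n)
counts$k5 <- n - rowSums(counts)
counts <- counts[counts$k5 >= 0, ]

# keep only the count vectors whose scores add up to the target total
score_sum <- drop(as.matrix(counts) %*% 1:5)
counts <- counts[score_sum == target_sum, ]

# number of distinct orderings of respondents for each count vector (multinomial coefficient)
counts$orderings <- apply(counts[, 1:5], 1,
                          function(k) factorial(n) / prod(factorial(k)))

# probability of each count vector, given that the total is target_sum
counts$prob <- counts$orderings / sum(counts$orderings)

counts <- counts[order(-counts$prob), ]
head(counts)
```

The grid here has only 15^4 rows before filtering, so it fits in memory easily, and the first rows of `counts` are the most likely mixes of low and high scores under the uniform assumption.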

ThomasIsCoding
