
I am working with the R programming language. I got the following loop to run:

library(dplyr)

list_results <- list()
for (i in 1:100){
    
    c1_i = c2_i = c3_i = 0
    
    while(c1_i + c2_i  + c3_i < 15 ){
        
        
        num_1_i = sample_n(iris, 30)
        num_2_i = sample_n(iris, 30)
        num_3_i = sample_n(iris, 30)
        
        
        c1_i = mean(num_1_i$Sepal.Length)
        c2_i = mean(num_2_i$Sepal.Length)
        c3_i = mean(num_3_i$Sepal.Length)
        ctotal_i = c1_i + c2_i  + c3_i

        combined_i = rbind(num_1_i, num_2_i, num_3_i)
        nrow_i = nrow(unique(combined_i[duplicated(combined_i), ]))
        
    }
    
    inter_results_i <- data.frame(i, c1_i, c2_i, c3_i, nrow_i, ctotal_i)
    list_results[[i]] <- inter_results_i
}

Now, I want to try and add a second condition to this loop. Using this post as a reference (How to have two conditions in a While loop?), I tried to do this as follows:

list_results <- list()
for (i in 1:100){
    
    c1_i = c2_i = c3_i = ctotal_i =  0
    
    while(c1_i + c2_i  + c3_i < 15 && nrow_i == 0 ) {
        
        
        num_1_i = sample_n(iris, 30)
        
        
        
        num_2_i = sample_n(iris, 30)
        
        
        num_3_i = sample_n(iris, 30)
        
        
        c1_i = mean(num_1_i$Sepal.Length)
        c2_i = mean(num_2_i$Sepal.Length)
        c3_i = mean(num_3_i$Sepal.Length)
        ctotal_i = c1_i + c2_i  + c3_i
        
        combined_i = rbind(num_1_i, num_2_i, num_3_i)
        nrow_i = nrow(unique(combined_i[duplicated(combined_i), ]))
        
    }
    
    inter_results_i <- data.frame(i, c1_i, c2_i, c3_i, ctotal_i, nrow_i)
    list_results[[i]] <- inter_results_i
}

But for some reason, this is always producing an "empty" list.

Can someone please show me what I am doing wrong and how to fix this?

Thanks!

stats_noob
  • You have set `nrow_i` to 0 in the first line of the for loop, so the condition `nrow_i != 0` will never evaluate to true and the while loop won't execute. – Muon Jul 04 '22 at 03:51
  • @ Muon : Thank you for pointing this out! I made this correction and it seems to work. But then I tried nrow_i < 5 and it goes back to producing an empty list. Do you know why this is happening? Thank you so much! – stats_noob Jul 04 '22 at 03:57
  • That works fine for me. Is this your new condition? `while(c1_i + c2_i + c3_i < 15 && nrow_i < 5 )`. By the way, as a side note it's fine to just use `&` instead of `&&` in this case ([more info](https://stackoverflow.com/questions/16027840/whats-the-differences-between-and-and-in-r)). – Muon Jul 04 '22 at 04:08
  • What exactly do you want this loop to achieve? It is hard to debug without knowing the expected behaviour. – Muon Jul 04 '22 at 04:09
  • When I used : while(c1_i + c2_i + c3_i < 15 && nrow_i < 5 ) ... I basically get an empty list. – stats_noob Jul 04 '22 at 04:26
  • Copy and paste this and let me know if you still get an empty list. https://pastebin.pl/view/a7349a92 – Muon Jul 04 '22 at 04:40
  • Thank you, this worked!! I just find it so confusing ... I took the code from "pastebin" and reversed the condition in the loop: while(c1_i + c2_i + c3_i < 15 & nrow_i > 5) now produces an empty list .... but while(c1_i + c2_i + c3_i < 15 & nrow_i < 5) works fine. Is there some reason for this? Thank you so much for all your help! – stats_noob Jul 04 '22 at 05:05
  • @stats_noob Can you explain in words what you're trying to do? This code can be *significantly* improved using vectorised operations. – Maurits Evers Jul 04 '22 at 05:38
  • @Maurits Evers: Thank you for your reply! I am trying to learn more about WHILE LOOPS. In this case, I want to randomly take samples from the iris dataset and always make sure that none of these random samples have any rows in common (or fewer than "n" rows in common). I thought this would be taken care of using the "nrow_i > 5" option. Do you have any ideas about this? Thank you so much! – stats_noob Jul 04 '22 at 05:42
  • Here's how I read the first steps of your code: (1) Draw 30 samples without replacement from `iris$Sepal.Length` and calculate the mean. Do this 3 times. (2) Calculate the sum of the three Sepal.Length means. (3) Calculate the number of duplicated samples across all 3x30 samples. (4) If the number of dupes is less than x and the sum of the means is less than y, do z. [Not sure about x, y and z; the statements in the comments don't seem to match the original post.] – Maurits Evers Jul 04 '22 at 05:43
  • @ Maurits Evers - yes, that is correct! – stats_noob Jul 04 '22 at 05:48
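
A minimal sketch of the behaviour discussed in the comments above, assuming an initial value of 0 as in the question. `while` evaluates its condition before every pass, including the first, so a condition such as `nrow_i > 5` that is false on entry means the body never runs and the variables keep their initial values. The counters `n_runs` and `n_runs_j` are hypothetical helpers added only for illustration.

```r
# Case 1: the condition is FALSE on entry, so the body is skipped entirely.
nrow_i <- 0
n_runs <- 0
while (nrow_i > 5) {
    n_runs <- n_runs + 1
    nrow_i <- nrow_i + 1
}
n_runs   # 0: the loop body never executed
nrow_i   # 0: still the initial value

# Case 2: the condition is TRUE on entry, so the body runs until it turns FALSE.
nrow_j <- 0
n_runs_j <- 0
while (nrow_j < 5) {
    n_runs_j <- n_runs_j + 1
    nrow_j <- nrow_j + 2
}
n_runs_j # 3: the body ran for nrow_j = 0, 2 and 4
nrow_j   # 6: the final value overshoots the bound of 5
```

This would explain why reversing the comparison changes the outcome: with a starting value of 0, `nrow_i < 5` lets the loop start, whereas `nrow_i > 5` does not.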

1 Answer


Here is an attempt at optimising your code using vectorised functions. I have also renamed your variables to be more descriptive.

# Set fixed seed for reproducibility
set.seed(2020)

sample_function <- function(sum_of_mean_thresh = 15, n_dupes_thresh = 10) {
    # Still uses a `while` loop
    sum_of_mean <- 0
    n_dupes <- 0
    sample_idx <- matrix()
    while(sum_of_mean < sum_of_mean_thresh & n_dupes < n_dupes_thresh) {   
        sample_idx <- replicate(3L, sample(nrow(iris), 30L))
        sum_of_mean <- sum(apply(sample_idx, 2, function(row) mean(iris$Sepal.Length[row])))
        n_dupes <- sum(duplicated(as.integer(sample_idx)))
    }
    # Return:
    #  - 30x3 matrix of row indices for each of the 3 samples
    #  - the sum of the mean of the sampled iris$Sepal.Length
    #  - the number of duplicate rows across all 3x30 samples
    list(sample_idx = sample_idx, sum_of_mean = sum_of_mean, n_dupes = n_dupes)
}

# Execute the sample function 100 times and return a `list` 
# (with every element being a `list` returned from `sample_function()`)
replicate(100, sample_function(), simplify = FALSE)

This should be significantly faster than the original code.
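
As a usage sketch, the list returned by `replicate()` can be collapsed into one data frame row per run, similar to the `inter_results_i` data frames in the question. To keep the block self-contained it uses `draw_once()`, a hypothetical stand-in that performs a single draw and returns a list of the same shape as `sample_function()`; the names `results` and `summary_df` are also assumptions made for illustration.

```r
set.seed(2020)

# Hypothetical stand-in for sample_function(): one draw, no `while` loop,
# returning a list with the same three elements
draw_once <- function() {
    sample_idx <- replicate(3L, sample(nrow(iris), 30L))
    list(
        sample_idx = sample_idx,
        sum_of_mean = sum(apply(sample_idx, 2, function(idx) mean(iris$Sepal.Length[idx]))),
        n_dupes = sum(duplicated(as.integer(sample_idx)))
    )
}

results <- replicate(100, draw_once(), simplify = FALSE)

# One row per run, keeping only the scalar summaries
summary_df <- do.call(rbind, lapply(seq_along(results), function(i) {
    data.frame(i = i,
               sum_of_mean = results[[i]]$sum_of_mean,
               n_dupes = results[[i]]$n_dupes)
}))
head(summary_df)
```

The index matrices are left in `results`, so the sampled rows for run `i` can still be recovered with `iris[results[[i]]$sample_idx[, 1], ]` and so on.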

Maurits Evers
  • Thank you so much for your answer! I was trying to learn how to use WHILE LOOPS - if you have time, could you please try to show me how to do this with WHILE LOOPS (as the way I was approaching the problem)? Thank you so much! – stats_noob Jul 04 '22 at 06:38
  • @stats_noob The `while` loop is still part of `sample_function`. `replicate(100, ...)` replaces your `for` loop. Inside `while` we use vectorised functions to speed up the sampling and processing steps. – Maurits Evers Jul 04 '22 at 07:09
  • @ Maurits Evers : Thank you so much for your answer! I just ran the code you posted and observed the following output: [[100]]$n_dupes [1] 21 – stats_noob Jul 04 '22 at 13:56
  • It seems like the dup threshold is not being respected? Can you please take a look at this if you have time? Thank you so much! – stats_noob Jul 04 '22 at 13:57
  • @stats_noob It behaves exactly like your original code. You're seeing returned values that exceed the thresholds (this happens for both `n_dupes` and `sum_of_mean`) because a `while` loop checks its condition on entry, before the body runs. Consider the following as a simple example: `i <- 0; while (i < 4) i <- i + 1.5; print(i)` Notice how the final value of `i` is 4.5, which is greater than the threshold (4). This is just how `while` loops work. – Maurits Evers Jul 05 '22 at 00:01
  • The same thing happens with the lengthy `for` + `while` code on pastebin. If you want to store the value(s) prior to re-calculating, you need to do this at the beginning of the `while` loop. The logic is then as follows: (1) Check `while` condition and enter loop. (2) Save `n_dupes` and `sum_of_mean` from *previous* iteration. (3) Re-calculate `n_dupes` and `sum_of_mean`. (4) Back to (1). – Maurits Evers Jul 05 '22 at 00:05
  • @Maurits: Thank you for your reply! I would have thought that while(c1_i + c2_i + c3_i < 15 && nrow_i > 5 ) would keep running (even if it runs for infinite time) until all values of "nrow_i < 5". Is this correct? In short - suppose I didn't care how long the R code takes to run - what kind of condition would I have to write to ENSURE that this WHILE LOOP ONLY outputs results where nrow_i<5? – stats_noob Jul 05 '22 at 00:10
  • As I said, `while` checks the condition on entry. If the condition is fulfilled (in your case, it is since we saved the previous set of values) it continues to run, re-calculates and updates `n_dupes` and `sum_of_mean`, and then on the next iteration breaks out of the loop since the condition is false. Take a look at the mini example from my comment to understand what's going on. – Maurits Evers Jul 05 '22 at 00:44
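
One way to make sure that only acceptable draws are returned is to phrase the `while` condition as "the current result is not yet acceptable", so that the loop can only end on a draw that passes. The sketch below is a hedged illustration rather than part of the answer: `draw_until_ok()`, `max_dupes`, `max_iter` and `accepted` are hypothetical names, only the duplicate criterion is checked, and `max_iter` is a safety cap that is not in the original code. The default threshold of 15 is arbitrary; three samples of 30 from 150 rows share roughly 17 indices on average, so a threshold as low as 5 would be met only rarely.

```r
set.seed(2020)

draw_until_ok <- function(max_dupes = 15, max_iter = 1000) {
    n_dupes <- Inf      # not acceptable yet, so the body runs at least once
    iter <- 0
    sample_idx <- NULL
    # Keep drawing WHILE the latest draw is unacceptable (and the cap allows it)
    while (n_dupes >= max_dupes && iter < max_iter) {
        iter <- iter + 1
        # Three samples of 30 row indices each, as columns of a 30x3 matrix
        sample_idx <- replicate(3L, sample(nrow(iris), 30L))
        # Count repeated occurrences of a row index across all 90 draws
        n_dupes <- sum(duplicated(as.integer(sample_idx)))
    }
    list(sample_idx = sample_idx, n_dupes = n_dupes, iter = iter,
         accepted = n_dupes < max_dupes)
}

res <- draw_until_ok()
res$accepted
res$n_dupes
```

If `accepted` is `FALSE`, the cap was reached before any draw met the criterion, and the returned values should not be used as a valid result.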