
I am working with R and trying to follow this tutorial on function optimization: https://rpubs.com/Argaadya/bayesian-optimization

For this example, I first generate some random data:

#load libraries
library(dplyr)


# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)

From here, I define the function that I want to optimize ("fitness"). This function takes 7 inputs and calculates a "total" mean (a single scalar value). The inputs required for this function are:

  • "random_1" (between 80 and 120)
  • "random_2" (between "random_1" and 120)
  • "random_3" (between 85 and 120)
  • "random_4" (between random_2 and 120)
  • "split_1" (between 0 and 1)
  • "split_2" (between 0 and 1)
  • "split_3" (between 0 and 1)

The function to optimize ("fitness") is defined as follows:

#define fitness function : returns a single scalar value called "total"
fitness <- function(random_1, random_2, random_3, random_4, split_1, split_2, split_3) {

    #bin data according to random criteria
    train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
    
    train_data$cat = as.factor(train_data$cat)
    
    #new splits
    a_table = train_data %>%
        filter(cat == "a") %>%
        select(a1, b1, c1, cat)
    
    b_table = train_data %>%
        filter(cat == "b") %>%
        select(a1, b1, c1, cat)
    
    c_table = train_data %>%
        filter(cat == "c") %>%
        select(a1, b1, c1, cat)
    
    # calculate the quantile ("quant") for each bin
    table_a = data.frame(a_table %>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_1)))
    
    table_b = data.frame(b_table %>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_2)))
    
    table_c = data.frame(c_table %>% group_by(cat) %>%
                             mutate(quant = quantile(c1, prob = split_3)))
    
    # create a new variable ("diff") that measures if the quantile is bigger than the value of "c1"
    table_a$diff = ifelse(table_a$quant > table_a$c1, 1, 0)
    table_b$diff = ifelse(table_b$quant > table_b$c1, 1, 0)
    table_c$diff = ifelse(table_c$quant > table_c$c1, 1, 0)
    
    # group all tables
    final_table = rbind(table_a, table_b, table_c)
    
    # calculate the total mean: this is what needs to be optimized
    mean = mean(final_table$diff)
}

**Goal:** I now want to use the Bayesian optimization algorithm from this tutorial (https://rpubs.com/Argaadya/bayesian-optimization). The objective is to find the values of these 7 inputs that produce the largest value of "mean".

First, define the "search bound":

#define search bound for the 7 inputs

library(rBayesianOptimization)

random_1 = NULL
random_2 = NULL
random_3 = NULL
random_4 = NULL
split_1 = NULL
split_2 = NULL
split_3 = NULL

search_bound <- list(random_1 = c(80,120), random_2 = c(random_1,120),
                     random_3 = c(85,120), random_4 = c(random_2,120),
                     split_1 = c(0,1), split_2 = c(0,1), split_3 = c(0,1))

Second, set the initial sample:

#set initial sample:

set.seed(123)
search_grid <- data.frame(random_1 = runif(20,80,120), 
                          random_2 = runif(20,random_1,120),
                          random_3 = runif(20,85,120),
                          random_4 = runif(20,random_2,120),
                          split_1 = runif(20,0,1),
                          split_2 = runif(20,0,1),
                          split_3 = runif(20,0,1))

Finally, run the Bayesian Optimization algorithm:

#run the bayesian optimization algorithm:
set.seed(1)
bayes_finance_ei <- BayesianOptimization(FUN = fitness, bounds = search_bound, 
                     init_grid_dt = search_grid, init_points = 0, 
                     n_iter = 10, acq = "ei")

But this produces the following error:

Error in FUN(X[[i]], ...) : subscript out of bounds

Can someone please show me what I am doing wrong? I thought I had followed all the necessary steps from the tutorial.

Thanks

stats_noob
  • You've asked quite a few questions (either here or elsewhere) with the same basic data. I highly suggest you take some time to learn how to make a good minimal reproducible example (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), as your current questions contain way too much bloat. – Dason Jul 08 '21 at 00:00
  • @Dason: thank you for your suggestions. I thought that adding more detail would make the questions clearer. But you are right, there is too much bloat. I will try to change that in the future. – stats_noob Jul 08 '21 at 00:01
  • It's also just good general debugging advice. When coming up with a good reproducible example, a lot of times you end up solving your own problem. – Dason Jul 08 '21 at 00:02
  • @Dason: I do not have a computer science or computer programming background. Until now, I have been working with functions from well known libraries and packages in R. Now, I am trying to learn and teach myself how to create my own functions in R, and I am struggling. But hopefully, I will learn soon. – stats_noob Jul 08 '21 at 00:04
  • Definitely. It's not easy. The link I provided is very helpful. – Dason Jul 08 '21 at 00:18
  • Last night I spent hours trying to figure out why these predefined optimization functions in R wouldn't accept the function I created: https://stackoverflow.com/questions/68280857/r-x-probs-outside-0-1 – stats_noob Jul 08 '21 at 00:21
  • I think I responded to that on reddit – Dason Jul 08 '21 at 00:30

1 Answer


There appear to be a few bugs in your code: I don't think your fitness function was returning data in the required format, and some of your vectors were being used before they were defined.
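
As far as I can tell from the rBayesianOptimization documentation, FUN needs to return a named list with a Score component (the value to be maximized) and a Pred component (a prediction slot used for ensembling; 0 is fine if you don't need it). A minimal sketch of that contract, using a toy objective rather than your fitness function:

# minimal sketch of the return format BayesianOptimization() expects from FUN
# "toy_objective" is a made-up example, not the fitness function from the question
toy_objective <- function(x) {
  list(Score = -(x - 2)^2,  # the value the optimizer tries to maximize
       Pred = 0)            # placeholder prediction; not used in this example
}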

I made some changes so your code is more in line with the tutorial, and it seems to complete without error, but I can't say whether the outcome is "correct" or whether it will be suitable for your use-case:

#load libraries
library(tidyverse)
#install.packages("rBayesianOptimization")
library(rBayesianOptimization)

# create some data for this example
a1 = rnorm(1000,100,10)
b1 = rnorm(1000,100,5)
c1 = sample.int(1000, 1000, replace = TRUE)
train_data = data.frame(a1,b1,c1)

#define fitness function : returns a single scalar value called "total"
fitness <- function(random_1, random_2, random_3, random_4, split_1, split_2, split_3) {
  
  #bin data according to random criteria
  train_data <- train_data %>% mutate(cat = ifelse(a1 <= random_1 & b1 <= random_3, "a", ifelse(a1 <= random_2 & b1 <= random_4, "b", "c")))
  
  train_data$cat = as.factor(train_data$cat)
  
  #new splits
  a_table = train_data %>%
    filter(cat == "a") %>%
    select(a1, b1, c1, cat)
  
  b_table = train_data %>%
    filter(cat == "b") %>%
    select(a1, b1, c1, cat)
  
  c_table = train_data %>%
    filter(cat == "c") %>%
    select(a1, b1, c1, cat)
  
  # calculate the quantile ("quant") for each bin
  table_a = data.frame(a_table %>% group_by(cat) %>%
                         mutate(quant = quantile(c1, prob = split_1)))
  
  table_b = data.frame(b_table %>% group_by(cat) %>%
                         mutate(quant = quantile(c1, prob = split_2)))
  
  table_c = data.frame(c_table %>% group_by(cat) %>%
                         mutate(quant = quantile(c1, prob = split_3)))
  
  # create a new variable ("diff") that measures if the quantile is bigger than the value of "c1"
  table_a$diff = ifelse(table_a$quant > table_a$c1, 1, 0)
  table_b$diff = ifelse(table_b$quant > table_b$c1, 1, 0)
  table_c$diff = ifelse(table_c$quant > table_c$c1, 1, 0)
  
  #group all tables
  final_table = rbind(table_a, table_b, table_c)
  
  # calculate the total mean : this is what needs to be optimized
  mean = mean(final_table$diff)
  
  # Based on the tutorial you linked, the fitness func
  # needs to return a list (a Score and a Pred)
  # I'm not sure if this is in line with your intended use-case
  # but it seems to work
  result <- list(Score = mean, Pred = 0)
  return(result)
}

# There were some bugs in this section,
# e.g. you were trying to call vectors ("random_1")
# that hadn't been defined yet
set.seed(123)
random_1 = runif(20,80,120)
random_2 = runif(20, random_1, 120)
random_3 = runif(20,85,120)
random_4 = runif(20, random_2, 120)
split_1= runif(20,0,1)
split_2 = runif(20,0,1)
split_3 = runif(20,0,1)

search_bound <- list(random_1 = c(80,120),
                     random_2 = c(random_1,120),
                     random_3 = c(85,120),
                     random_4 = c(random_2, 120), 
                     split_1 = c(0,1),
                     split_2 = c(0,1),split_3 = c(0,1))

search_grid <- data.frame(random_1, random_2, random_3, random_4, split_1, split_2, split_3)


set.seed(1)
bayes_finance_ei <- BayesianOptimization(FUN = fitness, bounds = search_bound, 
                                         init_grid_dt = search_grid, init_points = 0, 
                                         n_iter = 10, acq = "ei")
#> elapsed = 0.076  Round = 1   random_1 = 91.5031  random_2 = 116.8522 random_3 = 89.9980  random_4 = 118.9459 split_1 = 0.2436195 split_2 = 0.599989  split_3 = 0.6478935 Value = 0.6020 
#> elapsed = 0.023  Round = 2   random_1 = 111.5322 random_2 = 117.3987 random_3 = 99.50912 random_4 = 117.6454 split_1 = 0.6680556 split_2 = 0.3328235 split_3 = 0.3198206 Value = 0.4650 
#> elapsed = 0.026  Round = 3   random_1 = 96.35908 random_2 = 111.5012 random_3 = 99.48035 random_4 = 114.7645 split_1 = 0.4176468 split_2 = 0.488613  split_3 = 0.30772   Value = 0.4520 
#> elapsed = 0.024  Round = 4   random_1 = 115.3207 random_2 = 119.9732 random_3 = 97.90959 random_4 = 119.9805 split_1 = 0.7881958 split_2 = 0.9544738 split_3 = 0.2197676 Value = 0.8840 
#> elapsed = 0.024  Round = 5   random_1 = 117.6187 random_2 = 119.1801 random_3 = 90.33557 random_4 = 119.8480 split_1 = 0.1028646 split_2 = 0.4829024 split_3 = 0.3694889 Value = 0.4690 
#> elapsed = 0.024  Round = 6   random_1 = 81.82226 random_2 = 108.8724 random_3 = 89.85821 random_4 = 113.8633 split_1 = 0.4348927 split_2 = 0.8903502 split_3 = 0.9842192 Value = 0.9090 
#> elapsed = 0.025  Round = 7   random_1 = 101.1242 random_2 = 111.3939 random_3 = 93.15619 random_4 = 118.3654 split_1 = 0.984957  split_2 = 0.9144382 split_3 = 0.1542023 Value = 0.8130 
#> elapsed = 0.023  Round = 8   random_1 = 115.6968 random_2 = 118.2535 random_3 = 101.3087 random_4 = 119.6723 split_1 = 0.8930511 split_2 = 0.608735  split_3 = 0.091044  Value = 0.7490 
#> elapsed = 0.024  Round = 9   random_1 = 102.0574 random_2 = 107.2457 random_3 = 94.30904 random_4 = 117.3770 split_1 = 0.8864691 split_2 = 0.4106898 split_3 = 0.1419069 Value = 0.3830 
#> elapsed = 0.022  Round = 10  random_1 = 98.26459 random_2 = 101.4622 random_3 = 115.0240 random_4 = 109.6157 split_1 = 0.1750527 split_2 = 0.1470947 split_3 = 0.6900071 Value = 0.4150 
#> elapsed = 0.023  Round = 11  random_1 = 118.2733 random_2 = 119.9362 random_3 = 86.60409 random_4 = 119.9843 split_1 = 0.1306957 split_2 = 0.9352998 split_3 = 0.6192565 Value = 0.9250 
#> elapsed = 0.023  Round = 12  random_1 = 98.13337 random_2 = 117.8636 random_3 = 100.4770 random_4 = 119.2079 split_1 = 0.6531019 split_2 = 0.3012289 split_3 = 0.8913941 Value = 0.4050 
#> elapsed = 0.025  Round = 13  random_1 = 107.1028 random_2 = 116.0110 random_3 = 112.9624 random_4 = 118.8439 split_1 = 0.3435165 split_2 = 0.06072057    split_3 = 0.6729991 Value = 0.3040 
#> elapsed = 0.023  Round = 14  random_1 = 102.9053 random_2 = 116.5036 random_3 = 89.26647 random_4 = 116.5058 split_1 = 0.6567581 split_2 = 0.9477269 split_3 = 0.7370777 Value = 0.9340 
#> elapsed = 0.022  Round = 15  random_1 = 84.11699 random_2 = 85.0002  random_3 = 104.6332 random_4 = 101.6362 split_1 = 0.3203732 split_2 = 0.7205963 split_3 = 0.5211357 Value = 0.5130 
#> elapsed = 0.023  Round = 16  random_1 = 115.9930 random_2 = 117.9075 random_3 = 92.2286  random_4 = 118.3681 split_1 = 0.1876911 split_2 = 0.1422943 split_3 = 0.6598384 Value = 0.1610 
#> elapsed = 0.024  Round = 17  random_1 = 89.84351 random_2 = 112.7160 random_3 = 89.46361 random_4 = 115.4826 split_1 = 0.7822943 split_2 = 0.5492847 split_3 = 0.8218055 Value = 0.5790 
#> elapsed = 0.025  Round = 18  random_1 = 81.68238 random_2 = 89.97462 random_3 = 111.3658 random_4 = 108.3733 split_1 = 0.09359499    split_2 = 0.9540912 split_3 = 0.7862816 Value = 0.7880 
#> elapsed = 0.023  Round = 19  random_1 = 93.11683 random_2 = 101.6705 random_3 = 116.3266 random_4 = 108.1188 split_1 = 0.466779  split_2 = 0.5854834 split_3 = 0.9798219 Value = 0.7410 
#> elapsed = 0.026  Round = 20  random_1 = 118.1801 random_2 = 118.6017 random_3 = 98.1062  random_4 = 118.7571 split_1 = 0.5115055 split_2 = 0.4045103 split_3 = 0.4394315 Value = 0.4420 
#> elapsed = 0.023  Round = 21  random_1 = 96.45638 random_2 = 111.5322 random_3 = 87.30664 random_4 = 117.3987 split_1 = 0.1028404 split_2 = 1.0000    split_3 = 0.9511012 Value = 0.9910 
#> elapsed = 0.022  Round = 22  random_1 = 117.5734 random_2 = 91.5031  random_3 = 120.0000 random_4 = 116.8522 split_1 = 2.220446e-16  split_2 = 1.0000    split_3 = 2.220446e-16  Value = 0.0020 
#> elapsed = 0.023  Round = 23  random_1 = 111.5021 random_2 = 111.5322 random_3 = 120.0000 random_4 = 117.3987 split_1 = 0.1188052 split_2 = 1.0000    split_3 = 0.6455233 Value = 0.1910 
#> elapsed = 0.028  Round = 24  random_1 = 80.0000  random_2 = 92.90239 random_3 = 86.18939 random_4 = 116.8635 split_1 = 0.2557032 split_2 = 1.0000    split_3 = 0.3517052 Value = 0.5170 
#> elapsed = 0.026  Round = 25  random_1 = 90.64032 random_2 = 92.3588  random_3 = 112.3187 random_4 = 117.3987 split_1 = 1.0000    split_2 = 1.0000    split_3 = 1.0000    Value = 0.9970 
#> elapsed = 0.022  Round = 26  random_1 = 100.4363 random_2 = 104.1665 random_3 = 106.2099 random_4 = 117.3987 split_1 = 1.0000    split_2 = 1.0000    split_3 = 1.0000    Value = 0.9970 
#> elapsed = 0.022  Round = 27  random_1 = 119.0981 random_2 = 91.5031  random_3 = 120.0000 random_4 = 117.3987 split_1 = 1.0000    split_2 = 1.0000    split_3 = 1.0000    Value = 0.9980 
#> elapsed = 0.023  Round = 28  random_1 = 89.95279 random_2 = 101.9462 random_3 = 85.0000  random_4 = 116.9137 split_1 = 2.220446e-16  split_2 = 1.0000    split_3 = 1.0000    Value = 0.9980 
#> elapsed = 0.027  Round = 29  random_1 = 113.5928 random_2 = 91.5031  random_3 = 120.0000 random_4 = 117.3987 split_1 = 1.0000    split_2 = 2.220446e-16  split_3 = 1.0000    Value = 0.9980 
#> elapsed = 0.027  Round = 30  random_1 = 116.9869 random_2 = 91.5031  random_3 = 120.0000 random_4 = 117.2949 split_1 = 1.0000    split_2 = 0.505048  split_3 = 1.0000    Value = 0.9980 
#> 
#>  Best Parameters Found: 
#> Round = 27   random_1 = 119.0981 random_2 = 91.5031  random_3 = 120.0000 random_4 = 117.3987 split_1 = 1.0000    split_2 = 1.0000    split_3 = 1.0000    Value = 0.9980

Created on 2021-07-08 by the reprex package (v2.0.0)
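
The object returned by BayesianOptimization() is a list, so once the run finishes you can pull out the best parameters and score directly (the component names below are from the rBayesianOptimization documentation; adjust if your version differs):

# inspect the optimization result
bayes_finance_ei$Best_Par      # named vector: the 7 inputs at the best score
bayes_finance_ei$Best_Value    # the highest "mean" found
head(bayes_finance_ei$History) # every evaluated point with its Value (score)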

jared_mamrot
  • Mamrot: Thank you so much for your answer! I have been struggling so much to define the "fitness" function in my examples. I was a bit confused about the "Pred" value; is it really necessary? I spent a lot of time yesterday trying to define this "fitness" function so that I could use different optimization functions in R. If you have time, can you please take a look at this question: https://stackoverflow.com/questions/68280857/r-x-probs-outside-0-1 ? Thank you so much for your help! – stats_noob Jul 07 '21 at 23:56
  • I see you've posted a few similar SO questions on optimization using different methods; this question had everything I needed to answer it (hence the upvote), but I've had a look at two of your others and haven't been able to work out the issue, sorry. – jared_mamrot Jul 07 '21 at 23:57
  • Mamrot: Thank you for all your help, I will continue to keep working on this! – stats_noob Jul 07 '21 at 23:58