2

Suppose I have the following 10 variables (num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5):

set.seed(123)

num_var_1 <- rnorm(1000, 10, 1)
num_var_2 <- rnorm(1000, 10, 5)
num_var_3 <- rnorm(1000, 10, 10)
num_var_4 <- rnorm(1000, 10, 10)
num_var_5 <- rnorm(1000, 10, 10)

factor_1 <- c("A","B", "C")
factor_2 <- c("AA","BB", "CC")
factor_3 <- c("AAA","BBB", "CCC", "DDD")
factor_4 <- c("AAAA","BBBB", "CCCC", "DDDD", "EEEE")
factor_5 <- c("AAAAA","BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")

factor_var_1 <- as.factor(sample(factor_1, 1000, replace=TRUE, prob=c(0.3, 0.5, 0.2)))
factor_var_2 <-  as.factor(sample(factor_2, 1000, replace=TRUE, prob=c(0.5, 0.3, 0.2)))
factor_var_3 <-  as.factor(sample(factor_3, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.2, 0.1)))
factor_var_4 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
factor_var_5 <-  as.factor(sample(factor_4, 1000, replace=TRUE, prob=c(0.3, 0.2, 0.1, 0.1, 0.1)))

id = 1:1000

my_data = data.frame(id,num_var_1, num_var_2, num_var_3, num_var_4, num_var_5, factor_var_1, factor_var_2, factor_var_3, factor_var_4, factor_var_5)


> head(my_data)
  id num_var_1 num_var_2 num_var_3 num_var_4  num_var_5 factor_var_1 factor_var_2 factor_var_3 factor_var_4 factor_var_5
1  1  9.439524  5.021006  4.883963  8.496925  11.965498            B           AA          AAA         CCCC         AAAA
2  2  9.769823  4.800225 12.369379  6.722429  16.501132            B           AA          AAA         AAAA         AAAA
3  3 11.558708  9.910099  4.584108 -4.481653  16.710042            C           AA          BBB         AAAA         CCCC
4  4 10.070508  9.339124 22.192276  3.027154  -2.841578            B           CC          DDD         BBBB         AAAA
5  5 10.129288 -2.746714 11.741359 35.984902 -10.261096            B           AA          AAA         DDDD         DDDD
6  6 11.715065 15.202867  3.847317  9.625850  32.053261            B           AA          CCC         BBBB         EEEE

My Question: I am interested in selecting a random number of variables from this data - and taking random subsets from these variables. (And then repeating this process many times). For example - I would like to record such a randomly generated list:

  • Iteration 1: num_var_2 > 12, factor_var_1 = "A, C", factor_var_4 = "BBBB, DDDD, EEEE"

  • Iteration 2: num_var_1 >0, num_var_3 <10, factor_var_2 = "AA, BB, CC", factor_var_3 = "AAA", factor_var_5 = "CCCCC, DDDDD"

  • Iteration 3: num_var_2 <5, num_var_5 <10, factor_var_1 = "B", factor_var_3 = "AAA"

  • Iteration 4 : factor_var_4 = "BBBB"

etc.

I can perform the above manually, but this would take a long time (e.g. 10 iterations). Is there a way to automate this process and in the end, just output this kind of list (10 rows × 2 columns) :

Iteration                                                                                                  Condition
1                                               num_var_2 > 12, factor_var_1 = A, C, factor_var_4 = BBBB, DDDD, EEEE
2            num_var_1 >0, num_var_3 <10, factor_var_2 = AA, BB, CC, factor_var_3 = AAA, factor_var_5 = CCCCC, DDDDD
3                                                  num_var_2 <5, num_var_5 <10, factor_var_1 = B, factor_var_3 = AAA
4                                                                                                factor_var_4 = BBBB

Can someone please show me how to do this?

stats_noob
  • 5,401
  • 4
  • 27
  • 83

2 Answers2

2

As far as I understand, for a factor or a character vector, we need a function to randomly decide a sample size and then a random sample out of some data points. For a numeric vector, we need a function to randomly decide a cutoff between the minimum and maximum and whether to pick numbers greater or smaller than that cutoff point. Finally, we need to summarize the rules based on the formats provided in this post.

Consider the following function for factors and characters. It first decides a random sample size based on the number of items in x and then randomly samples items from x.

random_pick <- function(x) {
  sample_size <- sample.int(length(x), 1L)
  out <- x[sort(sample.int(length(x), sample_size))]
  list("=", out)
}

Also, consider a function like this for numerics. It finds the min/max, determines the cutoff, and the sign for comparison.

random_trunc <- function(x) {
  rng <- range(x)
  cutoff <- runif(1L, rng[[1L]], rng[[2L]])
  sgn <- c("<", ">")[[sample.int(2L, 1L)]]
  list(sgn, cutoff)
}

Then, we assemble these two functions together for your specific case. Note that for characters we only need to pick the unique ones.

random_select <- function(x) {
  if (is.numeric(x))
    return(random_trunc(x))
  if (is.factor(x))
    return(random_pick(levels(x)))
  random_pick(unique(x))
}

report generates the rules we want based on the formats provided.

report <- function(f) function(...) {
  x <- f(...)
  if (x[[1L]] != "=")
    return(sprintf("%s %.2f", x[[1L]], x[[2L]]))
  sprintf("%s \"%s\"", x[[1L]], paste0(x[[2L]], collapse = ", "))
}

Now we are ready to write our function for randomly generating rules from a dataset. The idea is to first randomly select from all variables (except for the first one id), then apply random_rule to each selected, and finally summarize the results.

random_rule <- function(dt) {
  out <- vapply(
    dt[random_pick(names(dt)[-1L])[[2L]]], 
    report(random_select), character(1L)
  )
  paste(names(out), out, collapse = ", ")
}

As a result, we can simply do this for as many iterations as needed

set.seed(123)
data.frame(iteration = 1:10, results = replicate(10L, random_rule(my_data)))

Results

> set.seed(123)
> data.frame(iteration = 1:10, records = replicate(10L, random_rule(my_data)))
   iteration
1          1
2          2
3          3
4          4
5          5
6          6
7          7
8          8
9          9
10        10
                                                                                                                                                                                                                                                  records
1                                                                                                                                                                                             num_var_2 < 12.51, num_var_3 > 41.50, factor_var_1 = "A, B"
2                                                                                                                                         num_var_1 < 11.16, num_var_3 > 15.63, num_var_4 > -3.87, factor_var_2 = "BB", factor_var_4 = "AAAA, BBBB, DDDD"
3                                                                                                          num_var_1 < 9.87, num_var_2 < -1.32, num_var_3 > -5.54, num_var_4 > 24.09, num_var_5 < 3.28, factor_var_2 = "AA, BB, CC", factor_var_3 = "CCC"
4                                                        num_var_1 > 9.72, num_var_2 > -1.93, num_var_3 < 43.27, num_var_4 < 32.11, num_var_5 > -12.77, factor_var_1 = "B", factor_var_2 = "AA", factor_var_4 = "AAAA, BBBB, DDDD", factor_var_5 = "AAAA"
5                                                                                           num_var_1 > 10.51, num_var_2 > 13.61, num_var_3 > 22.14, num_var_4 < -2.75, factor_var_1 = "A, B, C", factor_var_3 = "AAA", factor_var_4 = "BBBB, DDDD, EEEE"
6                                                                                                                                                                                             factor_var_1 = "A, B, C", factor_var_5 = "BBBB, CCCC, EEEE"
7                                                                         num_var_1 > 9.34, num_var_2 < 18.59, num_var_3 < 7.39, num_var_4 > 16.66, num_var_5 > 35.48, factor_var_1 = "C", factor_var_2 = "AA, BB, CC", factor_var_4 = "AAAA, BBBB, CCCC"
8  num_var_1 > 10.66, num_var_2 > 25.74, num_var_3 > 13.81, num_var_4 > 31.73, num_var_5 > -2.40, factor_var_1 = "A, B, C", factor_var_2 = "AA, BB", factor_var_3 = "AAA, CCC, DDD", factor_var_4 = "AAAA, BBBB, CCCC, DDDD, EEEE", factor_var_5 = "DDDD"
9   num_var_1 < 10.78, num_var_2 < 11.86, num_var_3 < -7.95, num_var_4 < 7.12, num_var_5 > 39.57, factor_var_1 = "A, B, C", factor_var_2 = "AA, BB, CC", factor_var_3 = "CCC", factor_var_4 = "BBBB, EEEE", factor_var_5 = "AAAA, BBBB, CCCC, DDDD, EEEE"
10                                                                                                num_var_1 < 7.63, num_var_2 > 19.04, num_var_4 > 37.87, num_var_5 < -14.85, factor_var_1 = "A, B", factor_var_2 = "AA, CC", factor_var_4 = "AAAA, CCCC"

Put everything together

random_pick <- function(x) {
  sample_size <- sample.int(length(x), 1L)
  out <- x[sort(sample.int(length(x), sample_size))]
  list("=", out)
}

random_trunc <- function(x) {
  rng <- range(x)
  cutoff <- runif(1L, rng[[1L]], rng[[2L]])
  sgn <- c("<", ">")[[sample.int(2L, 1L)]]
  list(sgn, cutoff)
}

random_select <- function(x) {
  if (is.numeric(x))
    return(random_trunc(x))
  if (is.factor(x))
    return(random_pick(levels(x)))
  random_pick(unique(x))
}

report <- function(f) function(...) {
  x <- f(...)
  if (x[[1L]] != "=")
    return(sprintf("%s %.2f", x[[1L]], x[[2L]]))
  sprintf("%s \"%s\"", x[[1L]], paste0(x[[2L]], collapse = ", "))
}

random_rule <- function(dt) {
  out <- vapply(
    dt[random_pick(names(dt)[-1L])[[2L]]], 
    report(random_select), character(1L)
  )
  paste(names(out), out, collapse = ", ")
}

set.seed(123)
data.frame(iteration = 1:10, records = replicate(10L, random_rule(my_data)))
ekoam
  • 8,744
  • 1
  • 9
  • 22
  • @ ekoam: Thank you so much for your answer! I will have to spend some more time understanding the code you wrote, but it seems to be working just fine! Thanks so much! I will soon be able to "accept" your answer ! – stats_noob Dec 28 '21 at 23:02
  • Do you know if it is possible to take each row of the final data frame and use it to select rows from "my_data"? Something like this: https://stackoverflow.com/questions/70504566/r-selecting-rows-based-on-conditions-stored-in-a-data-frame ? Thank you so much! – stats_noob Dec 28 '21 at 23:05
  • @stats555 Do you mind changing the syntax of your filter rules to R's native syntax? For example, use `factor_var_1 %in% c("A", "B")` instead of `factor_var_1 = "A, B"`. – ekoam Dec 29 '21 at 00:38
1

You may define a function FUN(n) that creates a data set as shown in OP.

FUN <- function(n=1e3) {
  num_var_1 <- rnorm(n, 10, 1)
  num_var_2 <- rnorm(n, 10, 5)
  num_var_3 <- rnorm(n, 10, 10)
  num_var_4 <- rnorm(n, 10, 10)
  num_var_5 <- rnorm(n, 10, 10)
  factor_1 <- c("A", "B", "C")
  factor_2 <- c("AA", "BB", "CC")
  factor_3 <- c("AAA", "BBB", "CCC", "DDD")
  factor_4 <- c("AAAA", "BBBB", "CCCC", "DDDD", "EEEE")
  factor_5 <- c("AAAAA", "BBBBB", "CCCCC", "DDDDD", "EEEEE", "FFFFFF")
  factor_var_1 <- as.factor(sample(factor_1, n, replace=TRUE, 
                                   prob=c(0.3, 0.5, 0.2)))
  factor_var_2 <- as.factor(sample(factor_2, n, replace=TRUE, 
                                   prob=c(0.5, 0.3, 0.2)))
  factor_var_3 <- as.factor(sample(factor_3, n, replace=TRUE, 
                                   prob=c(0.5, 0.2, 0.2, 0.1)))
  factor_var_4 <- as.factor(sample(factor_4, n, replace=TRUE, 
                                   prob=c(0.5, 0.2, 0.1, 0.1, 0.1)))
  factor_var_5 <- as.factor(sample(factor_5, n, replace=TRUE, 
                                   prob=c(0.3, 0.2, 0.1, 0.1, 0.1, .2)))
  id <- 1:n
  return(data.frame(id, num_var_1, num_var_2, num_var_3, num_var_4, 
                    num_var_5, factor_var_1, factor_var_2, factor_var_3,
                    factor_var_4, factor_var_5))
}

Next, define (appropriate) expressions as strings in a list evl.

evl <- list(
  c('num_var_2 > 12', 'factor_var_1 %in% c("A", "C")', 
    'factor_var_4 %in% c("BBBB", "DDDD", "EEEE")'),
  c('num_var_1 > 0', 'num_var_3 < 10', 'factor_var_2 %in% c("AA", "BB", "CC")',
    'factor_var_3 %in% "AAA"', 'factor_var_5 %in% c("CCCCC", "DDDDD")'),
  c('num_var_2 < 5', 'num_var_5 < 10', 'factor_var_1 %in% "B"',
    'factor_var_3 %in% "AAA"'),
  c('factor_var_4 %in% "BBBB"')
)

Finally, in Map define a function that subsets the data of one replicateion according to the respective expressions using eval(parse(text=)). Use set.seed() outside the function to prevent the same data from being generated on each iteration.

set.seed(42)
result <- Map(\(x, y) x[with(x, eval(parse(text=paste(y, collapse=' & ')))), ],
              replicate(length(evl), FUN(), simplify=FALSE),
              evl)

Note: R version 4.1.2 (2021-11-01)

Gives

str(result)
# List of 4
# $ :'data.frame':  59 obs. of  11 variables:
#   ..$ id          : int [1:59] 3 6 25 29 32 34 52 54 58 93 ...
# ..$ num_var_1   : num [1:59] 9.99 10.95 9.38 8.53 9.65 ...
# ..$ num_var_2   : num [1:59] 13.6 17.4 20.3 19.3 16.1 ...
# ..$ num_var_3   : num [1:59] 9.42 18.67 6.1 25.71 -2.73 ...
# ..$ num_var_4   : num [1:59] 6.29 9.22 3.68 16.27 15.77 ...
# ..$ num_var_5   : num [1:59] 13.37 18.86 4.89 24.18 26.11 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 3 1 3 1 3 3 1 3 1 1 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 3 1 1 1 2 3 3 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 2 1 1 4 2 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 5 2 2 2 2 2 5 2 4 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 3 5 4 4 6 1 6 ...
# $ :'data.frame':  53 obs. of  11 variables:
#   ..$ id          : int [1:53] 2 14 28 36 49 59 75 103 134 137 ...
# ..$ num_var_1   : num [1:53] 9.67 11.61 11.22 10.14 10.5 ...
# ..$ num_var_2   : num [1:53] 10.89 7.12 2.38 13.28 10.88 ...
# ..$ num_var_3   : num [1:53] 5.87 7.33 2.88 -10.78 4.09 ...
# ..$ num_var_4   : num [1:53] 19.239 6.261 -0.158 14.586 -0.544 ...
# ..$ num_var_5   : num [1:53] -5.1 21.04 2.81 1.76 27.19 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 1 2 3 2 3 3 2 3 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 2 2 3 3 3 3 2 1 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 1 5 5 1 4 4 4 4 1 4 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 4 4 3 3 4 4 4 4 3 ...
# $ :'data.frame':  20 obs. of  11 variables:
#   ..$ id          : int [1:20] 3 44 91 181 222 233 241 287 293 302 ...
# ..$ num_var_1   : num [1:20] 12 10.26 9.65 8.48 12.1 ...
# ..$ num_var_2   : num [1:20] 3.68 3.61 3.28 4.01 1.78 ...
# ..$ num_var_3   : num [1:20] 4.113 -3.481 17.654 0.496 5.457 ...
# ..$ num_var_4   : num [1:20] 9.25 19.79 17.15 -4.72 22.16 ...
# ..$ num_var_5   : num [1:20] 6 8.49 4.31 4.67 1.96 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 2 1 3 1 1 1 1 3 2 1 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 3 1 1 1 1 1 1 1 1 1 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 5 1 1 1 2 6 1 2 ...
# $ :'data.frame':  205 obs. of  11 variables:
#   ..$ id          : int [1:205] 7 10 23 24 27 29 31 33 38 40 ...
# ..$ num_var_1   : num [1:205] 10.23 9.78 8.92 10.16 9.93 ...
# ..$ num_var_2   : num [1:205] 23.49 13.06 12.17 16.88 7.93 ...
# ..$ num_var_3   : num [1:205] 6.33 9.33 14.04 21.66 28.56 ...
# ..$ num_var_4   : num [1:205] 16.33 -1.805 0.509 21.2 15.158 ...
# ..$ num_var_5   : num [1:205] 8.48 -1.31 5.03 15.07 19.48 ...
# ..$ factor_var_1: Factor w/ 3 levels "A","B","C": 1 1 2 1 2 1 2 2 3 2 ...
# ..$ factor_var_2: Factor w/ 3 levels "AA","BB","CC": 3 1 1 2 1 1 1 2 1 3 ...
# ..$ factor_var_3: Factor w/ 4 levels "AAA","BBB","CCC",..: 1 2 3 1 3 4 3 1 3 2 ...
# ..$ factor_var_4: Factor w/ 5 levels "AAAA","BBBB",..: 2 2 2 2 2 2 2 2 2 2 ...
# ..$ factor_var_5: Factor w/ 6 levels "AAAAA","BBBBB",..: 3 5 2 6 6 2 6 1 2 2 ...
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • @ jay.sf : Thank you so much for your answer! But I don't think this is what I am looking for? I am looking for something more along the lines of this: https://stackoverflow.com/questions/70477594/r-creating-a-function-to-randomly-replace-data-from-a-data-frame – stats_noob Dec 26 '21 at 20:13
  • As an example, imagine there are three hats and each hat contains 10 balls. there are also three buckets of sand, and each bucket contains 1 kilogram of sand. There are 6 containers all together (3 hats and 3 buckets). First, you pick a random number of containers (e.g. imagine you pick 3 containers: hat 2, hat 3, bucket 1). Then, you pick a random number of balls from hat 2 (e.g. 4 balls), a random number of balls from hat 3 (e..g 7 balls), and scoop out a random amount of sand from bucket 1 (e.g. 153 grams of sand). This is iteration #1. – stats_noob Dec 26 '21 at 20:16
  • Then, you repeat this whole exercise again - this time, you pick hat 1, bucket 2, bucket 3. This time, you happen to choose 5 balls from hat 1, 103 grams of sand from bucket 2 and 641 grams of sand from bucket 3. This is iteration #2 – stats_noob Dec 26 '21 at 20:17
  • I would like to create a "n" rows x 2 column table that contains the results from each iteration. The first column is the iteration number, and the second column contains which variables were randomly selected, and what "ranges" were selected from each of those variables. – stats_noob Dec 26 '21 at 20:19
  • Thank you so much for all your help! – stats_noob Dec 26 '21 at 20:19