
I have a set of stimuli (statements), half of which are true and half false. I'd like to randomly assign them to 4 sets, each containing an equal number of statements, of which half are true and half are false.

Here's what I've got so far, but I still need the randomisation into the 4 sets to be based on the contents of a specific binary column (i.e., whether the statement is true or false):

statements <- data.frame(item_ID = c("1", "3", "4", "5", "6", "7"), 
           item = c("The first windmills were built in Persia.", 
"Blackberries, raspberries, and strawberries belong to the Rose family.", 
"The painting “Bal du moulin de la Galette” was created by Renoir.", 
"The name of the Russian space platform Mir means ‘peace’.", 
"The Congo has the largest water flow rate of any river in Africa.", 
"Alberto Fujimori served as president of Peru from 1990 - 2000."
), actual_truth = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE"
), source = c("DK", "DK", "DK", "DK", "DK", "DK"))

ns <- nrow(statements) * c(0.25, 0.25, 0.25, 0.25)
sum(ns)

rep(1:4, times = ns)

set.seed(4)
head(samp <- sample(rep(1:4, times = ns)))

set1 <- statements[samp == 1,]
set2 <- statements[samp == 2,]
set3 <- statements[samp == 3,]
set4 <- statements[samp == 4,]
Emma
  • To clarify, I would like to do this in R. – Emma Sep 18 '19 at 18:25
  • Welcome to SO, Emma! Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`), and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Sep 18 '19 at 18:26
  • Further, there are likely many questions on SO about this topic (related: test/train sampling), as it comes up frequently. https://stackoverflow.com/a/56278115/3358272 might be applicable (assigning fixed ratios of different groups), start with a look there. – r2evans Sep 18 '19 at 18:43
  • Thank you r2evans. I've used the code from one of the answers you've suggested and now I have 4 sets of equal size. But I still don't know how to ensure that each set contains half true and half false statements (i.e., randomise to 4 sets but based on a specific binary column). Above I've added the code I have so far. – Emma Sep 19 '19 at 09:38
  • Emma, you're making good headway on this, thank you. However, imagine what we see on our console: the first thing is `Error: object 'statements' not found`. This is why we prefer minimal *working* examples that are complete-enough to run in a vanilla/fresh environment. If in doubt, start a fresh R instance/project (ensure `ls()` is empty) and run the code you give us, and see what happens. Depending on the data, you can either make fake data (`data.frame(...)`) or a sample of yours (e.g., `dput(head(statements))`) and the expected output from that sample of data. – r2evans Sep 19 '19 at 14:46
  • Aha, thanks @r2evans. I've added some fake data now so it should run OK. – Emma Sep 19 '19 at 15:28

1 Answer


Assigning even bins

Some options, starting with dplyr:

library(dplyr)
set.seed(42)
statements <- statements %>%
  group_by(actual_truth) %>%
  mutate(samp = sample(rep(1:4, length.out = n()))) %>%
  ungroup()
statements
# # A tibble: 6 x 5
#   item_ID item                                                                   actual_truth source  samp
#   <fct>   <fct>                                                                  <fct>        <fct>  <int>
# 1 1       The first windmills were built in Persia.                              TRUE         DK         2
# 2 3       Blackberries, raspberries, and strawberries belong to the Rose family. TRUE         DK         3
# 3 4       The painting  Bal du moulin de la Galette  was created by Renoir.      TRUE         DK         4
# 4 5       The name of the Russian space platform Mir means  peace .              TRUE         DK         1
# 5 6       The Congo has the largest water flow rate of any river in Africa.      FALSE        DK         2
# 6 7       Alberto Fujimori served as president of Peru from 1990 - 2000.         FALSE        DK         1

To verify how many you have in each:

xtabs(~ actual_truth + samp, data = statements)
#             samp
# actual_truth 1 2 3 4
#        FALSE 1 1 0 0
#        TRUE  1 1 1 1
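
The sample data has four true and two false statements, so the table above cannot come out balanced. As a minimal sketch with made-up data (the `demo` frame and its 8 true / 8 false rows are assumptions for illustration, not part of the question), the same approach gives every set the same number of true and false statements when each truth group's size is a multiple of 4:

```r
library(dplyr)

# Made-up data (an assumption for illustration): 8 true and 8 false statements
demo <- data.frame(item_ID = 1:16,
                   actual_truth = rep(c("TRUE", "FALSE"), each = 8))

set.seed(42)
demo <- demo %>%
  group_by(actual_truth) %>%
  # within each truth value, shuffle the labels 1,2,3,4,1,2,3,4
  mutate(samp = sample(rep(1:4, length.out = n()))) %>%
  ungroup()

# cross-tabulate truth value against set; every cell should be 2
counts <- xtabs(~ actual_truth + samp, data = demo)
counts
stopifnot(all(counts == 2))
```

When a group's size is not a multiple of 4, `rep(..., length.out = )` hands the leftover rows to the lowest-numbered sets, which is why the two FALSE rows in the sample data end up in sets 1 and 2.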

Base R:

statements <- do.call(rbind,
                      by(statements, statements$actual_truth,
                         function(x) transform(x, samp = sample(rep(1:4, length.out = nrow(x))))))

(Note: by and dplyr order things somewhat differently, so their results differ even with the same set.seed call. This is due solely to the order of processing, not to the correctness of either implementation.)
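
If the original row order and row names matter, one possible base R alternative to the by() version is `ave`, which applies a function within groups and returns the results in the original row positions. This is a sketch of a swapped-in technique, and the `demo` data is made up for illustration:

```r
# Made-up data (an assumption for illustration): 4 true and 4 false statements
demo <- data.frame(item_ID = 1:8,
                   actual_truth = rep(c("TRUE", "FALSE"), times = 4))

set.seed(42)
# ave() splits the row indices by actual_truth, applies FUN to each piece,
# and puts the results back in the original row positions
demo$samp <- ave(seq_len(nrow(demo)), demo$actual_truth,
                 FUN = function(i) sample(rep(1:4, length.out = length(i))))

# rows stay in their original order; each set gets one TRUE and one FALSE
xtabs(~ actual_truth + samp, data = demo)
```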


data.table:

library(data.table)
statementsDT <- copy(statements)
setDT(statementsDT)
statementsDT[, samp := sample(rep(1:4, length.out = .N)), by = actual_truth]

(Note: ditto the note above.)


Separate into different groups

For this step, you can do what you did in your question (assign to set1 through set4). However, whatever you do to one set you will presumably do identically to the others, so it is far better to either (1) keep them in the same frame and process them with a natural grouping operation (e.g., dplyr::group_by or data.table's by= argument); or (2) split them into a list and process that list with lapply.

For instance:

sets <- split(statements, statements$samp)

generates a list, in this case of length 4, whose elements are ordered by the sorted values of the key ($samp in this case) and named after them.

Let's say you wrote a function myfunc that deals with one of your sets, then you would do

out <- lapply(sets, myfunc)

to process each of your sets with the function. (No need to do each samp==1 individually.)
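
A minimal sketch of that pattern, with made-up data and a stand-in `myfunc` (both are assumptions for illustration; here `actual_truth` is stored as a logical rather than as text, and the set assignment is fixed rather than random):

```r
# Made-up data with a set assignment already in place (an assumption for illustration)
demo <- data.frame(item_ID = 1:8,
                   actual_truth = rep(c(TRUE, FALSE), times = 4),
                   samp = rep(1:4, each = 2))

# Hypothetical per-set function: proportion of true statements in one set
myfunc <- function(x) mean(x$actual_truth)

sets <- split(demo, demo$samp)
names(sets)   # list names come from the values of samp

out <- lapply(sets, myfunc)
str(out)      # one result per set, named like the sets
```

Because `out` keeps the names of `sets`, a single set's result can be pulled out by name, e.g. `out[["1"]]`.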

r2evans