I have a data frame called test.data with a column called Ethnicity. It contains three ethnic groups (more in the actual data): Adygei, Balochi, and Biaka_Pygmies. I want to subset this data frame so that it keeps only two randomly chosen samples (rows) from each ethnic group. How can I do this in R?

test.data <-  structure(list(Sample = c("1793102418_A", "1793102460_A", "1793102500_A", 
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A", 
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A", 
"1775705355_A"), Ethnicity = c("Adygei", "Adygei", "Adygei", 
"Adygei", "Balochi", "Balochi", "Balochi", "Balochi", "Balochi", 
"Biaka_Pygmies", "Biaka_Pygmies", "Biaka_Pygmies"), Height = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Sample", "Ethnicity", 
"Height"), row.names = c("1793102418_A", "1793102460_A", "1793102500_A", 
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A", 
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A", 
"1775705355_A"), class = "data.frame")

Desired result (one possible random draw):

                        Sample     Ethnicity Height
    1793102418_A 1793102418_A        Adygei      0
    1793102460_A 1793102460_A        Adygei      0
    1749751189_A 1749751189_A       Balochi      0
    1749751285_A 1749751285_A       Balochi      0
    1749751195_A 1749751195_A Biaka_Pygmies      0
    1775705355_A 1775705355_A Biaka_Pygmies      0
MAPK

1 Answer

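We can use data.table. Convert the data.frame to a data.table with setDT(test.data) (this converts test.data in place), then, grouping by Ethnicity, sample two row positions within each group and use them to subset .SD, which holds the rows of the current group.

library(data.table)
setDT(test.data)[, .SD[sample(.N, 2)], by = Ethnicity]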
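If some groups might contain fewer than two rows (an assumption; every group in the example has at least three), sampling two positions would stop with an error. A minimal sketch that caps the sample size at the group size and fixes a seed so the draw is reproducible could look like the following. The stand-in data, the seed value, and the names dt and picked are illustrative only.

```r
library(data.table)

# Small stand-in for test.data (hypothetical values, same column layout)
test.data <- data.frame(
  Sample    = paste0("S", 1:12),
  Ethnicity = rep(c("Adygei", "Balochi", "Biaka_Pygmies"), times = c(4, 5, 3)),
  Height    = 0,
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only so the draw can be repeated

# as.data.table() makes a copy, so test.data itself stays a plain data.frame
dt <- as.data.table(test.data)

# Within each group, draw at most 2 row positions out of 1..N
picked <- dt[, .SD[sample(.N, min(2L, .N))], by = Ethnicity]
picked
```

Using as.data.table() instead of setDT() here is a design choice to avoid modifying the original object by reference; either works for the sampling itself.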

Alternatively, with tapply from base R, sample two row indices per group and use them to index the data frame:

test.data[ with(test.data, unlist(tapply(seq_len(nrow(test.data)),
                     Ethnicity, FUN = sample, 2))), ]
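One caveat with the base R version: when sample() is given a single number n, it samples from 1:n rather than returning n itself, so a group with exactly one row could yield unrelated row indices. A hedged sketch that sidesteps this by sampling positions within each group's index vector is shown below. The helper name sample_rows, the stand-in data, and the choice to keep small groups whole are assumptions for illustration.

```r
# Small stand-in for test.data (hypothetical values, same column layout)
test.data <- data.frame(
  Sample    = paste0("S", 1:12),
  Ethnicity = rep(c("Adygei", "Balochi", "Biaka_Pygmies"), times = c(4, 5, 3)),
  Height    = 0,
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick up to `n` entries from one group's row indices.
# Sampling positions (not the indices themselves) avoids the sample(n) pitfall.
sample_rows <- function(idx, n = 2) {
  idx[sample.int(length(idx), min(n, length(idx)))]
}

set.seed(42)  # arbitrary seed, only so the draw can be repeated

# Split row numbers by group, sample within each group, then subset
groups <- split(seq_len(nrow(test.data)), test.data$Ethnicity)
rows   <- unlist(lapply(groups, sample_rows), use.names = FALSE)
result <- test.data[rows, ]
result
```

With this helper, a group that has only one row contributes that single row instead of raising an error or returning rows from other groups.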
akrun