
I want to use R to sample my data frame. My data are timestamped epidemiological records, and I want to randomly sample at least 1 and at most 10 records for each year, preferably scaled to the number of records in that year. I would like to export the results as a CSV.

Here are a few lines of my dataset; I've left off the long genetic sequence field for each record.

year    matrix  USD clade  
1958    W   mG018U  UP  
1958    W   mG018U  UP  
1958    W   mG018U  UP  
1966    UN  mG140L  LL  
1969    UN  mG207L  LL  
1969    UN  mG013L  LL  
1971    UN  mG208L  LL  
1972    HA  mG129M  MN  
1973    C1  mG018U  UP  
1973    NA  mG001U  UC  
1973    NA  mG001U  UC

All I've learned to do is

sample(mydata, size = 600, replace = FALSE)

which, of course, doesn't take the year into account.

Roman Luštrik
user21068
  • Please show a few lines of your dataset. – akrun Jan 26 '15 at 19:03
  • Please provide an example data frame and some examples of your desired output so that people can help you. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – keegan Jan 26 '15 at 19:03
  • What's your question? – Bonifacio2 Jan 26 '15 at 19:03
  • You have to make some effort to code it yourself and post that code here when you hit a specific problem; you can't just ask a *"Give me teh codez"* type question. Otherwise this will be closed quickly for lack of effort. – smci Jan 26 '15 at 19:17
  • Thanks for the etiquette suggestions, I'm clearly new to this site. Do I use blockquote for an example of my data? – user21068 Jan 26 '15 at 19:18
  • @user21068 Take a look at [this section](http://stackoverflow.com/help/how-to-ask). – Bonifacio2 Jan 26 '15 at 19:21
  • Also, I'm trying very hard to teach myself this stuff, but I don't come at it from any kind of coding background, so even HTML code suggestions are a bit opaque. I don't know how else to learn except to wade in and hope for kind help. – user21068 Jan 26 '15 at 19:25
  • `sample` does not take account of the year, no. That is an implementation detail you would need to add, just as `sample` does not take into account today's weather, or whether it's day or night. If you want your "random" sampling to have a non-random dimension/aspect to it (possibly based on your data), then you need to implement that. Perhaps you want more recent records to be more likely to be sampled? In which case, you can provide a `prob` argument to `sample`. Hit up `?sample` to see more. – Rusan Kax Jan 26 '15 at 20:16
  • Rusan Kax, I realize that I need to add additional arguments, and I have read ?sample. I don't know how to implement prob to do what I need, though, which is why I am here. – user21068 Jan 26 '15 at 20:41
  • I'm doing my best to provide adequate information and evidence of my own effort here. Is that worth taking the -2 stigma off my post? – user21068 Jan 26 '15 at 22:25
  • I think you should split by year, count the number of rows and implement some logic (for example using `if()`) to control the number of samples taken. As for the down votes, it helps to have a question well thought out from the start; it's a PITA to get the downvoters back to correct their vote. I invite you to learn how to ask questions from some of the highest voted questions on SO [tag:r]. – Roman Luštrik Jan 27 '15 at 11:22
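
The split-by-year idea from the last comment can be sketched in base R, with no extra packages. This is only a sketch under assumptions: `sample_year` is a hypothetical helper, the 0.1 fraction and the cap of 10 are illustrative values, and `min()` with `ceiling()` stands in for the `if()` logic the comment mentions. Note also that `sample(mydata, ...)` on a data frame draws columns rather than rows, so the sketch samples row indices instead.

```r
# Toy data mirroring the question's columns (sequence field omitted).
mydata <- data.frame(
  year   = c(1958, 1958, 1958, 1966, 1969, 1969, 1971, 1972, 1973, 1973, 1973),
  matrix = c("W", "W", "W", "UN", "UN", "UN", "UN", "HA", "C1", "NA", "NA"),
  USD    = c("mG018U", "mG018U", "mG018U", "mG140L", "mG207L", "mG013L",
             "mG208L", "mG129M", "mG018U", "mG001U", "mG001U"),
  clade  = c("UP", "UP", "UP", "LL", "LL", "LL", "LL", "MN", "UP", "UC", "UC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw between 1 and max_n rows from one year's records,
# scaled by `frac` (both defaults are assumptions, adjust to taste).
sample_year <- function(d, frac = 0.1, max_n = 10) {
  n <- min(max_n, ceiling(nrow(d) * frac))
  # sample.int() draws row indices and behaves well when nrow(d) is 1
  d[sample.int(nrow(d), n), , drop = FALSE]
}

set.seed(42)                            # only to make the draw repeatable
by_year <- split(mydata, mydata$year)   # one data frame per year
sampled <- do.call(rbind, lapply(by_year, sample_year))
rownames(sampled) <- NULL
```

With the eleven example rows every year holds fewer than ten records, so each of the six years contributes exactly one row to `sampled`.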

1 Answer


There are many ways to run sample per group (for example, sample_n in the dplyr package); here's an illustration using the data.table package.

You can set a fraction, say 0.1, of each year's records to sample so that the sample size is relative to the group size, wrap it in ceiling so that at least 1 record is drawn when that fraction comes to less than 1, and cap it at 10 per group using the min function. For example:

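library(data.table)
# Per year: draw ceiling(10% of the rows), capped at 10.
# .N is the number of rows in the current group; .SD is that group's data.
setDT(df)[, .SD[sample(.N, min(10, ceiling(.N * 0.1)))], by = year]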
#   year matrix    USD clade
#1: 1958      W mG018U    UP
#2: 1966     UN mG140L    LL
#3: 1969     UN mG013L    LL
#4: 1971     UN mG208L    LL
#5: 1972     HA mG129M    MN
#6: 1973     NA mG001U    UC 
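
The question also asks for a CSV export, which the snippet above does not cover. A minimal sketch, assuming the sampled rows are first stored in a variable (here called `sampled`) and using base R's `write.csv`; the file name `sampled_records.csv` and the cut-down stand-in `df` are hypothetical:

```r
library(data.table)

# Stand-in for the full data set, using two of the question's columns.
df <- data.frame(
  year  = c(1958, 1958, 1958, 1966, 1969, 1969, 1971, 1972, 1973, 1973, 1973),
  clade = c("UP", "UP", "UP", "LL", "LL", "LL", "LL", "MN", "UP", "UC", "UC")
)

set.seed(1)  # optional: makes the random draw repeatable
sampled <- setDT(df)[, .SD[sample(.N, min(10, ceiling(.N * 0.1)))], by = year]

# "sampled_records.csv" is a hypothetical file name, written to the
# current working directory; row.names = FALSE omits the row-number column.
write.csv(sampled, "sampled_records.csv", row.names = FALSE)
```

Calling `set.seed` is optional; without it, each run draws a different random sample.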
David Arenburg
  • Thank you David, I am trying to parse your suggestion now. My version of R (3.0.2) does not have a data.table package, though? – user21068 Jan 30 '15 at 21:43
  • What does .N mean? Is that the 0.1 fraction I'm targeting? – user21068 Jan 30 '15 at 22:59
  • `data.table` is a package. You need to install it first using `install.packages("data.table")` and only then run the code above. `.N` denotes the number of observations you have in each group, see `?data.table`. – David Arenburg Jan 31 '15 at 18:44
  • Thanks David, I realized my problem too late to edit my answer: I was without internet to be able to get the data.table package. I'm getting that going now. – user21068 Feb 01 '15 at 20:06
  • So if .N is the number of observations in each group, is that a preset or static group size, or can the function assess the group size for each as it works? The number of records I have in each year is highly variable. – user21068 Feb 01 '15 at 20:11
  • `.N` is dynamic and accesses each group's size separately. Please read `?data.table`. – David Arenburg Feb 01 '15 at 20:12
  • I am getting many errors when I try to load the data.table package, so I cannot access ?data.table. I am trying to bring the load error to the data.table users group, but despite registering as a user, my question is being rejected because my email address is unknown. The error I get when either loading or calling library(data.table) is: Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘stringr’ Error: package or namespace load failed for ‘data.table’ – user21068 Feb 02 '15 at 20:51
  • Close R and restart it. Then run `install.packages("data.table")`, did you get any errors? – David Arenburg Feb 02 '15 at 20:54
  • That worked, but when I call library(data.table) I get the error: Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called ‘stringr’ Error: package or namespace load failed for ‘data.table’ – user21068 Feb 03 '15 at 21:00
  • I don't know how to help you, sorry. Maybe try downloading `data.table` straight from Github: https://github.com/Rdatatable/data.table/wiki/Installation – David Arenburg Feb 03 '15 at 21:03
  • I was able to get it to work by installing that last "stringr" package. Now I'll start playing with your suggestion. Thanks for your patience, and thanks for the GitHub link, that is interesting. – user21068 Feb 04 '15 at 00:18
  • Yippee! I've finally gotten it to work! Thanks David! – user21068 Feb 24 '15 at 22:24
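
The comments above describe `.N` as the row count of whichever group is being processed. A small sketch may make that concrete; the table `dt` and its group sizes are made up for illustration, and the `records` and `to_sample` column names are hypothetical:

```r
library(data.table)

# Toy table with deliberately uneven group sizes (values are made up).
dt <- data.table(year = rep(c(1958, 1966, 1973), times = c(3, 1, 250)))

# .N is re-evaluated for every `by` group, so each year gets its own count
# and therefore its own sample size.
sizes <- dt[, .(records   = .N,
                to_sample = min(10, ceiling(.N * 0.1))),
            by = year]
print(sizes)
```

Here the years with 3 and 1 records would each contribute 1 sampled row, while the year with 250 records is capped at 10.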