Are there any functions in R that arrange observations into groups of N so that each group reflects, as closely as possible, the data set proportions of certain variables?

For example, suppose I have a data set with 8 observations and two variables, each with two levels, with data set proportions as follows:

    Var1 Var2
1   0.5  0.5
2   0.5  0.5

Are there any functions that would let me sample from the data set optimally to create, say, groups of 2 observations that reflect the above data set proportions?

Example data:

Data <- read.table(text="   Obs Var1    Var2    
    1   1   1   
    2   1   2   
    3   2   1   
    4   2   2   
    5   1   1   
    6   1   2   
    7   2   1   
    8   2   2   ", header=TRUE)

Desired Result:

Result <- read.table(text=" Obs Var1    Var2    Group_ID    
    1   1   1   1   
    4   2   2   1   
    2   1   2   2   
    3   2   1   2   
    5   1   1   3   
    8   2   2   3   
    6   1   2   4   
    7   2   1   4   ", header=TRUE)

Note that all groups have proportions of 0.5 for each level of each variable.
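
One way to check a candidate grouping against the data set proportions might look like the following. This is only a minimal sketch: `group_props` and `max_deviation` are hypothetical helper names, not functions from any package, and the `Group_ID` vector is an illustrative hand-made grouping of the example data.

```r
# Example data from the question, rebuilt so the block is self-contained.
Data <- data.frame(
  Obs  = 1:8,
  Var1 = c(1, 1, 2, 2, 1, 1, 2, 2),
  Var2 = c(1, 2, 1, 2, 1, 2, 1, 2)
)

# Illustrative grouping (an assumption): pairs observations 1+4, 2+3, 5+8, 6+7.
Data$Group_ID <- c(1, 2, 2, 1, 3, 4, 4, 3)

# Proportion of each level of `var` within each group (rows sum to 1).
group_props <- function(data, var, group = "Group_ID") {
  prop.table(table(data[[group]], data[[var]]), margin = 1)
}

# Largest absolute gap between any group's level proportions
# and the proportions in the whole data set.
max_deviation <- function(data, var, group = "Group_ID") {
  overall <- as.vector(prop.table(table(data[[var]])))
  within  <- group_props(data, var, group)
  max(abs(sweep(within, 2, overall)))
}

# One deviation score per variable; smaller means a closer match.
deviations <- sapply(c("Var1", "Var2"), function(v) max_deviation(Data, v))
```

A score like `deviations` could also serve as a rough measure of how well a grouping matches when an exact solution is not possible.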

RTrain3k
  • Do you want this to generalize? And if so, in what way - more variables, more different proportions? Is the group size an input, or does it need to be detected? What happens in cases where an exact solution isn't possible? – Gregor Thomas Feb 03 '17 at 18:46
  • The data that I am trying to implement this on has 4 variables, each with 3 to 4 levels. I am trying to have each group reflect, as closely as possible, the data set proportions of these levels. An exact solution will not be possible, so I was thinking of just randomly assigning the remaining observations to groups. The other issue is that I have no idea how to optimize the groups, i.e. how to test whether the observations are efficiently grouped. – RTrain3k Feb 03 '17 at 19:12
  • And for what purpose are you doing this? I would recommend just doing a random sample or a stratified random sample based on one or two variables. Bootstrap it if you need to reduce variance. If you ignore that advice and want to proceed with the question, I think you need a slightly messier example and to be clear about what the inputs are and what is to be detected/calculated. I still don't know if the group size `N` is an input variable or not. – Gregor Thomas Feb 03 '17 at 19:17
  • Sorry about that @Gregor. Group size `N` would be an input. For example, my data frame has 660 observations and I am trying to arrange them into groups of 10 observations each. I found a function in the second answer here, http://stackoverflow.com/questions/13536537/partitioning-data-set-in-r-based-on-multiple-classes-of-observations, but it does not handle variables that have differing numbers of levels. – RTrain3k Feb 03 '17 at 19:31
  • One other algorithm proposal: randomly assign groups of size `N`. Save and remove any groups that meet your requirements (preferably with some tolerance factor) and repeat on the remaining data. Iterate until you're satisfied. – Gregor Thomas Feb 03 '17 at 19:34
  • Thanks @Gregor, I will work on implementing your suggestion. I have no idea how to incorporate things like tolerance factors or saving some groups while discarding others. Do you know of any tutorials or sample code that I can learn from? – RTrain3k Feb 03 '17 at 19:53
  • It's all about how you define "good enough". The more explicit you can be, the better. For example, in testing a subset, a "zero-tolerance" way might be `mean(subset$var1 == var1_level1) == var1_level1_goal`. A "with tolerance" way to make the same check would be `abs(mean(subset$var1 == var1_level1) - var1_level1_goal) < tolerance`. – Gregor Thomas Feb 03 '17 at 19:56
  • But again, I'd come back to *"why are you doing this"*. It seems like a lot of effort for a minuscule gain. Just do a random assignment or stratified random assignment. – Gregor Thomas Feb 03 '17 at 19:57
  • http://stats.stackexchange.com/questions/151735/stratified-random-sampling-implementation-how-to-in-r; less accurate: http://stackoverflow.com/questions/23479512/stratified-random-sampling-from-data-frame-in-r – Qbik Feb 03 '17 at 21:02
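
The algorithm proposed in the comments (randomly assign groups of size `N`, keep the groups that pass a tolerance check, repeat on the rest) could be sketched roughly as below. This is an illustration under stated assumptions rather than a tested solution: `group_ok` and `assign_groups` are hypothetical names, the `max_iter` cap is an added safeguard, and the check uses `<=` with a small epsilon so that `tolerance = 0` still works with floating-point proportions.

```r
# Example data from the question.
Data <- data.frame(
  Obs  = 1:8,
  Var1 = c(1, 1, 2, 2, 1, 1, 2, 2),
  Var2 = c(1, 2, 1, 2, 1, 2, 1, 2)
)

# TRUE if, for every variable in `vars`, each level's proportion in `subset`
# is within `tolerance` of its proportion in the full data set.
group_ok <- function(subset, full, vars, tolerance) {
  all(vapply(vars, function(v) {
    lev      <- sort(unique(full[[v]]))
    target   <- prop.table(table(factor(full[[v]], levels = lev)))
    observed <- prop.table(table(factor(subset[[v]], levels = lev)))
    # The small epsilon guards against floating-point rounding.
    all(abs(observed - target) <= tolerance + 1e-9)
  }, logical(1)))
}

# Repeatedly shuffle the unassigned rows into candidate groups of size `n`
# and keep only the candidates that pass `group_ok`.
assign_groups <- function(data, vars, n, tolerance = 0, max_iter = 1000) {
  data$Group_ID <- NA_integer_
  next_id <- 1L
  for (i in seq_len(max_iter)) {
    remaining <- which(is.na(data$Group_ID))
    if (length(remaining) < n) break
    shuffled   <- remaining[sample.int(length(remaining))]
    n_groups   <- length(shuffled) %/% n
    candidates <- split(shuffled[seq_len(n_groups * n)],
                        rep(seq_len(n_groups), each = n))
    for (rows in candidates) {
      if (group_ok(data[rows, ], data, vars, tolerance)) {
        data$Group_ID[rows] <- next_id
        next_id <- next_id + 1L
      }
    }
  }
  data  # rows that never fit a passing group keep Group_ID = NA
}

set.seed(1)
grouped <- assign_groups(Data, vars = c("Var1", "Var2"), n = 2, tolerance = 0)
```

Rows still marked `NA` after `max_iter` passes would need separate handling, for example the random assignment of leftovers mentioned earlier in the comments, or a rerun with a larger `tolerance`.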

0 Answers