Let x be a dataset with 5 variables and 15 observations:
age gender height weight fitness
17 M 5.34 68 medium
23 F 5.58 55 medium
25 M 5.96 64 high
25 M 5.25 60 medium
18 M 5.57 60 low
17 F 5.74 61 low
17 M 5.96 71 medium
22 F 5.56 75 high
16 F 5.02 56 medium
21 F 5.18 63 low
20 M 5.24 57 medium
15 F 5.47 72 medium
16 M 5.47 61 high
22 F 5.88 73 low
18 F 5.73 62 medium
The frequencies of the values for the fitness variable are as follows: low = 4, medium = 8, high = 3.
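For reproducibility, x can be rebuilt in R from the listing above and the stated frequencies checked with `table()`. This is a minimal sketch; the names `fit_counts` and `fit_props` are chosen here for illustration only.

```r
# Rebuild x from the rows listed above
x <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
age gender height weight fitness
17 M 5.34 68 medium
23 F 5.58 55 medium
25 M 5.96 64 high
25 M 5.25 60 medium
18 M 5.57 60 low
17 F 5.74 61 low
17 M 5.96 71 medium
22 F 5.56 75 high
16 F 5.02 56 medium
21 F 5.18 63 low
20 M 5.24 57 medium
15 F 5.47 72 medium
16 M 5.47 61 high
22 F 5.88 73 low
18 F 5.73 62 medium
")

# Order the levels so tables print as low, medium, high
x$fitness <- factor(x$fitness, levels = c("low", "medium", "high"))

fit_counts <- table(x$fitness)       # counts per fitness level
fit_props  <- prop.table(fit_counts) # target proportions for sampling from y
```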
Suppose I have another dataset y with the same 5 variables but 100 observations. The frequencies of the values for the fitness variable in this dataset are as follows: low = 42, medium = 45, high = 13.
Using R, how can I obtain a representative sample from y such that the distribution of fitness in the sample closely matches the distribution of fitness in x?
My initial idea was to use the sample function in R and pass weighted probabilities to the prob argument. However, using probabilities would force an exact match for the frequency distribution. My objective is to get a close enough match while maximizing the sample size.
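To make "close enough while maximizing the sample size" concrete, one possible formulation is: pick per-level counts from y so that no sample proportion differs from the corresponding proportion in x by more than some tolerance, and among those keep the largest total. The sketch below does this by brute force over all feasible counts. The tolerance `tol = 0.05` is an assumed value, and `y` here is a stand-in containing only an id and the fitness column, since the actual rows of y are not listed.

```r
# Target proportions from x and available counts in y (both from the question)
target <- c(low = 4, medium = 8, high = 3) / 15
avail  <- c(low = 42, medium = 45, high = 13)

# Assumed tolerance: largest allowed absolute gap between a sample
# proportion and the matching target proportion
tol <- 0.05

# Enumerate every feasible combination of per-level counts
grid <- expand.grid(low    = 0:avail[["low"]],
                    medium = 0:avail[["medium"]],
                    high   = 0:avail[["high"]])
grid$n <- rowSums(grid)
grid <- grid[grid$n > 0, ]

# Largest absolute deviation from the target proportions per combination
props <- as.matrix(grid[, names(target)]) / grid$n
grid$dev <- apply(abs(sweep(props, 2, target)), 1, max)

# Keep combinations within tolerance; prefer the largest sample,
# breaking ties by the smallest deviation
ok    <- grid[grid$dev <= tol, ]
best  <- ok[order(-ok$n, ok$dev), ][1, ]
quota <- unlist(best[names(target)])

# Stand-in for y (hypothetical: only id and fitness are modelled)
set.seed(42)
y <- data.frame(id = seq_len(sum(avail)),
                fitness = rep(names(avail), times = avail))

# Draw the chosen number of rows at random within each fitness level
rows <- unlist(lapply(names(quota), function(k) {
  pool <- which(y$fitness == k)
  pool[sample.int(length(pool), quota[[k]])]
}))
s <- y[rows, ]
```

Tightening `tol` shrinks the sample and loosening it grows the sample, so the tolerance is the knob that trades match quality against size.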
Additionally, how could I add another constraint so that the distribution of gender in the sample also closely matches that in x?
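One way to picture the two-variable version is to treat each gender and fitness combination as a single stratum and sample proportionally within strata, which matches the joint distribution and therefore both margins. The sketch below is only an illustration under assumptions: the gender column of `y` is simulated because the question gives only the fitness counts for y, and the helpers `cell()` and `counts()` are hypothetical names. Matching the joint cells is stricter than matching each margin separately, so it may yield a smaller sample than necessary.

```r
# gender and fitness columns of x, copied from the listing in the question
x <- data.frame(
  gender  = c("M", "F", "M", "M", "M", "F", "M", "F",
              "F", "F", "M", "F", "M", "F", "F"),
  fitness = c("medium", "medium", "high", "medium", "low", "low",
              "medium", "high", "medium", "low", "medium", "medium",
              "high", "low", "medium")
)

# Hypothetical y: fitness counts are from the question, gender is simulated
set.seed(1)
y <- data.frame(
  fitness = rep(c("low", "medium", "high"), times = c(42, 45, 13)),
  gender  = sample(c("M", "F"), 100, replace = TRUE)
)

# One stratum per gender/fitness combination, e.g. "F_low"
cell <- function(d) interaction(d$gender, d$fitness, sep = "_")

# Table of counts as a plain named integer vector
counts <- function(f) {
  t <- table(f)
  setNames(as.integer(t), names(t))
}

target <- counts(cell(x))
avail  <- counts(factor(cell(y), levels = names(target)))

# Largest common scale factor that no stratum of y runs out under;
# the small epsilon guards against floating-point error before floor()
scale <- min((avail / target)[target > 0])
quota <- floor(scale * target + 1e-9)

# Draw quota[k] rows at random from each stratum of y
y_cell <- as.character(cell(y))
rows <- unlist(lapply(names(quota), function(k) {
  pool <- which(y_cell == k)
  pool[sample.int(length(pool), quota[[k]])]
}))
s <- y[rows, ]

# Compare the margins of the sample with those of x
prop.table(table(s$fitness))
prop.table(table(s$gender))
```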