I need to create a loop that finds the subset of my data with the greatest variance. I have a dataset of 150 genotypes and want to know which subset of 75 of them has the greatest variance. Enumerating all possible subsets (~9.3×10^43) and calculating the variance of each one is infeasible.
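As a quick check of that count, base R's `choose()` gives the number of distinct 75-element subsets of 150 items. This is only a sanity check of the figure quoted above; the variable name `n_subsets` is just for illustration.

```r
# Number of distinct 75-element subsets that can be drawn from 150 genotypes
n_subsets <- choose(150, 75)
print(n_subsets)
```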
So I think the following procedure should give (approximately) what I want:
- The initial subset has size 3. It consists of the minimum, the maximum, and a third member taken from the remaining genotypes. The variance is calculated for each such combination of 3, and the combination with the largest variance is carried over to the next iteration.
- The sample size is then increased by 1: each of the remaining genotypes is added in turn, and the one that gives the largest variance is selected and carried over to the next iteration.
- The procedure continues until the desired sample size is reached.
Here is sample data for 20 genotypes:
Genotype <- c("BK001", "BK002", "BK003", "BK004", "BK005", "BK006", "BK007", "BK008", "BK009", "BK010",
              "BK011", "BK012", "BK013", "BK014", "BK015", "BK016", "BK017", "BK018", "BK019", "BK020")
Protein <- c(13.25287, 14.34778, 13.87116, 14.00869, 14.77897, 14.43378, 15.89361, 15.96695, 13.78778, 12.84457,
             12.99955, 14.28378, 14.15799, 12.42578, 14.80507, 13.56095, 15.26557, 14.45378, 13.06739, 14.34230)
my_df <- data.frame(Genotype, Protein)
The desired outcome is a list of the 10 genotypes that form the subset with the highest variance.
My question is: how can I do this in R?
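To make the procedure concrete, here is a rough sketch of the greedy steps described above, using only base R. The helper name `greedy_max_var` and its arguments `x` (numeric vector) and `k` (target subset size) are hypothetical, chosen for illustration. It starts from the minimum and maximum, then repeatedly adds whichever remaining value gives the largest variance. Being greedy, it is not guaranteed to find the true optimum.

```r
# Greedy forward selection of k values with (approximately) the largest variance.
# `greedy_max_var` is a hypothetical helper name, not a function from any package.
# Returns the positions in `x` of the selected values, in order of selection.
greedy_max_var <- function(x, k) {
  stopifnot(is.numeric(x), !anyNA(x), k >= 2, k <= length(x), min(x) < max(x))

  # Step 1: start from the minimum and the maximum
  selected <- c(which.min(x), which.max(x))

  # Steps 2-3: grow the subset one member at a time until it has k members
  while (length(selected) < k) {
    remaining <- setdiff(seq_along(x), selected)

    # Variance of the current subset extended by each remaining candidate
    cand_var <- vapply(remaining,
                       function(i) var(x[c(selected, i)]),
                       numeric(1))

    # Keep the candidate that yields the largest variance
    selected <- c(selected, remaining[which.max(cand_var)])
  }
  selected
}
```

With the sample data above, `my_df$Genotype[greedy_max_var(my_df$Protein, 10)]` would list the 10 selected genotypes, and `var(my_df$Protein[greedy_max_var(my_df$Protein, 10)])` would give the variance of that subset.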