
I need to write a loop that finds the subset of my data with the greatest variance. I have a dataset of 150 genotypes and want to know which subset of 75 of them has the greatest variance. Enumerating all possible subsets (~9.3×10^43) and calculating the variance of each is not feasible.
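As a quick check of that count, base R's choose() gives the number of distinct subsets directly. This is only a sketch to confirm the order of magnitude; the variable name n_subsets is made up for illustration.

```r
# Number of distinct 75-element subsets that can be drawn from 150 genotypes.
# choose(n, k) is the binomial coefficient from base R.
n_subsets <- choose(150, 75)

# Printed in scientific notation; the value is on the order of 10^43.
print(n_subsets)
```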

So I think the following procedure should give (approximately) what I want:

  1. The initial subset has size 3: the minimum, the maximum, and one of the remaining genotypes. The variance is calculated for each such triple, and the triple with the largest variance is carried into the next iteration.
  2. The subset size is increased by 1 by trying each of the remaining genotypes in turn. The candidate subset with the largest variance is selected again and carried into the next iteration.
  3. The procedure continues until the desired subset size is reached.

Here I provide sample data for 20 genotypes:

Genotype <-c("BK001","BK002","BK003","BK004","BK005","BK006","BK007","BK008","BK009","BK010", "BK011","BK012","BK013","BK014","BK015","BK016","BK017","BK018","BK019","BK020")

Protein <- c(13.25287,14.34778,13.87116,14.00869,14.77897,14.43378,15.89361,15.96695,13.78778, 12.84457,12.99955,14.28378,14.15799,12.42578,14.80507,13.56095,15.26557,14.45378,13.06739,
14.34230)

my_df <- data.frame(Genotype, Protein)

The desired outcome is a list of the 10 genotypes that form the subset with the highest variance.

How can I do this in R?

IvanaOs

1 Answer


Here is an initial stab at a solution. The first function that came to mind was expand.grid(), but, as explained here, you can also use merge() with by=NULL to build every combination of rows across data frames.

library(dplyr)

# data
Genotype <- c("BK001","BK002","BK003","BK004","BK005","BK006","BK007","BK008","BK009","BK010", "BK011","BK012","BK013","BK014","BK015","BK016","BK017","BK018","BK019","BK020")
Protein <- c(13.25287,14.34778,13.87116,14.00869,14.77897,14.43378,15.89361,15.96695,13.78778, 12.84457,12.99955,14.28378,14.15799,12.42578,14.80507,13.56095,15.26557,14.45378,13.06739,
             14.34230)
my_df <- data.frame(Genotype, Protein)

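# cross join of my_df with itself: every ordered pair of rows
# (by=NULL yields the Cartesian product, so a genotype can pair with itself)
combinations <- merge(my_df, my_df, by = NULL)
# cross join again: every ordered triple of rows
combinations <- merge(combinations, my_df, by = NULL)
# row-wise variance of the three Protein values in each triple
combinations$variance <- apply(combinations[, c('Protein', 'Protein.x', 'Protein.y')], 1, var, na.rm = TRUE)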

combinations %>% 
  arrange(-variance) %>% 
  head(10)

# Genotype.x Protein.x Genotype.y Protein.y Genotype  Protein variance
# 1       BK014  12.42578      BK008  15.96695    BK008 15.96695 4.179962
# 2       BK008  15.96695      BK014  12.42578    BK008 15.96695 4.179962
# 3       BK014  12.42578      BK014  12.42578    BK008 15.96695 4.179962
# 4       BK008  15.96695      BK008  15.96695    BK014 12.42578 4.179962
# 5       BK014  12.42578      BK008  15.96695    BK014 12.42578 4.179962
# 6       BK008  15.96695      BK014  12.42578    BK014 12.42578 4.179962
# 7       BK014  12.42578      BK008  15.96695    BK007 15.89361 4.095185
# 8       BK008  15.96695      BK014  12.42578    BK007 15.89361 4.095185
# 9       BK014  12.42578      BK007  15.89361    BK008 15.96695 4.095185
# 10      BK007  15.89361      BK014  12.42578    BK008 15.96695 4.095185
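
The stepwise procedure described in the question can also be sketched directly as a greedy forward selection, which keeps every genotype unique. This is a minimal sketch, not a guaranteed optimum: greedy_max_variance is a hypothetical helper name (not from any package), and it assumes a single numeric trait with no missing values.

```r
# Sample data from the question.
Genotype <- c("BK001","BK002","BK003","BK004","BK005","BK006","BK007","BK008","BK009","BK010",
              "BK011","BK012","BK013","BK014","BK015","BK016","BK017","BK018","BK019","BK020")
Protein <- c(13.25287,14.34778,13.87116,14.00869,14.77897,14.43378,15.89361,15.96695,13.78778,
             12.84457,12.99955,14.28378,14.15799,12.42578,14.80507,13.56095,15.26557,14.45378,
             13.06739,14.34230)
my_df <- data.frame(Genotype, Protein)

# Hypothetical helper: greedy forward selection of k items with large variance.
# Assumes `values` is numeric without NAs and `labels` has the same length.
greedy_max_variance <- function(values, labels, k) {
  stopifnot(length(values) == length(labels), k >= 2, k <= length(values))
  # Start from the two extremes (minimum and maximum).
  ord <- order(values)
  selected <- c(ord[1], ord[length(ord)])
  while (length(selected) < k) {
    remaining <- setdiff(seq_along(values), selected)
    # Variance of the current subset plus each remaining candidate in turn.
    cand_var <- vapply(remaining,
                       function(i) var(values[c(selected, i)]),
                       numeric(1))
    # Keep the candidate that gives the largest variance.
    selected <- c(selected, remaining[which.max(cand_var)])
  }
  data.frame(Genotype = labels[selected], Protein = values[selected])
}

# Usage: pick 10 of the 20 sample genotypes (k = 75 for the full dataset).
result <- greedy_max_variance(my_df$Protein, my_df$Genotype, k = 10)
print(result)
print(var(result$Protein))
```

Each step evaluates only the genotypes that have not been selected yet, so no genotype can appear twice, and reaching 75 out of 150 takes a few thousand variance calculations rather than an enumeration of all subsets.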
Claudio Paladini
  • Thank you for offering a possible solution. It's a great start, but I can already see that some genotypes are repeated within a row, and I am looking for a set of 10 unique genotypes. I expect the number of repeated genotypes per row will only grow as the subsets get larger, since my final goal is to choose 75 out of 150 genotypes. – IvanaOs Oct 19 '21 at 12:28