I have example data below (the real data is 150x200) and need to keep the combination of rows and columns that leaves me the fewest NAs. I could use complete.cases(), but it removes too many rows.
Just by looking at this example, it is obvious to exclude row x6, as it has the highest NA count. Similarly, we can exclude columns A and F, as they have the highest NA counts (the counts are verified in the snippet after the data below).
I need some hint on the logic; it doesn't have to be a full code solution.
#reproducible data
df <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")
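# For reference, the per-row and per-column NA counts mentioned above
# can be checked directly (excluding the SampleID column):
rowSums(is.na(df[, -1]))  # x6 has 5 NAs, the most of any row
colSums(is.na(df[, -1]))  # A and F have 5 NAs each, the most of any column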
# complete cases
df[complete.cases(df), ]
# SampleID A B C D E F
#5 x5 x x x x x x
#10 x10 x x x x x x
Additional info: this is data for a risk calculation; rows are samples and columns are variables. Each variable has a risk factor of some value. The risk-prediction algorithm (run in separate custom software) can work with, say, 5 variables or with 200, but more variables obviously give a more reliable answer. For the results to be comparable, most samples should share mostly the same set of variables. I will need to keep at least ~60% of the samples (rows).
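One direction to make this concrete: a greedy heuristic that repeatedly drops whichever row or column currently has the highest fraction of NAs, refusing to drop rows once only ~60% remain. A rough sketch (prune_na and min_row_frac are just names I made up; this is a heuristic, not necessarily optimal):

# Greedy sketch: remove the row or column with the highest *fraction*
# of NAs until none remain, but never let the row count fall below
# min_row_frac of the original rows.
prune_na <- function(dat, min_row_frac = 0.6) {
  m <- as.matrix(dat[, -1])            # drop the SampleID column
  rownames(m) <- dat$SampleID
  min_rows <- ceiling(nrow(m) * min_row_frac)
  while (anyNA(m)) {
    row_na <- rowSums(is.na(m))
    col_na <- colSums(is.na(m))
    # Compare NA fractions so rows (length ncol) and columns (length nrow)
    # are judged on the same scale; drop a row only while the quota allows.
    if (nrow(m) > min_rows &&
        max(row_na) / ncol(m) >= max(col_na) / nrow(m)) {
      m <- m[-which.max(row_na), , drop = FALSE]
    } else {
      m <- m[, -which.max(col_na), drop = FALSE]
    }
  }
  m
}
prune_na(df)

On the toy data this keeps 7 of the 11 rows (x2, x4, x5, x7, x9, x10, x11) but only columns B and E, so I suspect a smarter trade-off between dropping rows and dropping columns is possible; that trade-off is exactly what I need the hint on.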