
I'm trying to eliminate columns in a large data set if there are too many NA values in the column. There are 1007 variables in the data set. I came up with the following code but I don't think it is working.

> for(i in 1:1007){
+ if (length(which(is.na(train3[i])=="TRUE"))>1955) train3[i]<-NULL
+ else train3[i]<-train3[i]
+ }
Error in which(is.na(train3[i]) == "TRUE") : 
  error in evaluating the argument 'x' in selecting a method for function 'which': Error in `[.data.frame`(train3, i) : undefined columns selected

So I'm trying to eliminate the columns that have more than 1955 NAs. Is there a way to make this work?

halo09876
  • The error looks like it's trying to say that `i` is too big. I have never used `r` before, so I don't know if it's shifting the columns every time you remove one of them, but if you can do a for loop from 1007 to 1, maybe that will work for you? – jonhopkins Dec 02 '13 at 13:16
  • `train3[,i]` doesn't work. I'm not sure what's wrong with the code, since `length(which(is.na(train2[477])=="TRUE"))` returns an integer. – halo09876 Dec 02 '13 at 13:25
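
To illustrate the index shifting suggested in the first comment, here is a minimal sketch on a made-up data frame (`df`, `forward`, and `max_na` are hypothetical names, not from the question). Assigning `NULL` drops a column, so the remaining columns move left and a loop over the original column numbers eventually asks for a column that no longer exists. Looping from the last column to the first should avoid that:

```r
# Toy data: columns b and d have more than max_na missing values
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(NA, NA, NA, 4),
                 c = c(1, NA, 3, 4),
                 d = c(NA, NA, NA, NA))
max_na <- 2  # hypothetical threshold; the question uses 1955

# Forward loop over the original column count: after two columns are
# dropped, index 4 no longer exists and the loop stops with an error
forward <- df
res <- tryCatch({
  for (i in 1:4) {
    if (sum(is.na(forward[[i]])) > max_na) forward[[i]] <- NULL
  }
  "ok"
}, error = function(e) "error")

# Backward loop: dropping column i never shifts a column that still
# has to be visited
for (i in rev(seq_along(df))) {
  if (sum(is.na(df[[i]])) > max_na) df[[i]] <- NULL
}

names(df)
```

On this toy data the backward loop leaves columns `a` and `c`, while the forward loop fails in the same way as the loop in the question.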

2 Answers


Code not tested, since the question doesn't provide example data:

train3 <- train3[, sapply(train3, function(x) sum(is.na(x))<=1955)]
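
As a quick check of this one-liner on made-up data, something like the following should work. The frame `toy` and the threshold `max_na = 2` are assumptions standing in for `train3` and 1955. `drop = FALSE` is an addition here, so that the result stays a data frame even if only one column survives:

```r
# Toy data: column b has 3 NAs, column c has 1, column a has none
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(NA, NA, NA, 4),
                  c = c(1, NA, 3, 4))
max_na <- 2  # hypothetical threshold

# One TRUE/FALSE per column: TRUE if the column has at most max_na NAs
keep <- sapply(toy, function(x) sum(is.na(x)) <= max_na)

# Keep only the flagged columns
toy_kept <- toy[, keep, drop = FALSE]
names(toy_kept)
```

Because the columns are selected in one step rather than removed one at a time, no column indices shift while the NAs are being counted.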
Roland

I created a smaller matrix [100 X 1007], but you can adapt it:

#MAKE UP SOME SAMPLE DATA
d<-sample(c(c(1:10),c(5:9),rep(NA,times=8)),size=100700,replace=TRUE)
train3<-data.frame(matrix(d,nrow=100))

#GET THE NA COUNTS PER COLUMN
counts<-apply(train3,2,function(x)length(x[is.na(x)]))
#SELECT ALL COLUMNS WITH LESS THAN 35 NA's (for the real data, use counts<=1955)
train3<-train3[,names(counts[counts<35])]
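
A possible shortcut for the counting step, sketched on a small made-up frame: `is.na()` on a data frame returns a logical matrix, so `colSums()` gives the NA count per column without going through `apply()`. The names `toy`, `na_counts`, and `max_na` are illustrative only:

```r
# Toy data with mixed column types
toy <- data.frame(x = c(1, NA, 3),
                  y = c(NA, NA, NA),
                  z = c("a", "b", NA))
max_na <- 1  # hypothetical threshold; the question uses 1955

# Number of NAs in each column
na_counts <- colSums(is.na(toy))

# Keep the columns with at most max_na NAs
toy_kept <- toy[, na_counts <= max_na, drop = FALSE]
names(toy_kept)
```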
Troy