
I have a data.frame with 2.5 million observations of 32 variables, all factors. One variable consists of numbers between 0 and 999. I want to convert all the numbers above 99 to NA because the model only accepts numbers with 2 digits.

Thanks,

Tim

Matthew Lundberg
Tim_Utrecht
  • Welcome to SO. Please read [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). You should also show [what you have tried](http://mattgemmell.com/2008/12/08/what-have-you-tried/). – zero323 Nov 04 '13 at 08:33
  • Excuse me. I tried to set the values larger than 99 to NA with `dataframe[dataframe$postcode > 99] <- NA`, which gives the error: Error in `[<-.data.frame`(`*tmp*`, dataframe$postcode > 99, value = NA) : missing values are not allowed in subscripted assignments of data frames – Tim_Utrecht Nov 04 '13 at 08:39
  • Thanks, but I think what @zero323 wants, and I do too, is for you to add that to the question along with a small part of your data frame ([created by `dput`](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), for instance) that we can test our solutions on. – Backlin Nov 04 '13 at 08:42
  • Try: `dataframe[dataframe$postcode > 99,"postcode"] <- NA` – James Nov 04 '13 at 08:43
  • `set.seed(10); dataframe <- data.frame(postcode=as.factor(round(runif(10, 1,200)))); dataframe[as.numeric(levels(dataframe$postcode)[dataframe$postcode]) > 99, 'postcode'] <- NA` – zero323 Nov 04 '13 at 08:45
  • @James Correct me if I'm wrong but I think that `dataframe$postcode > 99` won't work for factors. – zero323 Nov 04 '13 at 08:51
  • @zero323 That's true, in that case use `nchar(as.character(dataframe$postcode))>2` – James Nov 04 '13 at 09:19
  • @James Hard to generalize but neat. – zero323 Nov 04 '13 at 09:31
  • Thanks for your comments. @zero323: The formula does not work for me. @James: It indeed does not work for factors. When I use your adjustment, `dataframe[nchar(as.character(dataframe$postcode))>2] <- NA`, I get the following error: Error in `[<-.data.frame`(`*tmp*`, nchar(as.character(data.read$PropertyPostcode)) > : duplicate subscripts for columns. – Tim_Utrecht Nov 04 '13 at 09:52
  • After `head(dput)` I get the following: `structure(c(3L, 46L, 66L, 2L, 59L, 30L), .Label = c("10", "11".........)` – Tim_Utrecht Nov 04 '13 at 09:57
  • @Tim, you need to use two arguments to `[` as in my original comment, otherwise you will only extract complete columns – James Nov 04 '13 at 10:06
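
Putting the comments together, the fix seems to have two parts: compare the factor's labels (not the factor itself) against 99, and index with two arguments so that only rows of the `postcode` column are touched. Below is a minimal sketch of that idea. The toy data, the extra `other` column, and the helper name `postcode_num` are hypothetical, made up for illustration only.

```r
# Hypothetical toy data: factor columns, one holding numbers between 0 and 999
dataframe <- data.frame(
  postcode = factor(c("5", "42", "100", "999", "99", "7")),
  other    = factor(c("a", "b", "a", "c", "b", "a"))
)

# Convert the factor labels (not the underlying integer codes) to numbers
postcode_num <- as.numeric(as.character(dataframe$postcode))

# Two-argument indexing: rows selected by the condition, one named column
dataframe[postcode_num > 99, "postcode"] <- NA

# Optionally drop the levels that are no longer used ("100" and "999")
dataframe$postcode <- droplevels(dataframe$postcode)

print(dataframe)
```

With a single argument, as in `dataframe[condition] <- NA`, the logical vector is interpreted as a column selector, which is probably why the earlier attempts produced subscript errors.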

1 Answer

######making example data set######
ex=matrix(as.factor(rnorm(6,100,10)),3,2)

ex

#           [,1]      [,2]
# [1,] 113.29893 101.54136
# [2,]  91.55164 101.45872
# [3,] 101.14473  88.19593

ex2=data.frame(ex)
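###### solution ######
# apply() first coerces the data frame to a character matrix, so as.numeric
# parses the printed values instead of returning the factors' integer codes
ex3=apply(ex2,2,as.numeric)

# set every value above 99 to NA (this affects all columns)
ex3[ex3>99]=NA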

ex3
#         X1       X2
# 1       NA       NA
# 2 91.55164       NA
# 3       NA 88.19593
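
Note that the code above thresholds every column and returns a numeric matrix. If only one factor column should change and the rest of the data frame should stay as it is, one possible alternative is to edit the factor's levels directly, which works on the handful of distinct levels rather than on all 2.5 million rows. This is a sketch under assumptions: the data frame `df` and its `region` column are hypothetical stand-ins for the real data.

```r
# Hypothetical toy data standing in for the large all-factor data frame
df <- data.frame(
  postcode = factor(c("12", "250", "99", "100", "7", "250")),
  region   = factor(c("n", "s", "n", "e", "w", "s"))
)

# Flag the levels whose numeric value is above 99
too_big <- as.numeric(levels(df$postcode)) > 99

# Setting a level to NA removes that level and turns its values into NA
levels(df$postcode)[too_big] <- NA

print(df)
```

The column stays a factor, so it can still be passed to a model that expects factors, and the other columns are left untouched.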
Tay Shin