
I have a data frame with a mix of numeric and factor variables.

I am trying to recursively replace all outliers (values more than 3 x SD from the mean) with NA, but I keep running into the following error:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

The code I was using is:

name = factor(c("A","B","NA","D","E","NA","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
data[is.na(data)] <- 77777 
data.scale <-  scale(data)
data.scale[ abs(data.scale) > 3 ] <- NA
data <- data.scale

Any suggestions on how to get this working?

GlenCloncurry
  • 1
    Including a [reproducible example](http://stackoverflow.com/questions/5963269) will make it much easier for others to help you. – Jaap Sep 13 '17 at 09:25
  • 2
    If you're talking about outliers, your variable probably shouldn't be a factor – moodymudskipper Sep 13 '17 at 09:34
  • 1
    You are applying a mathematical operation to a data frame that does not contain only numeric values. Use `data = data.frame(mark=mark,age=age,height=height)`, without the `name` column. Run the rest of the code and add the line `data<-cbind(name,data)` at the end. – Smich7 Sep 13 '17 at 09:47

1 Answer


Here's one approach:

library(dplyr)

# take note of order for column names
data.names <- colnames(data)

# scale all numeric columns
data.numeric <- select_if(data, is.numeric) %>% # subset of numeric columns
  mutate_all(scale)                             # perform scale separately for each column
data.numeric[abs(data.numeric) > 3] <- NA       # set values beyond 3 SD in either direction to NA (none in this example)

# combine results with subset data frame of non-numeric columns
data <- data.frame(select_if(data, function(x) !is.numeric(x)),
                   data.numeric)

# restore columns to original order
data <- data[, data.names]

> data
  name        mark         age     height
1    A  0.20461856 -0.80009469 -1.0844636
2    B -1.43232992 -0.55391171         NA
3   NA  0.20461856 -1.04627767 -0.1459855
4    D -0.61796862 -0.30772873  0.4796666
5    E  0.04010112 -0.06154575         NA
6   NA  0.20461856  0.18463724 -0.2711159
7    G          NA  0.43082022 -0.7090723
8    H -0.61796862          NA  1.7309707
9    H  2.01431035  2.15410109         NA

Note: in this approach the non-numeric (character, factor, etc.) columns end up before the numeric ones, so the last step restores the original column order (if applicable).
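If you would rather keep the original values instead of the scaled z-scores, and avoid the column reordering altogether, a base R version might look like the sketch below. This is not part of the approach above: `replace_outliers` and its `threshold` argument are hypothetical names introduced for illustration, and the function makes a single pass only (it does not re-check for outliers after the first replacement).

```r
# Sample data from the question
data <- data.frame(
  name   = factor(c("A", "B", "NA", "D", "E", "NA", "G", "H", "H")),
  mark   = c(100, 0.5, 100, 50, 90, 100, NA, 50, 210),
  age    = c(10, 20, 0, 30, 40, 50, 60, NA, 130),
  height = c(120, NA, 150, 170, NA, 146, 132, 210, NA)
)

# Hypothetical helper (an assumption, not from the original answer):
# sets values more than `threshold` SDs from the column mean to NA,
# touching numeric columns only and leaving factors as they are.
replace_outliers <- function(df, threshold = 3) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(x) {
    z <- as.numeric(scale(x))  # z-scores; scale() skips NAs when computing mean and SD
    x[!is.na(z) & abs(z) > threshold] <- NA
    x
  })
  df
}

result <- replace_outliers(data)
```

Because the columns are modified in place, `result` keeps the original column order and the original units. Nothing in the sample data is more than 3 SD from its column mean, so `result` matches `data` here; a lower `threshold` (for example 2) would blank out the largest `mark` and `age` values.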

Z.Lin