0

I have data in which some of the columns have a mix of character and numeric data. For example:

Sensor  Sensor2   Sensor3
Fail    5         Fail
Fail    5         Fail
Fail    5         Fail
Fail    5         Fail
 5      5         Fail
 6      5         Fail
 8      5         Fail

All sensor values should be numeric, but R automatically take Sensor1 and Sensor3 as character values. I want to replace "Fail" with NA or some numeric values, or change the whole Column type to Numeric type so 'Fail' will be changed to NA automatically.

My problem is that my data set is really large (sensor data generated every second, with more than 200 variables). I want to identify all the columns that contains 'Fail' and change these columns to numeric type.

This is what I tried: I wrote a function

function(mydata, value1, value2){
   mydata <- data.frame(lapply(mydata, function(x){              
    gsub(value1, value2 ,x)
     }))
 }

This function works, but the only problem is that the whole data frame was changed to factor type. I have a mix of numeric and character that I want to keep the data type correct. So I tried

   function(mydata, value1, value2){
    mydata <- data.frame(lapply(mydata, function(x){              
    is.numeric(gsub(value1, value2 ,x))
     }))
   }

This changed my columns all to numeric, so I missed all the character column information.

How can I identify only the 'Fail' columns and make the change only to those columns and update my data frame with the change? Thanks.

Henrik
  • 65,555
  • 14
  • 143
  • 159
MiRe Y.
  • 57
  • 8
  • 1
    If you use any of the `read.xyz` functions to read your data, e.g. `read.table`, you may use the `na.strings` argument, like `na.strings = c( – Henrik Aug 28 '17 at 18:24
  • Related: [Replacing character values with NA in a data frame](https://stackoverflow.com/questions/3357743/replacing-character-values-with-na-in-a-data-frame/3357804#3357804) – Henrik Aug 28 '17 at 18:39
  • Yes, Henrik. I know this methods. I am writing some auto functions that can do this job in a series of data preprocessing/cleaning steps because our users usually just use the data import button to import data themselves. Some cleaning job was done after data was imported. Thanks a lot! – MiRe Y. Aug 28 '17 at 19:14

1 Answers1

0

I create example data with

data <- read.table(text = "Sensor  Sensor2   Sensor3
                   Fail    5         Fail
                   Fail    5         Fail
                   Fail    5         Fail
                   Fail    5         Fail
                    5      5         Fail
                    6      5         Fail
                    8      5         Fail",
                   header = TRUE, stringsAsFactors = FALSE)
data$char <- letters[1:nrow(data)]

The last line introduces an additional column that should not be converted to numeric.

Note that I use stringsAsFactors = FALSE to make sure that no columns turn out to be factors. You should definitely also check in your data that it does not contain any factor variables before you do the conversion.

The first step is to identify columns that contain "Fail" somewhere:

fail_cols <- sapply(data, function(x) any(grepl("Fail", x)))

Then, I define a function that replaces "Fail" with NA and converts to numeric:

to_num <- function(x) {
  as.numeric(gsub("Fail", NA, x))
}

Note that I use as.numeric(), not is.numeric(). It is the former that converts to numeric, the latter checks whether a variable is numeric.

This function is now applied only to those columns of the data frame that contain "Fail":

data[fail_cols] <- lapply(data[fail_cols], to_num)
data
##   Sensor Sensor2 Sensor3 char
## 1     NA       5      NA    a
## 2     NA       5      NA    b
## 3     NA       5      NA    c
## 4     NA       5      NA    d
## 5      5       5      NA    e
## 6      6       5      NA    f
## 7      8       5      NA    g

You already pointed out in the question that simply using as.numeric() instead of to_num() would give the same result. It would have, however, the disadvantage that it issues warning messages. In order to avoid these, it may still be helpful to define to_num().

Stibu
  • 15,166
  • 6
  • 57
  • 71
  • Hi, Stibu, thanks for your reply, I really appreciate it. I think is.numeric is just a typo, I did have as.numeric in my function. First of all, this step is part of my large data sets preprocessing/cleaning. I don't want to save the previous results as tables and read them again for this step; And secondly I just tried your solution. I think your solution have the same problem with my own solution. All the other character-type columns are changed to NA too. – MiRe Y. Aug 28 '17 at 18:51
  • Sorry, I missed the point about the columns that should remain characters. Unfortunately, your example data did not contain a column of that sort. Try to create more realistic example data in your next question... – Stibu Aug 28 '17 at 20:20
  • Thank you very much, Stibu. Problem solved! Appreciate it. – MiRe Y. Aug 30 '17 at 15:03