0

I have a large data set where I want to convert variables to numeric. The data has been entered in excel with some items given a name which corresponds to a number (so data is type character), but sometimes this has been mis-spelt, or unusual terminology used. I need to identify all items in the dataframe which are strings which could not be converted to numeric.

Here is MWE:

num = data.frame(var1 = c("2", "green", "5"),
                 var2 = c("blue","4",  "9"),
                 var3 = c("ble", "4", "1"),
                 var4 = c("5", "7", "big"))

In this example the output would be c("green","blue","ble", "big")

I will then convert these to the relevant number as follows:

num%>%
  mutate(across(contains("var"), ~ str_replace_all(., c("green" = "3", "blue|ble" = "3.5", "big" = "10"))))

before converting to numeric

Mark Davies
  • 787
  • 5
  • 18

1 Answers1

2

Unlist to create an atomic vector then select the elements with contain anything that is not a number.

grep("\\D", unlist(num, use.names = FALSE), value = TRUE)
# [1] "green" "blue"  "ble"   "big"  

If you have decimal separators "." then the above will fail and smth like the following should be used

grep("^(\\d|\\.)+$", unlist(num, use.names = FALSE), value = TRUE, invert = TRUE)

Then you can create a lookup vector to replace them and convert to numeric:

lut = c("green" = 10, "blue" = 20, "ble" = 30, "big" = 40)
num[] <- lapply(num, \(x) as.numeric(ifelse(grepl("\\D", x), lut[x], x)))

#   var1 var2 var3 var4
# 1    2   20   30    5
# 2   10    4    4    7
# 3    5    9    1   40
s_baldur
  • 29,441
  • 4
  • 36
  • 69