I need to convert a messy factor into a numeric. The sample data looks like this:

x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L), 
    .Label = c("", "106", "39", "8", "80", "chyb\x92 foto"), class = "factor")

My desired output would be:

x
[1]   8  80  NA  NA  NA 106  39
class(x)
"numeric"

However, the first line of my intended code results in a warning and the text is not replaced with NAs.

x[grepl("[a-z]", x) | x==""] <- NA
x <- as.numeric(levels(x))[x]

Warning messages:
1: In grepl("[a-z]", x) : input string 4 is invalid in this locale
2: In grepl("[a-z]", x) : input string 5 is invalid in this locale

The second line then runs correctly and provides the correct output with NAs introduced by coercion. Why does grepl fail to recognise letters in some factor levels, and how can as.numeric pick them out and replace them with NAs?

The factor-to-numeric conversion was taken from this question. However, the fact that it works does not explain why it works.

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] cs_CZ.UTF-8/cs_CZ.UTF-8/cs_CZ.UTF-8/C/cs_CZ.UTF-8/cs_CZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.3.0
nya
  • grep takes the integer part of the factor (the vector), not the levels. Maybe read `?factor` ? – Tensibai Jul 26 '16 at 09:17
  • I get the desired output with your lines, without the warnings – Cath Jul 26 '16 at 09:19
  • @Tensibai I don't think the factor is the problem. `grepl("a-z", levels(x))` generates the same warning. – nya Jul 26 '16 at 09:20
  • I don't get a warning with that line so probably there is something you're not telling us (maybe try your example and your lines in a fresh session) and/or you should give us your sessionInfo – Cath Jul 26 '16 at 09:21
  • @Cath I added the sessionInfo. The warning persists in a fresh session. – nya Jul 26 '16 at 09:26
  • @Tensibai Tried. Problem not solved. Note that my `grepl` pattern already is as you suggest. – nya Jul 26 '16 at 09:31
  • @nya seems the problem comes from `"chyb\x92 foto"`, what does it print if you do `t<-"chyb\x92 foto";print(t)` (for me the \x92 is a single quote) ? – Tensibai Jul 26 '16 at 09:36
  • @Tensibai `[1] "chyb\x92 foto"` In the original csv file, the text reads "chybí foto". – nya Jul 26 '16 at 09:42
  • @nya ok so we get the cause (more or less), now the why seems to be an invalid char in Czech locale – Tensibai Jul 26 '16 at 09:45
  • One more question: could you add the results of `Encoding(levels(x))` ? (And maybe `levels(x) <- enc2utf8(levels(x))` could solve the problem) – Tensibai Jul 26 '16 at 09:56
  • @Tensibai Thank you, that solved the `grepl` issue. I added my current understanding of the problem as an answer. – nya Jul 26 '16 at 10:15

2 Answers

We can just do

as.numeric(as.character(x))
#[1]   8  80  NA  NA  NA 106  39

If we are using grepl, we can match the strings that consist only of digits and dots from start (^) to end ($), negate (!) the match, and assign NA to all remaining values. As 'x' is a factor, we can then convert it to numeric with as.numeric(as.character(x)).

 x[!grepl("^[0-9.]+$", x)] <- NA
 as.numeric(as.character(x))
 #[1]   8  80  NA  NA  NA 106  39
akrun
  • Yes. But why cannot `grepl` find cells with text? – nya Jul 26 '16 at 09:09
  • @nya It got dupe tagged. Please check if that works for you, or else we can reopen and I will answer that part – akrun Jul 26 '16 at 09:14
  • Absolutely not. My code `x <- as.numeric(levels(x))[x]` solves the conversion as well as your answer. My question is why `grepl` fails where `as.numeric` succeeds? – nya Jul 26 '16 at 09:16
  • @nya I added the `grepl` part – akrun Jul 26 '16 at 09:19
  • Thank you, but it still gives me the "input string is invalid in this locale" warning. – nya Jul 26 '16 at 09:22
  • @nya Using your example, I am not getting the warning with R 3.3.0. Probably, you need to change the settings to have utf-8 characters – akrun Jul 26 '16 at 09:22
  • @nya You can check the `sessionInfo()` . For the locale, I have `locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252` – akrun Jul 26 '16 at 09:25
  • I added the sessionInfo to the question. In essence, I wish to remove all text, including any accented characters. – nya Jul 26 '16 at 09:33
  • @nya You can check [here](http://stackoverflow.com/questions/13575180/how-to-change-language-settings-in-r) for changing the locale – akrun Jul 26 '16 at 09:35

It seems I found the solution. Thanks to akrun, Cath and Tensibai for pointing me towards Encoding. The strings in levels(x) had their encoding marked as "unknown", and grepl did find the values containing text once it was instructed to match byte-wise:

grepl("[a-z]", x, useBytes = TRUE)
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE

Tensibai's suggestion to specify the encoding gives grepl the same functionality without useBytes.

levels(x) <- enc2utf8(levels(x))
grepl("[a-z]", x, useBytes = FALSE)
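
The encoding problem can also be inspected directly. The following is a minimal sketch, assuming R 3.3.0 or later (for validUTF8) and using the levels from the question. The iconv call with sub = "byte" is an alternative that was not used in this thread; it only replaces the invalid byte with a printable escape and does not recover the original character.

```r
# Levels copied from the question; "\x92" is a raw byte that is not valid UTF-8.
lev <- c("", "106", "39", "8", "80", "chyb\x92 foto")

# Declared encoding of each string ("unknown" means the native encoding is assumed).
enc <- Encoding(lev)

# Which strings are well-formed UTF-8?
valid <- validUTF8(lev)

# Byte-wise matching does not interpret the strings in the current locale.
has_letters <- grepl("[a-z]", lev, useBytes = TRUE)

# Alternative (an assumption, not from the thread): substitute invalid bytes
# so that ordinary, non-byte matching can be used afterwards.
cleaned <- iconv(lev, from = "UTF-8", to = "UTF-8", sub = "byte")
has_letters_clean <- grepl("[a-z]", cleaned)
```

Here valid flags only the last level as malformed, which is the string the warning in the question complains about.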

Unlike grepl, which has to interpret the characters of each string according to its encoding, as.numeric only checks whether a string can be parsed as a number. Text cannot, regardless of its encoding, so it is turned into NA.

Using as.numeric(levels(x))[x] for factor conversion might be a safe method to use by itself without the need to check for problematic values first.
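
As a sketch of that conversion on the sample data from the question: the names codes, level_values and result are only illustrative, and suppressWarnings is added here merely to hide the expected "NAs introduced by coercion" warning.

```r
x <- structure(c(4L, 5L, 1L, 6L, 6L, 2L, 3L),
               .Label = c("", "106", "39", "8", "80", "chyb\x92 foto"),
               class = "factor")

# as.numeric() on the factor itself returns the internal integer codes,
# not the numbers stored in the labels.
codes <- as.numeric(x)

# Convert each level once; levels that are not numbers become NA.
level_values <- suppressWarnings(as.numeric(levels(x)))

# Index the converted levels by the factor codes.
result <- level_values[x]
result
# [1]   8  80  NA  NA  NA 106  39
```

Converting the levels first means each distinct label is parsed only once, however long the factor is.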

nya