How can I detect non-ASCII characters in a vector of strings in a grep-like fashion? For example, for the vector below I'd like to return c(1, 3) or c(TRUE, FALSE, TRUE, FALSE):

x <- c("façile test of showNonASCII(): details{", 
    "This is a good line", "This has an ümlaut in it.", "OK again. }")

Attempt:

y <- tools::showNonASCII(x)
str(y)
p <- capture.output(tools::showNonASCII(x))
sbha
Tyler Rinker
  • Maybe `stringi::stri_enc_mark(x)`? – David Arenburg Jan 05 '16 at 14:20
  • @David I think that will do it... can you throw down as an answer. Maybe others will see an issue with it or have different solutions. – Tyler Rinker Jan 05 '16 at 14:23
  • Why not fix the code so it handles Unicode properly instead? – Panagiotis Kanavos Jan 05 '16 at 14:26
  • @PanagiotisKanavos I will, that's easy, but this is to validate strings so I first need to detect if there's a problem with the data so as to inform the client. – Tyler Rinker Jan 05 '16 at 14:29
  • Why are Latin1 characters considered a problem? Are you trying to detect some *other* problem perhaps, e.g. invalid codepage conversions? – Panagiotis Kanavos Jan 05 '16 at 14:34
  • @PanagiotisKanavos b/c it's data from a client. We want it in a particular format. Non-standard data format is a data scientist's enemy, particularly if you're trying to automate a task. It's far easier and cheaper to get clients to put data in the correct format than to try to clean up and address unforeseen errors later. – Tyler Rinker Jan 05 '16 at 14:42
  • [related question](http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files) – Mutador Jan 05 '16 at 14:49

5 Answers

I came across this later. It uses nothing but a base R regex and is very simple:

grepl("[^ -~]", x)
## [1]  TRUE FALSE  TRUE FALSE

More here: http://www.catonmat.net/blog/my-favorite-regex/
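
The class `[^ -~]` works because the printable ASCII characters form one contiguous range, from space (0x20) to tilde (0x7E), so anything outside that range is flagged. One consequence is that tabs and other control characters are flagged too. The sketch below illustrates this, assuming the input is text R knows to be UTF-8. The helper `has_non_ascii()` is a hypothetical name, not part of base R; it checks code points directly for comparison.

```r
# Example strings; "\u00e7" is c-cedilla, written as an escape so the
# snippet does not depend on the encoding of the source file.
x2 <- c("fa\u00e7ile", "plain text", "tab\there")

# Flags anything outside the printable ASCII range (space through tilde),
# which includes control characters such as the tab.
non_printable <- grepl("[^ -~]", x2)

# Hypothetical helper: TRUE when any code point is above 127.
has_non_ascii <- function(s) {
  vapply(enc2utf8(s), function(el) any(utf8ToInt(el) > 127L),
         logical(1), USE.NAMES = FALSE)
}
strictly_non_ascii <- has_non_ascii(x2)
```

Here the third element is flagged only by the regex, since a tab is ASCII but not printable. Whether that is desirable depends on what counts as a problem in the data being validated.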

Tyler Rinker

Another possible way is to convert the strings to ASCII and then detect the non-printable control characters that are generated in place of the characters which couldn't be converted:

grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1]  TRUE FALSE  TRUE FALSE

Though it seems stringi has a built-in function for this type of thing too:

stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII"  "latin1" "ASCII" 
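
To get the logical vector asked for in the question, the marks can be compared against "ASCII". This is a small sketch, assuming the stringi package is installed; `stri_enc_isascii()` is another stringi function that tests this directly. The mark reported for non-ASCII strings depends on how they are encoded (for example "latin1" or "UTF-8"), so the comparison is made against "ASCII" only.

```r
library(stringi)

# Same sentences as in the question, with the accented letters written
# as escapes so the snippet does not depend on the file encoding.
x2 <- c("fa\u00e7ile test of showNonASCII(): details{",
        "This is a good line",
        "This has an \u00fcmlaut in it.",
        "OK again. }")

# Anything not marked as ASCII contains at least one non-ASCII character.
flag_mark <- stri_enc_mark(x2) != "ASCII"

# Direct test: TRUE where the whole string is ASCII, so negate it.
flag_isascii <- !stri_enc_isascii(x2)

# Indices of the offending elements, grep-style.
idx <- which(flag_mark)
```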
David Arenburg
  • Both solutions are terrific. This one is a bit more compact and may be more robust to other encodings, though, admittedly, I know very little about encodings. – Tyler Rinker Jan 05 '16 at 15:15

Why don't you extract the relevant code from showNonASCII?

x <- c("façile test of showNonASCII(): details{", 
       "This is a good line", "This has an ümlaut in it.", "OK again. }")

grepNonASCII <- function(x) {
  asc <- iconv(x, "latin1", "ASCII")
  ind <- is.na(asc) | asc != x
  which(ind)
}

grepNonASCII(x)
#[1] 1 3
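
If a logical vector is preferred over indices, the same idea can be sketched as below. This is a variant under stated assumptions, not the code from showNonASCII: the function name `hasNonASCII` is made up here, and the input is assumed to be UTF-8 rather than latin1 (pass a different `from` if that is not the case).

```r
# Hypothetical variant of grepNonASCII() that returns TRUE/FALSE per element.
hasNonASCII <- function(x, from = "UTF-8") {
  # iconv() yields NA for elements that cannot be represented in ASCII;
  # the second test covers platforms that substitute characters instead.
  asc <- iconv(x, from, "ASCII")
  is.na(asc) | asc != x
}

x2 <- c("fa\u00e7ile test of showNonASCII(): details{",
        "This is a good line",
        "This has an \u00fcmlaut in it.",
        "OK again. }")

flags <- hasNonASCII(x2)
```

Note that a missing value in the input is also reported as TRUE by this sketch, because `iconv()` returns NA for it.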
Roland
  • The iconv function appears to remove the variable label attributes randomly in a dataframe when applied. What could be reasons? – Heatshock Sep 17 '22 at 17:20
  • @Heatshock I have no idea what you are doing. `iconv` should preserve attributes. – Roland Sep 19 '22 at 05:37
  • I read in a SAS dataset in xpt format. Here is the code: `dv <- read_xpt('adsl.xpt')`, then `dv1 <- dv |> mutate(across(everything(), ~iconv(., "latin2", "ascii")))`, then `str(dv1)`. Some of the variable labels get lost in a consistent way, but not for all variables. Do you know why? – Heatshock Sep 22 '22 at 20:34
  • No, I don't. Might be due to your use of dplyr. – Roland Sep 23 '22 at 04:58

A bit late, I guess, but this could be useful for future readers.

You can find these functions:

  • showNonASCII(<character_vector>)
  • showNonASCIIfile(<file>)

in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They do exactly what is asked here: show non-ASCII characters in a character vector or in a text file.

Odin

A stringr regex solution:

library(stringr)
x <- c("façile test of showNonASCII(): details{", 
    "This is a good line",
    "This has an ümlaut in it.", "OK again. }")
str_detect(x, "[^[:ascii:]]")
# => [1]  TRUE FALSE  TRUE FALSE

The [^[:ascii:]] pattern matches any non-ASCII character.

The [[:ascii:]] pattern matches any ASCII character.

If you ever need to make sure the whole string consists of non-ASCII chars, use

str_detect(x, "^[^[:ascii:]]+\\z")

where ^ matches the start of string and \z matches the very end of string.
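
For readers without stringr, a roughly equivalent base R call can be sketched with `perl = TRUE`, since PCRE also understands the `[:ascii:]` class. This is an alternative offered under that assumption rather than part of the stringr solution; `grep()` gives the indices and `grepl()` the logical vector.

```r
x2 <- c("fa\u00e7ile test of showNonASCII(): details{",
        "This is a good line",
        "This has an \u00fcmlaut in it.",
        "OK again. }")

# Indices of elements containing at least one non-ASCII character.
idx <- grep("[^[:ascii:]]", x2, perl = TRUE)

# The same test as a logical vector.
flags <- grepl("[^[:ascii:]]", x2, perl = TRUE)
```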

Wiktor Stribiżew