Detect non-ASCII characters in a string

Question

How can I detect non-ASCII characters in a vector of strings in a grep-like fashion? For the example below, I'd like to return either c(1, 3) or c(TRUE, FALSE, TRUE, FALSE):

x <- c("façile test of showNonASCII(): details{", 
    "This is a good line", "This has an ümlaut in it.", "OK again. }")

Attempt:

y <- tools::showNonASCII(x)
str(y)
p <- capture.output(tools::showNonASCII(x))

@David I think that will do it... can you throw down as an answer. Maybe others will see an issue with it or have different solutions. — Tyler Rinker, Jan 05 '16 at 14:23
Why not fix the code so it handles Unicode properly instead? — Panagiotis Kanavos, Jan 05 '16 at 14:26
@PanagiotisKanavos I will, that's easy, but this is to validate strings so I first need to detect if there's a problem with the data so as to inform the client. — Tyler Rinker, Jan 05 '16 at 14:29
Why are Latin1 characters considered a problem? Are you trying to detect some *other* problem perhaps, eg invalid codepage conversions? — Panagiotis Kanavos, Jan 05 '16 at 14:34
@PanagiotisKanavos b/c it's data from a client. We want it in a particular format. Non-standard data format is a data scientist's enemy, particularly if you're trying to automate a task. It's far easier and cheaper to get clients to put data in the correct format than to try to clean up and address un-foreseen errors later. — Tyler Rinker, Jan 05 '16 at 14:42
[related question](http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files) — Mutador, Jan 05 '16 at 14:49

score 25 · Answer 1 · answered Jan 15 '16 at 03:11

I came across this later. It uses only a base R regex and is very simple:

grepl("[^ -~]", x)
## [1]  TRUE FALSE  TRUE FALSE

More here: http://www.catonmat.net/blog/my-favorite-regex/
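
A minimal sketch of how this pattern can produce both forms requested in the question, assuming base R only. The strings are written with `\u` escapes so the example does not depend on the encoding of the source file; `pattern`, `hits`, `idx`, `has_tab`, and `offenders` are illustrative names, not part of the original answer. Note that `[^ -~]` also flags ASCII control characters such as tabs, because they fall outside the space-to-tilde range.

```r
x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

# Anything outside the printable ASCII range (space through tilde)
pattern <- "[^ -~]"

# Logical form
hits <- grepl(pattern, x)
hits
## [1]  TRUE FALSE  TRUE FALSE

# Index form
idx <- grep(pattern, x)
idx
## [1] 1 3

# Caveat: a tab is ASCII, but it is not in the space-to-tilde range
has_tab <- grepl(pattern, "tab\tseparated")
has_tab
## [1] TRUE

# List the offending characters per element, e.g. to report them back
offenders <- regmatches(x, gregexpr(pattern, x))
```

If tabs or newlines are acceptable in the data, the character class would need to be widened accordingly.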

answered Jan 15 '16 at 03:11

Tyler Rinker


From the link: "[ -~] matches all ASCII characters from the space to tilde. What are these characters? These are all printable characters!" – sbha Mar 22 '21 at 14:51
Short, simple, smart: simply beautiful! – Laurent Bergé Sep 24 '21 at 14:57

score 21 · Accepted Answer · answered Jan 05 '16 at 14:48

Another possible way is to convert the strings to ASCII and then detect the non-printable control characters generated for anything that could not be converted:

grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
## [1]  TRUE FALSE  TRUE FALSE

It seems stringi also has a built-in function for this kind of thing:

stringi::stri_enc_mark(x)
# [1] "latin1" "ASCII"  "latin1" "ASCII"
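
As a hedged sketch, stringi also provides `stri_enc_isascii()`, which returns a logical vector directly and so avoids comparing encoding labels. The example assumes the stringi package is installed (hence the `requireNamespace()` guard); `non_ascii` is an illustrative name. The labels reported by `stri_enc_mark()` depend on how the strings are marked, so on a UTF-8 system the non-ASCII elements may be reported as "UTF-8" rather than "latin1".

```r
x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

# Only run when stringi is available
if (requireNamespace("stringi", quietly = TRUE)) {
  # TRUE where an element contains at least one non-ASCII byte
  non_ascii <- !stringi::stri_enc_isascii(x)
  print(non_ascii)

  # Index form, as in grep()
  print(which(non_ascii))
}
```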

answered Jan 05 '16 at 14:48

David Arenburg


Both solutions are terrific. This one is a bit more compact and may be more robust to other encodings, though, admittedly, I know very little about encodings. – Tyler Rinker Jan 05 '16 at 15:15

score 12 · Answer 3 · answered Jan 05 '16 at 14:47

Why don't you extract the relevant code from showNonASCII?

x <- c("façile test of showNonASCII(): details{", 
       "This is a good line", "This has an ümlaut in it.", "OK again. }")

grepNonASCII <- function(x) {
  asc <- iconv(x, "latin1", "ASCII")
  ind <- is.na(asc) | asc != x
  which(ind)
}

grepNonASCII(x)
#[1] 1 3
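
A possible variant, sketched under the assumption that a logical vector is wanted and that missing values should not be reported as non-ASCII. The helper name `is_non_ascii` and its `from` argument are hypothetical and not part of `tools::showNonASCII`. It relies on `iconv()` returning `NA` for elements it cannot convert.

```r
# Hypothetical helper: logical vector, TRUE where an element is not pure ASCII
is_non_ascii <- function(x, from = "latin1") {
  asc <- iconv(x, from, "ASCII")
  # NA inputs are treated as "no problem found" instead of being flagged
  !is.na(x) & (is.na(asc) | asc != x)
}

x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }",
       NA)

is_non_ascii(x)
## [1]  TRUE FALSE  TRUE FALSE FALSE

which(is_non_ascii(x))
## [1] 1 3
```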

answered Jan 05 '16 at 14:47

Roland


The iconv function appears to remove the variable label attributes randomly in a dataframe when applied. What could be reasons? – Heatshock Sep 17 '22 at 17:20
@Heatshock I have no idea what you are doing. `iconv` should preserve attributes. – Roland Sep 19 '22 at 05:37
I read in a SAS dataset in xpt format. Here is the code: `dv <- read_xpt('adsl.xpt')`, then `dv1 <- dv |> mutate(across(everything(), ~iconv(., "latin2", "ascii")))`, then `str(dv1)`. Some of the variable labels get lost, consistently for the same variables but not for all of them. Do you know why? – Heatshock Sep 22 '22 at 20:34
No, I don't. Might be due to your use of dplyr. – Roland Sep 23 '22 at 04:58

score 7 · Answer 4 · answered Jul 30 '19 at 13:55

A bit late I guess but it could be useful for the next readers.

You can find these functions:

showNonASCII(<character_vector>)
showNonASCIIfile(<file>)

in the tools R package (see https://stat.ethz.ch/R-manual/R-devel/library/tools/html/showNonASCII.html). They show the non-ASCII characters in a character vector or in a text file, which is what is asked here.

score 2 · Answer 5 · answered Oct 07 '21 at 15:10

A stringr regex solution:

library(stringr)
x <- c("façile test of showNonASCII(): details{", 
    "This is a good line",
    "This has an ümlaut in it.", "OK again. }")
str_detect(x, "[^[:ascii:]]")
# => [1]  TRUE FALSE  TRUE FALSE

The [^[:ascii:]] pattern matches any non-ASCII character.

The [[:ascii:]] pattern matches any ASCII character.

If you ever need to make sure the whole string consists of non-ASCII chars, use

str_detect(x, "^[^[:ascii:]]+\\z")

where ^ matches the start of string and \z matches the very end of string.
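
For readers without stringr, a rough base R equivalent can be sketched with `perl = TRUE`, since PCRE also understands the `[:ascii:]` class. This is an assumption-labelled alternative rather than part of the original answer, and `any_non_ascii` and `all_non_ascii` are illustrative names.

```r
x <- c("fa\u00e7ile test of showNonASCII(): details{",
       "This is a good line", "This has an \u00fcmlaut in it.", "OK again. }")

# At least one non-ASCII character somewhere in the string
any_non_ascii <- grepl("[^[:ascii:]]", x, perl = TRUE)
any_non_ascii
## [1]  TRUE FALSE  TRUE FALSE

# Whole string made of non-ASCII characters only (empty strings do not match)
all_non_ascii <- grepl("^[^[:ascii:]]+\\z",
                       c("\u00e7\u00fc", "fa\u00e7ile", ""), perl = TRUE)
all_non_ascii
## [1]  TRUE FALSE FALSE
```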
