R: Remove words written in alphabets other than latin (but including diacritics)

Question

I want to search a text for Italian geographic entities based on the geonames database, which I can download with

download.file('http://download.geonames.org/export/dump/IT.zip', destfile = 'IT.zip')
unzip('IT.zip', exdir = 'IT')
library(readr)
it_gn <- read_delim("IT/IT.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)

Column X4 includes alternative versions of the geographic name, including versions in other languages.

For example

it_gn$X4[it_gn$X1 == 2522713]
# [1] "Vittoira,Vittoria,vu~ittoria,ヴィットーリア"

Since the document I'm searching is in Italian, I want to remove all names written in alphabets other than the Latin alphabet, while keeping the diacritics used in Italian: -à , -è , -é , -ì , -ò , -ù and also -á , -í , -ó , -ú (these are not formally used in Italian but they might appear). But it is not clear which regex I should use to identify non-Latin alphabets.

I tried to apply this answer, but the regex doesn't seem to make any difference...

grepl('[^\\x00-\\x7F]', 'ヴィットーリア')
# [1] TRUE

grepl('[^\\x00-\\x7F]', 'Vittoria')
# [1] TRUE

score 4 · Accepted Answer · answered Nov 26 '16 at 01:59

First, the reason your regexps aren't working is that the regexp escape "\xNN" is a Perl extension, so you need to pass "perl=TRUE" if you want to use it:

> grepl('[^\\x00-\\x7F]', 'ヴィットーリア', perl=TRUE)
[1] TRUE
> grepl('[^\\x00-\\x7F]', 'Vittoria', perl=TRUE)
[1] FALSE
>

(Confusingly, the following will work:

> grepl('[^\x01-\x7F]', 'ヴィットーリア')
[1] TRUE
> grepl('[^\x01-\x7F]', 'Vittoria')
[1] FALSE
>

because without the double backslash, you're using an R string literal escape sequence "\xNN" instead of the regexp escape sequence above; this embeds the given byte directly in the string regardless of encoding, and it's pretty bad practice, so I'd avoid it here.)
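To make the distinction concrete, here is a small sketch (not part of the original answer) comparing the two spellings. It relies only on base R's nchar, identical and grepl; the letter "A" (byte 0x41) is an arbitrary choice for illustration.

```r
# Single backslash: the R parser resolves \x41 itself, so the string
# contains the one character "A" before any regex engine sees it.
literal <- '\x41'
nchar(literal)            # 1
identical(literal, 'A')   # TRUE

# Double backslash: the string holds the four characters \ x 4 1,
# and it is left to the regex engine to interpret them.
escaped <- '\\x41'
nchar(escaped)            # 4

# With perl = TRUE, PCRE reads \x41 as the character "A".
grepl(escaped, 'A', perl = TRUE)   # TRUE
```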

That being said, I think the most readable approach is to just include the Unicode characters in your R code:

isinvalid <- grepl('[^[:ascii:]àèéìòùáíóú]', name,
                   perl=TRUE, ignore.case=TRUE)

The perl=TRUE allows you to use [:ascii:], which, despite being ugly, seems more readable than the alternatives. The ignore.case=TRUE is necessary if you want capitalized versions of the accented characters to be treated as valid as well.

If your environment is too screwed up to include Unicode in your source code, then you can use normal "\u" escapes to include them:

isinvalid <- grepl('[^[:ascii:]\ue0\ue8\ue9\uec\uf2\uf9\ue1\ued\uf3\ufa]', name,
                   perl=TRUE, ignore.case=TRUE)

Note that you should use "\u" escapes and not "\x" escapes here. These are Unicode code points, rather than bytes inserted directly into the string. (Again, rather strangely, you could also use \\x escapes, taking advantage of the Perl extension, because the Perl regexp "\x" escape acts more like R's string literal "\u" escape than its "\x" escape.)
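For the task in the question, the check could be wrapped in a small helper along these lines. This is only a sketch: the name keep_latin_names is hypothetical, and it assumes the alternative names are comma-separated, as in the X4 example from the question. It uses the "\u" form of the pattern so that it does not depend on the encoding of the source file.

```r
# Hypothetical helper: for each element of x, keep only the comma-separated
# variants made up entirely of ASCII characters and the listed accented vowels.
keep_latin_names <- function(x) {
  # Matches any character that is neither ASCII nor one of
  # à è é ì ò ù á í ó ú (written as \u escapes).
  invalid <- "[^[:ascii:]\ue0\ue8\ue9\uec\uf2\uf9\ue1\ued\uf3\ufa]"
  out <- vapply(strsplit(x, ",", fixed = TRUE), function(variants) {
    ok <- !grepl(invalid, variants, perl = TRUE, ignore.case = TRUE)
    paste(variants[ok], collapse = ",")
  }, character(1))
  # Missing alternative names stay missing instead of becoming the string "NA".
  out[is.na(x)] <- NA_character_
  out
}

# The example from the question; the katakana variant is written with
# \u escapes here.
x4 <- "Vittoira,Vittoria,vu~ittoria,\u30f4\u30a3\u30c3\u30c8\u30fc\u30ea\u30a2"
keep_latin_names(x4)
```

With the data frame from the question, this would presumably be applied as it_gn$X4 <- keep_latin_names(it_gn$X4).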

Ugh... Anyway, I hope the extra explanation made things more clear rather than less.

In R `'\x20'` will get displayed as `" "`. I was getting an error about nul characters but your `[^\x01-\x7F]` was very helpful. — IRTFM, Nov 26 '16 at 02:12

IRTFM · Answer 2 · 2016-11-26T02:00:04.980

Using a caret in the first position of a character class negates it, which is not what you want. See this difference:

> grepl('[\\x00-\\x7F]', 'ヴィットーリア')
[1] FALSE
> 
>  grepl('[\\x00-\\x7F]', 'Vittoria')
[1] TRUE

And more to your needs:

> sapply( strsplit( 'ヴィットーリア', ""), grepl, patt='[\\x00-\\x7F]')
      [,1]
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE
[5,] FALSE
[6,] FALSE
[7,] FALSE

For reasons that I don't understand, I needed to use a longer patt argument to match lowercase ASCII:

patt='[A-Za-z\\x00-\\x7F]'

As for your further point, those diacriticals need to be included explicitly, because the pattern above does not match them:

> sapply( c('à' , 'è' , 'é' , 'ì' , 'ò' , 'ù' ,  'á' , 'í' , 'ó' , 'ú'), grepl, patt='[A-Za-z\\x00-\\x7F]')
    à     è     é     ì     ò     ù     á     í     ó     ú 
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

So you might try:

patt = "[A-Za-zàèéìòùáíóú]"

And also decide whether you want to exclude the line-feeds and tabs.
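One caveat worth illustrating: an unanchored character class returns TRUE as soon as a single character matches, so a name that mixes Latin and non-Latin characters would still pass. Below is a sketch of a whole-string variant. The anchors, the "+", and perl = TRUE are additions that are not part of this answer, and the class assumes names consist of letters only, so spaces, apostrophes and hyphens would have to be added for real place names.

```r
# The suggested class, with the accented vowels written as \u escapes.
patt  <- "[A-Za-z\ue0\ue8\ue9\uec\uf2\uf9\ue1\ued\uf3\ufa]"
# Anchored version: every character of the string must belong to the class.
whole <- paste0("^", patt, "+$")

# "Vittoria" followed by one katakana character.
mixed <- "Vittoria\u30a2"

grepl(patt,  mixed, perl = TRUE)   # TRUE: at least one allowed character
grepl(whole, mixed, perl = TRUE)   # FALSE: not every character is allowed
grepl(whole, c("Vittoria", "Forl\uec"), perl = TRUE)   # TRUE TRUE
```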

Actually, as I noted in my other answer, the reason the \\x escapes aren't working is that Perl regexps aren't enabled. Instead, your regexp is being treated as one that matches a backslash, an "x", a "0", a "0", any character between "0" and backslash, an "x", a "7" or an "F". Since the characters between "0" and backslash include the ASCII uppercase, it matches "Vittoria", but it wouldn't match "vittoria" in all lowercase, for example. — K. A. Buhr, Nov 26 '16 at 02:04
I should have used `patt='[\\x01-\\xFF]', perl=TRUE`. Works against all the ヴィットーリア and allows all the ASCII and diacriticals. — IRTFM, Nov 26 '16 at 02:16
