I want to search a text for Italian geographic entities using the GeoNames database, which I can download with:
download.file('http://download.geonames.org/export/dump/IT.zip', destfile = 'IT.zip')
unzip('IT.zip', exdir = 'IT')
require(readr)
it_gn <- read_delim("IT/IT.txt", "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
Column X4 includes alternative versions of the geographic name, including versions in other languages. For example:
it_gn$X4[it_gn$X1 == 2522713]
# [1] "Vittoira,Vittoria,vu~ittoria,ヴィットーリア"
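Each value in this column is a single comma-separated string, so any filtering has to work on the individual names. A minimal sketch of the split, using the example value typed in by hand (`alt` and `alt_names` are just illustrative names):

```r
# The example value for geonameid 2522713, written out by hand
alt <- "Vittoira,Vittoria,vu~ittoria,ヴィットーリア"

# fixed = TRUE treats the comma as a literal separator rather than a regex
alt_names <- strsplit(alt, ",", fixed = TRUE)[[1]]

length(alt_names)
# [1] 4
```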
Since the document I'm searching is in Italian, I want to remove all names written in alphabets other than the Latin alphabet, while keeping the diacritics used in Italian (à, è, é, ì, ò, ù) and also á, í, ó, ú (these are not formally used in Italian, but they might appear). However, it is not clear to me which regex I should use to identify non-Latin alphabets.
I tried to apply this answer, but the regex doesn't seem to make any difference, since it returns TRUE for both strings:
grepl('[^\\x00-\\x7F]', 'ヴィットーリア')
# [1] TRUE
grepl('[^\\x00-\\x7F]', 'Vittoria')
# [1] TRUE
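One direction that may be worth checking (a sketch under assumptions, not a confirmed fix): `grepl()` uses R's default regex engine unless `perl = TRUE` is given, and the PCRE engine handles both the `\x00-\x7F` range and Unicode script classes such as `\p{Latin}`. The helpers `has_non_latin` and `keep_latin` below are hypothetical names of my own. The sketch also assumes that "Latin script" is an acceptable stand-in for the exact list of Italian diacritics, so names with other Latin diacritics (for example ñ or ü) would be kept as well.

```r
# Hypothetical helper (not from any package): TRUE if x contains a letter
# that does not belong to the Latin script.
has_non_latin <- function(x) {
  # perl = TRUE selects the PCRE engine, which supports \p{...} classes.
  # (?!\p{Latin})\p{L} matches any letter that is not a Latin-script letter,
  # so accented Latin letters, digits and punctuation never trigger a match.
  grepl("(?!\\p{Latin})\\p{L}", x, perl = TRUE)
}

# Hypothetical helper: drop the non-Latin alternatives from one
# comma-separated X4 value and glue the remaining names back together.
keep_latin <- function(x) {
  if (is.na(x)) return(NA_character_)
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]
  paste(parts[!has_non_latin(parts)], collapse = ",")
}

# The ASCII-only check from above, this time with the PCRE engine
grepl("[^\\x00-\\x7F]", "Vittoria", perl = TRUE)
# [1] FALSE
```

If this behaves as expected, it could be applied to the whole column with something like `vapply(it_gn$X4, keep_latin, character(1), USE.NAMES = FALSE)`, assuming the strings were read in as UTF-8.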