Removing text containing non-English characters

Question

This is my sample dataset:

Name <- c("apple firm","苹果 firm","Ãpple firm")
Rank <- c(1,2,3)
data <- data.frame(Name,Rank)

I would like to delete every row whose Name contains a non-English character. For this sample, only "apple firm" should stay.

I tried the tm package, but it only removes the non-English characters themselves rather than the whole entries.

score 11 · Accepted Answer · edited May 23 '17 at 10:30

I would check out this related Stack Overflow post, "Regular expression to match non-English characters?", which does the same thing in JavaScript.

To translate this into R, you could do (to match non-ASCII):

res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1

And the same match written with Unicode escapes, per that same SO post:

res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]

res
# A tibble: 1 × 2
#        Name  Rank
#       <chr> <dbl>
#1 apple firm     1

Note: we had to take out the NUL character for this to work, because R strings cannot contain an embedded NUL. So instead of starting at \u0000 or \x00, we start at \u0001 and \x01.
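As a minimal sketch of the regex approach on the question's sample data, using base R only: the name has_non_ascii is introduced here for illustration, the \u escapes just keep the script itself ASCII-only, and stringsAsFactors = FALSE is an added assumption so that Name stays a character vector on older R versions. The which() call is not strictly needed, since a logical vector can index rows directly.

```r
# Sample data from the question, written with \u escapes
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank, stringsAsFactors = FALSE)

# TRUE where Name has at least one character outside \x01-\x7F
has_non_ascii <- grepl("[^\x01-\x7F]", data$Name)

# Keep only the rows whose Name is pure ASCII
res <- data[!has_non_ascii, ]
res
```

The trailing + from the original pattern is left out here because grepl only needs one offending character to report a match.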

Henrik · Answer 2 · 2017-03-27T16:31:49.720

score 10

The stringi package has the convenience function stri_enc_isascii:

library(stringi)
stri_enc_isascii(data$Name)
# [1]  TRUE FALSE FALSE

As the name suggests, the function "checks whether all bytes in a string are in the [ASCII] set 1,2,...,127" (from ?stri_enc_isascii).
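To go from that logical vector to the filtered data the question asks for, something like the following should work. This is a sketch that assumes stringi is installed; is_ascii is a name made up for this example, and the sample data is rebuilt so the snippet runs on its own.

```r
library(stringi)

# Sample data from the question, written with \u escapes
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank, stringsAsFactors = FALSE)

# TRUE where every byte of Name is in the ASCII range
is_ascii <- stri_enc_isascii(data$Name)

# Keep only the all-ASCII rows
res <- data[is_ascii, ]
res
```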

edited Mar 27 '17 at 16:31

answered Mar 27 '17 at 14:51


jess · Answer 3 · 2017-03-27T15:01:46.103

score 5

An alternative to regex would be to use iconv and then filter for non-NA entries:

library(dplyr)
data <- data %>% 
         mutate(Name = iconv(Name, from = "latin1", to = "ASCII")) %>%
         filter(!is.na(Name))

In the mutate statement, the strings are converted from latin1 (also known as ISO 8859-1) to ASCII. When a string contains a character that cannot be represented in ASCII, the conversion fails and the string becomes NA.
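The same idea can be sketched in base R without dplyr, keeping the original Name column and using the converted values only as a filter. Two assumptions here: from = "UTF-8" is used because the sample strings are built as UTF-8 (the answer above uses "latin1"), and the platform's iconv reports unconvertible input as a failure, which is what makes R return NA. Some iconv implementations may substitute a placeholder character instead, so treat this as a sketch rather than a portable guarantee. The names converted and keep are illustrative.

```r
# Sample data from the question, written with \u escapes
Name <- c("apple firm", "\u82f9\u679c firm", "\u00c3pple firm")
Rank <- c(1, 2, 3)
data <- data.frame(Name, Rank, stringsAsFactors = FALSE)

# Strings with characters that have no ASCII form are expected to become NA
converted <- iconv(data$Name, from = "UTF-8", to = "ASCII")

# Keep the rows that converted cleanly; Name itself is left untouched
keep <- !is.na(converted)
res <- data[keep, ]
res
```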

edited Mar 27 '17 at 15:01

answered Mar 27 '17 at 14:46


I have the same issue as @Frank, when I run `from = "ASCII"` and `to = "latin1"` it manages to convert the characters (although inaccurately) and doesn't give me the `NA`s. – Mike H. Mar 27 '17 at 15:00

@Frank, true dat, mixed up things. I edited my answer accordingly – jess Mar 27 '17 at 15:02
