
I am unsure whether a huge dataframe contains Chinese or Arabic characters. I would like to find out (a) whether there are indeed such values in a given column and, (b) if so, subset the respective rows.

Would that be possible in R? If so, how?

Here is an example dataframe:

> DF <- data.frame(Var = c("Test1", "Another test", "Oranges"), Names = c("汉字", "Lioba", "الْأَبْجَدِيَّة الْعَرَبِيَّة"))

> dput(DF)

structure(list(Var = c("Test1", "Another test", "Oranges"), Names = c("<U+6C49><U+5B57>", 
"Lioba", "<U+0627><U+0644><U+0652><U+0623><U+064E><U+0628><U+0652><U+062C><U+064E><U+062F><U+0650><U+064A><U+064E><U+0651><U+0629> <U+0627><U+0644><U+0652><U+0639><U+064E><U+0631><U+064E><U+0628><U+0650><U+064A><U+064E><U+0651><U+0629>"
)), class = "data.frame", row.names = c(NA, -3L))
anpami
  • Maybe: [Removing text containing non-english character](https://stackoverflow.com/questions/43049015/removing-text-containing-non-english-character) – Henrik Feb 05 '21 at 11:39

1 Answer


A quick solution that scans for any non-Latin characters (this works on the dataframe you provided; if you want to scan explicitly for Arabic and Chinese, you would need to adjust the grep() line a bit; a sketch of that is shown after the code):

DF <- data.frame(Var = c("Test1", "Another test", "Oranges"), 
                 Names = c("汉字", "Lioba", "الْأَبْجَدِيَّة الْعَرَبِيَّة"))

helpvec <- DF$Names
# find indices of entries with non-ASCII characters:
# iconv() replaces every character it cannot convert to ASCII with the
# marker string "helpvec", which grep() then searches for
nonlatin_ind <- grep("helpvec", iconv(helpvec, "latin1", "ASCII", sub = "helpvec"))
# create a new column that flags whether non-Latin text has been found
# (initialise it first so the assignment also works when the last row is Latin-only)
DF$Test <- NA
DF$Test[nonlatin_ind] <- "non-latin found here"

# to subset the matching rows, just use
DF[nonlatin_ind, ]
# or, to exclude them
DF[-nonlatin_ind, ]
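
If you do want to check explicitly for Arabic and Chinese rather than for anything non-Latin, one possibility (a sketch, assuming the strings are UTF-8 and using PCRE's Unicode script classes via perl = TRUE) is to match the Arabic and Han scripts directly:

# flag Arabic and Han (Chinese) script explicitly;
# \p{Arabic} and \p{Han} are PCRE Unicode script classes
has_arabic  <- grepl("\\p{Arabic}", DF$Names, perl = TRUE)
has_chinese <- grepl("\\p{Han}", DF$Names, perl = TRUE)

# (a) does the column contain any such values?
any(has_arabic | has_chinese)

# (b) subset the respective rows
DF[has_arabic | has_chinese, ]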
pookpash