I have raw data (a CSV file) that I loaded into RStudio, and I want to extract the columns that contain non-English data. How can I do that?
-
https://stackoverflow.com/questions/34613761/detect-non-ascii-characters-in-a-string – novica Dec 22 '19 at 05:59
-
Does this answer your question? [detect non ascii characters in a string](https://stackoverflow.com/questions/34613761/detect-non-ascii-characters-in-a-string). If not, can you please add a sample data and clarify your question? – DJV Dec 22 '19 at 06:02
-
@DJV Yes. I have a CSV file with the columns name, last name, city, and address. The city column, for example, might contain values like New York, Delhi, کابل, تهران, so I want to find the values that are not in English, together with their column names – Najla Naz Dec 22 '19 at 06:13
-
Came across this by chance: `grepl("[^ -~]", x)`. This matches any character outside the printable ASCII range (space through tilde), which covers every non-ASCII character; for more info check out http://www.catonmat.net/blog/my-favorite-regex/ – Chris Ruehlemann Dec 22 '19 at 10:12
-
For example: `x <- c("New York", "Delhi", "کابلت هران", "ü", "ß")`, then `grepl("[^ -~]", x)` returns `[1] FALSE FALSE TRUE TRUE TRUE` – Chris Ruehlemann Dec 22 '19 at 10:19
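One caveat about the pattern in the comments: `[^ -~]` flags everything outside the range from space to tilde, so it also reports ASCII control characters such as tabs. The sketch below contrasts it with a stricter pattern that only reports characters above code point 127. The vector `x_demo` and the second pattern are illustrative assumptions and do not come from the thread.

```r
# Hypothetical demo vector: plain ASCII, ASCII with a tab, and a non-ASCII letter (u-umlaut).
x_demo <- c("New York", "tab\there", "\u00fc")

# Flags anything outside printable ASCII, so the tab counts as a hit too.
printable_check <- grepl("[^ -~]", x_demo)

# Flags only characters above code point 127; the tab is left alone.
strict_check <- grepl("[^\\x01-\\x7F]", x_demo, perl = TRUE)

print(printable_check)
print(strict_check)
```

Which of the two is preferable depends on whether stray tabs or line breaks in the CSV should count as "non-English" data.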
2 Answers
You can use the `stringi` package. I will use @Chris Ruehlemann's sample data.
Sample data:
```r
x <- c("New York", "Delhi", "کابلت هران", "ü", "ß")
library(stringi)
# stri_enc_toascii() replaces every non-ASCII character with the control
# character \x1a (SUB), which [[:cntrl:]] then detects
grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))
```
This gives a `TRUE`/`FALSE` output:
```
[1] FALSE FALSE  TRUE  TRUE  TRUE
```
Next, you can extract the non-English values:
```r
x[grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))]
```
```
[1] "کابلت هران" "ü" "ß"
```
If you'd like, you can also use the base R solution:
```r
grepl("[^ -~]", x)
```
This gives the same result on the sample data:
```
[1] FALSE FALSE  TRUE  TRUE  TRUE
```
However, if you benchmark both approaches, the one using `stringi::stri_enc_toascii` is faster (about five times faster in the run below).
```r
library(rbenchmark)
# sample data
y <- sample(x, 1000, replace = TRUE)
benchmark(
  "stringi" = {
    grepl("[[:cntrl:]]", stringi::stri_enc_toascii(y))
  },
  "baseR" = {
    grepl("[^ -~]", y)
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed",
              "relative", "user.self", "sys.self"))
```
```
     test replications elapsed relative user.self sys.self
2   baseR         1000    0.96    5.053      0.96     0.00
1 stringi         1000    0.19    1.000      0.19     0.01
```
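Both checks work on a single character vector, while the question asks about the columns of a data frame. As a minimal sketch, the base R check can be applied to each column in turn to find the columns that contain at least one flagged value. The data frame `people`, the helper `has_non_ascii`, and the sample values are hypothetical; only the column names follow the asker's description.

```r
# Hypothetical sample data; "\u06a9\u0627\u0628\u0644" is the string "کابل" written with escapes.
people <- data.frame(
  name = c("Ann", "Bob", "Sara"),
  city = c("New York", "\u06a9\u0627\u0628\u0644", "Delhi"),
  address = c("1 Main St", "2 High St", "3 Park Rd"),
  stringsAsFactors = FALSE
)

# TRUE where a value contains at least one character outside printable ASCII.
has_non_ascii <- function(values) grepl("[^ -~]", as.character(values))

# One logical per column: does the column contain any flagged value?
flagged_cols <- vapply(people, function(col) any(has_non_ascii(col)), logical(1))

# Names of the flagged columns, then the columns themselves.
print(names(people)[flagged_cols])
print(people[, flagged_cols, drop = FALSE])
```

`drop = FALSE` keeps the result a data frame even when only one column is flagged.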

DJV
Since you say that you want to select the columns in your data frame whose names contain non-ASCII characters, here's a simple base R solution for that. Let's assume your data has this structure:
```r
df <- data.frame(
  NY = c("some", "data", "more", "data"),
  Delhi = c("some", "data", "more", "data"),
  کابل = c("some", "data", "more", "data"),
  ß = c("some", "data", "more", "data")
)
df
```
```
    NY Delhi کابل    ß
1 some  some some some
2 data  data data data
3 more  more more more
4 data  data data data
```
Then all you have to do is subset `df` on the columns of interest, using `grepl` to find matches of the pattern `[^ -~]` in `colnames`:
```r
df[ , grepl("[^ -~]", colnames(df))]
```
```
  کابل    ß
1 some some
2 data data
3 more more
4 data data
```

Chris Ruehlemann
-
Thank you, but I meant the data inside the columns. I worked it out as shown below, but it takes a while to execute – Najla Naz Dec 24 '19 at 04:49
-
```r
library(readxl)  # provides read_excel()
raw_data <- read_excel("./input/Extracted AFG1904_Emergency SNFI & Winterisation Assessment - all versions - False - 2019-12-23-04-50-46.xlsx")
# Creates an empty data frame
new_df <- data.frame(UUID = character(0), question_id = character(0), original_value = character(0))
```
– Najla Naz Dec 24 '19 at 04:50
-
```r
# Creates variables for the to_be_translated data frame
uuid <- vector()
question <- vector()
original_value <- vector()
# Looking for untranslated values
```
– Najla Naz Dec 24 '19 at 04:51
-
```r
for (coli in 1:ncol(raw_data)) {
  for (rowi in 1:nrow(raw_data)) {
    # grepl() takes the pattern first and the text to search second
    result <- grepl("[^\u0001-\u007F]+", as.character(raw_data[rowi, coli]))
    if (result == TRUE) {
      uuid <- c(uuid, as.character(raw_data[rowi, "_uuid"]))
      question <- c(question, names(raw_data[coli]))
      original_value <- c(original_value, as.character(raw_data[rowi, coli]))
    }
  }
  # Feedback
  cat("\014")
  print(paste("Logging Column", coli, "of", length(raw_data)))
}
```
– Najla Naz Dec 24 '19 at 04:51
-
Sorry, I can't see what you are trying to communicate to me. Did the two answers you have received so far *not* answer your question? If so, can you specify in clear terms what is missing? – Chris Ruehlemann Dec 24 '19 at 07:39
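The nested loop in the comments above tests one cell at a time, which is probably why it runs slowly. Since `grepl` is vectorised, a whole column can be tested in one call. The following is only a sketch under assumptions: the small `raw_data` data frame stands in for the asker's spreadsheet (only the `_uuid` column name is taken from the comment), and the pattern `[^ -~]` from the answers replaces the asker's own pattern.

```r
# Hypothetical stand-in for the spreadsheet; "\u06a9\u0627\u0628\u0644" is "کابل", "\u00fc" is "ü".
raw_data <- data.frame(
  `_uuid` = c("u1", "u2", "u3"),
  city = c("New York", "\u06a9\u0627\u0628\u0644", "Delhi"),
  note = c("ok", "fine", "\u00fc"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# For each column, test all rows in a single grepl() call and keep the hits.
hits <- lapply(names(raw_data), function(col) {
  values <- as.character(raw_data[[col]])
  found <- grepl("[^ -~]", values)
  data.frame(
    uuid = raw_data[["_uuid"]][found],
    question = rep(col, sum(found)),
    original_value = values[found],
    stringsAsFactors = FALSE
  )
})

# Stack the per-column results into one long table.
to_be_translated <- do.call(rbind, hits)
print(to_be_translated)
```

Each row of `to_be_translated` pairs a flagged value with its column name and the record's `_uuid`, which is the same information the loop collects in `uuid`, `question`, and `original_value`.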