I have raw data (a CSV file) that I loaded into RStudio, and I want to extract the columns that contain non-English data. How can I do that?

Najla Naz
  • https://stackoverflow.com/questions/34613761/detect-non-ascii-characters-in-a-string – novica Dec 22 '19 at 05:59
  • Does this answer your question? [detect non ascii characters in a string](https://stackoverflow.com/questions/34613761/detect-non-ascii-characters-in-a-string). If not, can you please add a sample data and clarify your question? – DJV Dec 22 '19 at 06:02
  • @DJV Yes. For example, I have a CSV file with the columns name, last name, city and address. The city column might contain values like New York, Delhi, کابل, تهران, so I want to find the values that are not in English, together with their column names – Najla Naz Dec 22 '19 at 06:13
  • Came across this by chance: `grepl("[^ -~]", x)`. This matches any non-ASCII character; for more info check out http://www.catonmat.net/blog/my-favorite-regex/ – Chris Ruehlemann Dec 22 '19 at 10:12
  • For example: `x <- c("New York", "Delhi", "کابلت هران", "ü", "ß"); grepl("[^ -~]", x)` returns `[1] FALSE FALSE TRUE TRUE TRUE` – Chris Ruehlemann Dec 22 '19 at 10:19

2 Answers

You can use the stringi package.

I will use @Chris Ruehlemann's sample data.

Sample data:

x <- c("New York", "Delhi", "کابلت هران", "ü", "ß") 

library(stringi)

grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))

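stri_enc_toascii replaces every non-ASCII character with the ASCII substitute control character (0x1A), which [[:cntrl:]] then matches, so this gives a TRUE/FALSE vector: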

[1] FALSE FALSE  TRUE  TRUE  TRUE

Next, you can extract the non-English values:

x[grepl("[[:cntrl:]]", stringi::stri_enc_toascii(x))]
[1] "کابلت هران" "ü"          "ß" 

If you'd like, you can also use the base R solution:

grepl("[^ -~]", x)

which gives the same result:

[1] FALSE FALSE  TRUE  TRUE  TRUE

However, if you benchmark both approaches, the stringi version is faster (about five times faster in the run below).

library(rbenchmark)

# sample data
y <- sample(x, 1000, replace = TRUE)

benchmark(
  "stringi" = {
    grepl("[[:cntrl:]]", stringi::stri_enc_toascii(y))
  },
  "baseR" = {
    grepl("[^ -~]", y)
  },
  replications = 1000,
  columns = c("test", "replications", "elapsed",
              "relative", "user.self", "sys.self"))

     test replications elapsed relative user.self sys.self
2   baseR         1000    0.96    5.053      0.96     0.00
1 stringi         1000    0.19    1.000      0.19     0.01
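
The same check can be applied column by column to a data frame, which is closer to what the question asks for. The following is only a minimal sketch under assumptions: `dat` is a made-up stand-in for the imported CSV, `has_non_ascii` is a hypothetical helper name, and the Persian city names from the question are written as `\u` escapes so the script does not depend on the file encoding.

```r
# Hypothetical data frame standing in for the imported CSV
dat <- data.frame(
  name = c("Ann", "Bob", "Cem", "Dee"),
  city = c("New York", "Delhi",
           "\u06a9\u0627\u0628\u0644",         # Kabul in Persian script
           "\u062a\u0647\u0631\u0627\u0646"),  # Tehran in Persian script
  stringsAsFactors = FALSE
)

# TRUE for each element containing a character outside printable ASCII
has_non_ascii <- function(v) grepl("[^ -~]", as.character(v))

# Flag the columns in which at least one value is non-ASCII
flagged <- vapply(dat, function(col) any(has_non_ascii(col)), logical(1))
names(dat)[flagged]

# The offending values, grouped by column name
non_ascii_values <- lapply(dat[flagged], function(col) col[has_non_ascii(col)])
non_ascii_values
```

Note that the pattern `[^ -~]` also flags control characters such as tabs, so "non-ASCII" here really means "anything other than printable ASCII".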
DJV

Since you say that you want to select the columns in your data frame whose names contain non-ASCII characters, here's a simple base R solution for that. Let's assume your data has this structure:

df <- data.frame(
  NY = c("some", "data", "more", "data"),
  Delhi = c("some", "data", "more", "data"),
  کابل = c("some", "data", "more", "data"),
  ß = c("some", "data", "more", "data")
)
df
    NY Delhi کابل    ß
1 some  some some some
2 data  data data data
3 more  more more more
4 data  data data data

Then all you have to do is subset df on the columns of interest, using grepl to find matches of the pattern [^ -~] in colnames(df):

df[ , grepl("[^ -~]", colnames(df))]
  کابل    ß
1 some some
2 data data
3 more more
4 data data
Chris Ruehlemann
  • Thank you, but I meant the data inside the columns. I did it as shown below, but it takes some time to execute – Najla Naz Dec 24 '19 at 04:49
  • raw_data <- read_excel("./input/Extracted AFG1904_Emergency SNFI & Winterisation Assessment - all versions - False - 2019-12-23-04-50-46.xlsx") #Creates an empty data frame new_df <- data.frame(UUID = character(0), question_id = character(0), original_value = character(0)) – Najla Naz Dec 24 '19 at 04:50
  • #Creates variables for to_be_translated data frame uuid <- vector() question <- vector() original_value <- vector() #Looking for un-translated values – Najla Naz Dec 24 '19 at 04:51
  • for(coli in 1:ncol(raw_data)){ for(rowi in 1:nrow(raw_data)){ result <- grepl(raw_data[rowi,coli], "[^\u0001-\u007F]+" ) if(result == TRUE){ uuid <- c(uuid, as.character(raw_data[rowi, "_uuid"])) question <- (c(question, names(raw_data[coli]))) original_value <- c(original_value, as.character(raw_data[rowi, coli])) } } # Feedback cat("\014") print (paste("Logging Column", coli, "of", length(raw_data))) } – Najla Naz Dec 24 '19 at 04:51
  • Sorry, I can't see what you are trying to communicate to me. Did the two answers you have received so far *not* answer your question? If so, can you specify in clear terms what is missing? – Chris Ruehlemann Dec 24 '19 at 07:39
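
The nested loop in the comments above tests one cell at a time, which is the likely reason it is slow. Note also that `grepl()` expects the pattern as its first argument and the text as its second. Below is a vectorised sketch of the same idea, not the asker's actual code: `raw_data` here is a small made-up stand-in for the imported sheet, and only the `_uuid` column name and the three output column names are taken from the comments.

```r
# Hypothetical stand-in for the imported sheet; `_uuid` mirrors the id column
# named in the comments, the other column names are made up
raw_data <- data.frame(
  `_uuid` = c("a1", "b2", "c3"),
  city    = c("New York", "\u06a9\u0627\u0628\u0644", "Delhi"),
  note    = c("ok", "fine", "\u062a\u0647\u0631\u0627\u0646"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

pattern <- "[^ -~]"  # anything other than printable ASCII

# One vectorised grepl() call per column instead of one call per cell
hits <- lapply(names(raw_data), function(col) {
  values <- as.character(raw_data[[col]])
  idx <- which(grepl(pattern, values))
  data.frame(
    UUID           = as.character(raw_data[["_uuid"]][idx]),
    question_id    = rep(col, length(idx)),
    original_value = values[idx],
    stringsAsFactors = FALSE
  )
})

# Stack the per-column results into one long table
to_be_translated <- do.call(rbind, hits)
to_be_translated
```

Each row of `to_be_translated` holds the record id, the name of the column in which the value was found, and the value itself, so the result has the same three columns as the empty data frame created in the comments.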