Get rid of all the non-English characters in R

Question

I have a data file that contains some French and Japanese text. The data file looks like the following:

It has two columns:

Col1 contains sentences; most of them are in English, and some are in a foreign language.
Col2 is all English.

Col1 looks something like this:

| _ - 5 | PR - The number of qualified candidates
| _ - 6 | PR - アルバイト募集を掲載していますが、応募者がほとんどいないため。
| _ - 8 | PR - Quick, easy, inexpensive and plenty of applicants

What I want is to keep only the English rows. If a row contains a foreign-language word, I need to delete the whole row.

Does anyone know how to do this in R?

Make an effort and try to [make your question reproducible](http://stackoverflow.com/q/5963269/1315767) — Jilber Urbina, Jun 18 '14 at 21:13
Very interesting question, but badly asked. I don't understand why it is so downvoted, though; it only needs an example. OP, can you please add some data? — agstudy, Jun 18 '14 at 21:17
@asb Kind of like deleting all non-ASCII. But if a line is in English and contains one non-ASCII character, don't delete it — user3754216, Jun 18 '14 at 21:30
[Somewhat related](http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files/9935242#9935242) — Josh O'Brien, Jun 18 '14 at 21:31

score 1 · Accepted Answer · answered Jun 18 '14 at 21:30

Maybe you can use the textcat package, which claims it can detect more than 74 languages. (It doesn't work with Arabic, unfortunately.)

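library("textcat")

# Read the sample rows; the leading '|' produces an empty first column (V1).
dat <- read.table(text='
| _ - 5 | PR - The number of qualified candidates
| _ - 6 | PR - アルバイト募集を掲載していますが、応募者がほとんどいないため。
| _ - 8 | PR - Quick, easy, inexpensive and plenty of applicants',
  sep = '|', stringsAsFactors = FALSE)

# Keep only the rows whose text column is detected as English.
dat[textcat(dat$V3) == "english", ]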

 V1      V2                                                      V3
1 NA  _ - 5                  PR - The number of qualified candidates
3 NA  _ - 8   PR - Quick, easy, inexpensive and plenty of applicants
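
If language detection is more than you need, a rougher base-R alternative follows from the asker's comment above: drop rows that are mostly non-ASCII, but keep English lines that contain a stray non-ASCII character. This is only a sketch and not part of the approach shown above. `non_ascii_share` is a hypothetical helper name, and the 0.2 threshold is an arbitrary assumption that you would need to tune on your own data. Note that it will not catch French rows, which are mostly ASCII.

```r
# Hypothetical helper: fraction of characters in each string that are non-ASCII.
non_ascii_share <- function(x) {
  vapply(as.character(x), function(s) {
    cp <- utf8ToInt(enc2utf8(s))             # Unicode code points of the string
    if (length(cp) == 0) 0 else mean(cp > 127)
  }, numeric(1), USE.NAMES = FALSE)
}

# Sample text; non-ASCII characters are written as \u escapes so the
# script does not depend on the encoding of the source file.
col1 <- c(
  "PR - The number of qualified candidates",
  "PR - \u30a2\u30eb\u30d0\u30a4\u30c8\u52df\u96c6",  # Japanese text
  "PR - Quick, easy, inexpensive caf\u00e9"          # English with one accent
)

share <- non_ascii_share(col1)
keep  <- share <= 0.2    # assumed threshold, not a recommended value
col1[keep]
```

With a data frame like `dat` above, the same logical vector could be used as a row index, e.g. `dat[non_ascii_share(dat$V3) <= 0.2, ]`.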
