
I have a data file that contains some French and Japanese text. The data looks like this:

It has two columns:

  • Col1 contains sentences; most of them are in English, but some are in a foreign language.
  • Col2 is all English.

Col1 looks something like this:

| _ - 5 | PR - The number of qualified candidates
| _ - 6 | PR - アルバイト募集を掲載していますが、応募者がほとんどいないため。
| _ - 8 | PR - Quick, easy, inexpensive and plenty of applicants 

I want to keep only the English rows. If a row contains a foreign-language word, I need to delete the whole row.

Does anyone know how to do this in R?

Has QUIT--Anony-Mousse
user3754216

1 Answer

Maybe you can use the textcat package, which claims it can detect more than 74 languages. (It doesn't work with Arabic, though. :()

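library("textcat")

# Read the sample rows; the leading '|' produces an empty first column (V1).
dat <- read.table(text='
| _ - 5 | PR - The number of qualified candidates
| _ - 6 | PR - アルバイト募集を掲載していますが、応募者がほとんどいないため。
| _ - 8 | PR - Quick, easy, inexpensive and plenty of applicants', sep='|', stringsAsFactors=FALSE)

# Keep only the rows whose text is detected as English.
# %in% treats an NA result (language not detected) as "not English".
dat[textcat(dat$V3) %in% "english", ]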

 V1      V2                                                      V3
1 NA  _ - 5                  PR - The number of qualified candidates
3 NA  _ - 8   PR - Quick, easy, inexpensive and plenty of applicants
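
If textcat turns out to be unreliable on very short rows, a cruder fallback is to drop every row that contains a non-ASCII character. This is a different technique from language detection, shown only as a minimal sketch in base R: it assumes the text is UTF-8 encoded, and the names ascii_only and english_like are made up for illustration. It will not catch French written with plain ASCII letters, and it will also drop English rows that contain accented letters or curly quotes.

```r
# Sample rows from the question, built directly as a data frame.
dat <- data.frame(
  V2 = c("_ - 5", "_ - 6", "_ - 8"),
  V3 = c("PR - The number of qualified candidates",
         "PR - アルバイト募集を掲載していますが、応募者がほとんどいないため。",
         "PR - Quick, easy, inexpensive and plenty of applicants"),
  stringsAsFactors = FALSE
)

# iconv() returns NA for a string that cannot be converted to ASCII,
# so a non-NA result means the row contains only ASCII characters.
# Assumption: the input strings are UTF-8 encoded.
ascii_only <- !is.na(iconv(dat$V3, from = "UTF-8", to = "ASCII"))

# Keep the ASCII-only rows (hypothetical stand-in for "English").
english_like <- dat[ascii_only, ]
print(english_like)
```

For the three sample rows this keeps the first and the third. Combining both checks (ASCII-only and detected as English by textcat) may be stricter than either one alone, but that is untested here.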
agstudy