Getting wrong answer from langdetect.detect

Question

I am using both NLTK and scikit-learn to do some text processing. I have a data set of sentences, some of which describe the situation in both French and English (the French part duplicates the English one), and I want to delete the French part. The following is one of my sentences:

"quipage de Global Express en provenance deTokyo Japon vers Dorval a d effectuer une remise des gaz sur la piste cause d un probl me de volets Il fut autoris se poser sur la piste Les services d urgence n ont pas t demand s appareil s est pos sans encombre D lai d environ minutes sur l exploitation The crew of Global Express from Tokyo Japan to Dorval had to pull up on Rwy at because of a flap problem It was cleared to land on Rwy Emergency services were not requested The aircraft touched down without incident Delay of about minutes to operations Regional Report of m d y with record s "

I want to remove all words that are in French. I have tried the following code so far, but the result is not good enough.

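from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

words = sentence.split()
kept = []
for word in words:
    try:
        lang = detect(word)
    except LangDetectException:
        # Raised for tokens with no usable features (e.g. digits only)
        lang = None
    if lang == 'fr':
        print(word)
    else:
        # Build a new list instead of removing from the list being iterated
        kept.append(word)
x = kept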

The following is my output:

l
un
sur
une
oiseaux
avoir
un
le
du
un
est

Is this a good approach? How can I improve it to get better results?

score 1 · Answer 1 · answered Dec 04 '18 at 19:20


Language detection usually needs at least a full sentence to do a decent job. One or two short words are probably not going to be enough. Think about the "a" in "Dorval a d effectuer" above: is "a" by itself French or English? Is "Tokyo" French?

I'd also double-check whether this library can handle the kind of non-standard French (no accents, no apostrophes, missing letters, etc.) that you have in your data by checking to see what the library detects for longer strings. It's possible the library is only good at figuring out that more standard French is French. For example, d'un problème vs. your data: d un probl me.
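One way to give the detector more context is to classify short runs of words instead of single words. The following is only a rough sketch of that idea, not something taken from the langdetect documentation: drop_language and its window parameter are hypothetical names, and the detector is passed in as a callable so that langdetect.detect (or any other detector) can be plugged in.

```python
def drop_language(text, detect, lang="fr", window=8):
    """Return `text` without the word chunks that `detect` labels as `lang`.

    `detect` is any callable mapping a string to a language code,
    for example langdetect.detect (assumed to be installed separately).
    `window` is the number of words per chunk (a value to tune, not a recommendation).
    """
    words = text.split()
    kept = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        try:
            code = detect(" ".join(chunk))
        except Exception:
            # langdetect raises LangDetectException when a chunk has no
            # usable features; such chunks are kept rather than dropped.
            code = None
        if code != lang:
            kept.extend(chunk)
    return " ".join(kept)
```

With langdetect installed, this would be called as drop_language(sentence, langdetect.detect). A chunk that straddles the French/English boundary can still be misclassified, so the window size is worth experimenting with. langdetect is also non-deterministic by default; setting DetectorFactory.seed = 0 before the first call makes results reproducible.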

See also this question for other approaches where you can restrict the possible set of languages: Python langdetect: choose between one language or the other only
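As a rough illustration of restricting the choice to two languages: langdetect.detect_langs returns a list of results with .lang and .prob attributes, and you can ignore every candidate outside your allowed set. The helper name choose_between below is hypothetical, and this is a sketch of the idea rather than the approach from the linked question.

```python
def choose_between(results, allowed=("en", "fr")):
    """Pick the most probable language among `allowed`.

    `results` is a list of objects with `.lang` and `.prob` attributes,
    such as the list returned by langdetect.detect_langs(text).
    Returns None when none of the allowed languages was proposed.
    """
    candidates = [r for r in results if r.lang in allowed]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.prob).lang
```

With langdetect installed, usage would look like choose_between(detect_langs(chunk)), so a chunk that the detector guesses to be, say, Italian or Catalan is still assigned to whichever of English or French scored higher.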

answered Dec 04 '18 at 19:20

aab


Thanks for your response. In fact, the French words did have accents and apostrophes; I removed them because with them I got the following error: 'No features in text'. I came across the following code, but it did not work: def unusual_words(text): text_vocab = set(w.lower() for w in text if w.isalpha()) english_vocab = set(w.lower() for w in nltk.corpus.words.words()) unusual = text_vocab.difference(english_vocab) return sorted(unusual) – homa mohammadpour sadigh Dec 05 '18 at 17:07
If your goal is language detection, it's probably better not to do that kind of preprocessing. The accents are very good clues about French vs. English. – aab Dec 06 '18 at 08:44
You are right. I will definitely check that. Thanks – homa mohammadpour sadigh Dec 06 '18 at 16:50
