I am using both NLTK and scikit-learn to do some text processing. I have a data set of sentences, some of which describe the situation in both French and English (the French part duplicates the English part), and I want to delete the French part. The following is one of my sentences:
"quipage de Global Express en provenance deTokyo Japon vers Dorval a d effectuer une remise des gaz sur la piste cause d un probl me de volets Il fut autoris se poser sur la piste Les services d urgence n ont pas t demand s appareil s est pos sans encombre D lai d environ minutes sur l exploitation The crew of Global Express from Tokyo Japan to Dorval had to pull up on Rwy at because of a flap problem It was cleared to land on Rwy Emergency services were not requested The aircraft touched down without incident Delay of about minutes to operations Regional Report of m d y with record s "
I want to remove all words that are in French. I have tried the following code so far, but the result is not good enough.
from langdetect import detect

x = sentence.split()
# Iterate over a copy: removing items from a list while looping over it skips elements.
for word in list(x):
    lang = detect(word)
    if lang == 'fr':
        print(word)
        x.remove(word)
The following is my output:
l
un
sur
une
oiseaux
avoir
un
le
du
un
est
Is this a good approach? How can I improve it to get better results?
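One possible direction, as a rough sketch only: language detectors tend to be unreliable on single words, so the text could be classified in windows of several words instead, dropping every window detected as French. The helper name `drop_language`, the `window` size of 5, and passing the detector in as a parameter are all assumptions made for illustration, not something from my code above. Any callable that maps a string to a language code should work in place of `detect`.

```python
def drop_language(words, detect, lang="fr", window=5):
    """Return `words` without the windows that `detect` labels as `lang`.

    `detect` is any callable taking a string and returning a language code
    (for example langdetect.detect). `window` is a hypothetical chunk size.
    """
    kept = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        try:
            is_target = detect(" ".join(chunk)) == lang
        except Exception:
            # Detectors can fail on text with no usable features; keep the chunk.
            is_target = False
        if not is_target:
            kept.extend(chunk)
    return kept


# Possible usage with langdetect (assumes the package is installed):
# from langdetect import detect, DetectorFactory
# DetectorFactory.seed = 0  # make detection results repeatable
# cleaned = " ".join(drop_language(sentence.split(), detect))
```

A window that straddles the French/English boundary can still be misclassified, so the window size is a trade-off between detection reliability and how precisely the boundary is found.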