This question has two parts: whether removing foreign words is necessary, and what the best way to remove them is.
I'm a beginner trying to extract topics from English food reviews using Latent Dirichlet Allocation in Python. The output is 5 topics with 50 words each, and I have used NLTK to remove English stopwords. But one (and only one) topic contains many foreign words that may not carry meaning, like "de", "la", "et", and "les".
Some original reviews that contain these words (a sketch of my pipeline follows them):
- A la carte sushi is great. Pot of soup is huge and delicious.
- I would be interested in returning to try their Anticuchos, Ceviche de Mixto, Cau Cau, Aji de Gallina, and Chaufa de Camaron.
- I recommend patients in the parking lot. I would be lying if I didn't admit its some of the finest que in the country!
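For context, my pipeline looks roughly like this. It's a minimal sketch assuming gensim's LdaModel; the toy review list and variable names are illustrative, not my actual code:

import nltk
from nltk.corpus import stopwords
from gensim import corpora, models

# may need once: nltk.download('punkt'); nltk.download('stopwords')
reviews = ["A la carte sushi is great.",
           "Ceviche de Mixto and Aji de Gallina."]  # toy examples

stop = set(stopwords.words('english'))
# tokenize, lowercase, drop non-alphabetic tokens and English stopwords
texts = [[t for t in nltk.word_tokenize(r.lower()) if t.isalpha() and t not in stop]
         for r in reviews]

dictionary = corpora.Dictionary(texts)            # map tokens to ids
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words per review

lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=10)
for topic in lda.show_topics(num_topics=5, num_words=50):
    print(topic)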
The next step is to get the user and item vectors, and then train, test, and validate the results.
Are these words meaningful, or should they be removed?
And if they should, how do I remove them?
One answer in the question linked below suggests using NLTK's set of English words, but I found the set's coverage quite limited, and words like "de" and "un" still cannot be removed with it:
import nltk

# may need once: nltk.download('words')
words = set(nltk.corpus.words.words())
len(words)  # 235892
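For reference, the filtering step I tried with that word set looks roughly like this (a sketch; as noted above, tokens like "de" survive it in my runs):

english_vocab = set(w.lower() for w in nltk.corpus.words.words())

def keep_english(tokens):
    # keep only tokens that appear in the NLTK English word list
    return [t for t in tokens if t.lower() in english_vocab]

print(keep_english(["ceviche", "de", "mixto", "gallina"]))  # "de" is not filtered out for me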
Another answer suggests the Python package pyenchant, but it's not maintained anymore:
Removing non-English words from text using Python
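One workaround I've considered is simply extending the stopword list with NLTK's French and Spanish lists, which, as far as I can tell, contain "de", "la", "et", and "les". But I don't know whether dropping these words is actually the right thing to do for LDA, which is the first part of my question:

from nltk.corpus import stopwords

# may need once: nltk.download('stopwords')
stop = (set(stopwords.words('english'))
        | set(stopwords.words('french'))
        | set(stopwords.words('spanish')))

# applied before building the dictionary, so "de", "et", "les" never reach LDA
tokens = ["ceviche", "de", "mixto", "et", "les", "burger"]
print([t for t in tokens if t not in stop])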
The problematic topic from my results is:
pizza burger cheese de good place crust sauce burgers order et service toppings pizzas like la fresh le thin restaurant un slice best great delivery time pour poutine delicious garlic menu try pepperoni est taste back les sandwich meat food better style fast plus minutes que little pie onion pas