
This question actually has two parts: whether removing foreign words is necessary, and what the best way to do it is.

I'm a beginner trying to extract topics from English food reviews using latent Dirichlet allocation in Python. The output is 5 topics with 50 words each, and I have used NLTK to remove English stopwords. But one (and only one) topic contains many foreign words that might not carry any meaning, like "de", "la", "et", "les".

Some original reviews that contain these words:

- A la carte sushi is great. Pot of soup is huge and delicious.
- I would be interested in returning to try their Anticuchos, Ceviche de Mixto, Cau Cau, Aji de Gallina, and Chaufa de Camaron.
- I recommend patients in the parking lot. I would be lying if I didn't admit its some of the finest que in the country!

The next step is to get the user vectors and item vectors, then train, test, and validate the results.

Are these words meaningful, or should they be removed?

And how can they be removed?

One answer in the question linked below suggests using the NLTK set of English words, but I found that word set quite small, and words like "de" and "un" cannot be removed with it.

import nltk

words = set(nltk.corpus.words.words())
len(words)  # 235892

Another answer suggests the Python package PyEnchant, but it's not maintained anymore.

Removing non-English words from text using Python
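One possible workaround, as a minimal sketch rather than a definitive answer, is to treat the foreign function words as extra stopwords and filter them out together with the English ones. The word list and the names `FOREIGN_FUNCTION_WORDS`, `tokenize` and `remove_function_words` are made up for illustration. NLTK's stopwords corpus also ships lists for other languages such as French and Spanish, which could replace the hand-written set.

```python
import re

# Hypothetical, hand-picked list for illustration only.
# A fuller per-language stopword list would normally be used instead.
FOREIGN_FUNCTION_WORDS = {
    "de", "la", "le", "les", "et", "un", "une",
    "est", "pour", "que", "pas", "plus",
}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_function_words(tokens, extra_stopwords=FOREIGN_FUNCTION_WORDS):
    """Drop every token that appears in the extra stopword set."""
    return [t for t in tokens if t not in extra_stopwords]


review = ("I would be interested in returning to try their "
          "Ceviche de Mixto and Aji de Gallina.")
tokens = remove_function_words(tokenize(review))
print(tokens)
```

This keeps dish names such as "ceviche" while dropping "de". Some entries are ambiguous: "plus" is also an English word, and "que" in the "finest que in the country" review seems to mean barbecue, so a list like this probably needs checking against the corpus before use.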

The topic results I got are:

pizza burger cheese de good place crust sauce burgers order et service toppings pizzas like la fresh le thin restaurant un slice best great delivery time pour poutine delicious garlic menu try pepperoni est taste back les sandwich meat food better style fast plus minutes que little pie onion pas
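Words like "est", "pour", "pas" and "poutine" landing in the same topic might suggest that some reviews are written entirely in French rather than being English reviews with a few borrowed words. If that is the case, filtering whole reviews may work better than filtering single words. The sketch below is a rough heuristic under that assumption: it measures the share of tokens that are French function words and drops reviews above a threshold. The marker list, the threshold of 0.2, the function names and the French sample sentence are all invented for illustration.

```python
import re

# Hypothetical marker list; not exhaustive.
FRENCH_MARKERS = {
    "de", "la", "le", "les", "et", "un", "une",
    "est", "pour", "que", "pas", "dans", "très",
}


def foreign_share(text, markers=FRENCH_MARKERS):
    """Return the fraction of tokens that are in the marker set."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in markers for t in tokens) / len(tokens)


def keep_english(reviews, threshold=0.2):
    """Keep reviews whose share of marker words is below the threshold."""
    return [r for r in reviews if foreign_share(r) < threshold]


reviews = [
    "A la carte sushi is great. Pot of soup is huge and delicious.",
    # Made-up French sentence, not taken from the data set.
    "La poutine est très bonne et le service est rapide.",
]
kept = keep_english(reviews)
print(kept)
```

The English review contains "la" once, which gives a low share, so it is kept. The threshold would need tuning on real data.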

  • Natural language processing is a never-ending problem in IT. There are some solutions, however. I can give you two hints. First, forget about removing them word by word; use sets of words instead. Some single words have meaning in many languages, which is why you should not remove `a` but try to find `a la carte`. Second, profile your texts and keep only the words that occur most often. But those might not be the best solutions :) – Laszlowaty Apr 08 '19 at 12:44
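The two hints in the comment could be sketched roughly as follows, assuming a hand-made phrase list and a simple document-frequency cutoff. `PHRASES`, `merge_phrases`, `frequent_vocabulary` and `min_docs` are hypothetical names, and the second sample sentence is invented. For real use, scikit-learn's `CountVectorizer` (`min_df`, `max_df`) and gensim's `Dictionary.filter_extremes` offer similar frequency filtering.

```python
import re
from collections import Counter

# Hypothetical list of multi-word expressions to keep as single tokens.
PHRASES = ["a la carte"]


def merge_phrases(text, phrases=PHRASES):
    """Lowercase the text and join each known phrase with underscores."""
    text = text.lower()
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text


def tokenize(text):
    """Split into tokens after phrases have been merged."""
    return re.findall(r"[a-z_']+", merge_phrases(text))


def frequent_vocabulary(docs, min_docs=2):
    """Return the words that occur in at least `min_docs` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per doc
    return {word for word, n in doc_freq.items() if n >= min_docs}


docs = [
    "A la carte sushi is great.",
    "The a la carte menu is small.",  # made-up sentence for illustration
    "Pot of soup is huge and delicious.",
]
vocab = frequent_vocabulary(docs)
print(sorted(vocab))
```

Merging the phrase first means "a" and "la" no longer appear as separate tokens when they are part of "a la carte". The frequency cutoff only drops rare words, so frequent function words such as "de" would still need a stopword list.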
