
I am a newbie to machine learning. What I currently want to do is classify whether some words come under a category or not.

Let me be more specific: on inputting some words, I need to check whether those words come under a language known as "Malayalam".

Example: enthayi ninakk sugamanno?

These are Malayalam words expressed in English (romanized). Given input like this, it needs to check the trained data, and if any of the input words come under the category 'Malayalam', it should display that the text is Malayalam.

What I've tried to do:

I tried to classify it with a NaiveBayesClassifier, but it always shows a positive response for every input.

train = [
    ('aliya', 'Malayalam'),
]
cl = NaiveBayesClassifier(train)
print cl.classify('enthayi ninakk sugamanno')

But the print statement gives the output 'Malayalam'.

Ajay Victor

1 Answer


You need both positive and negative data to train a classifier. It wouldn't be hard to add a bunch of English text, or whatever the likely alternatives are in your domain. But you need to read up on how an nltk classifier actually works, or you'll only be able to handle words that you've seen in your training data: You need to select and extract "features" that the classifier will use to do its job.
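To make the "positive and negative data" point concrete, here is a minimal hand-rolled naive Bayes sketch (pure Python, not the textblob or nltk API). The feature functions and word lists are toy assumptions, not real Malayalam linguistics; the point is only to show that with two classes the classifier can say "no", while with one class it cannot:

```python
from collections import Counter, defaultdict
import math

def features(word):
    # Hypothetical character-level features; real features would need
    # knowledge of romanized Malayalam spelling patterns.
    return ['last2=' + word[-2:], 'first=' + word[0]]

def train_nb(data):
    # Count labels and per-label feature occurrences.
    label_counts = Counter(label for _, label in data)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for word, label in data:
        for f in features(word):
            feat_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feat_counts, vocab

def classify(model, word):
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float('-inf')
    for label, n in label_counts.items():
        lp = math.log(n / total)  # log prior
        denom = sum(feat_counts[label].values()) + len(vocab)
        for f in features(word):
            # Laplace smoothing so unseen features don't zero out the score.
            lp += math.log((feat_counts[label][f] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy training data with BOTH classes.
train = [('aliya', 'Malayalam'), ('sugamanno', 'Malayalam'),
         ('hello', 'English'), ('yellow', 'English')]
model = train_nb(train)
print(classify(model, 'sughamano'))  # → Malayalam

# With only one class in the training data (as in the question),
# every input is classified as that class -- there is no alternative.
model_one = train_nb([('aliya', 'Malayalam')])
print(classify(model_one, 'hello'))  # → Malayalam
```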

So (from the comments) you want to categorize individual words as being Malayalam or not. If your "features" are whole words, you are wasting your time with a classifier; just make a Python set() of Malayalam words, and check if your inputs are in it. To go the classifier route, you'll have to figure out what makes a word "look" Malayalam to you (endings? length? syllable structure?) and manually turn these properties into features so that the classifier can decide how important they are.
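The set-lookup alternative really is this short (word list is a toy placeholder):

```python
# Whole-word lookup: no classifier needed if your "features" are whole words.
malayalam_words = {'enthayi', 'ninakk', 'sugamanno', 'aliya'}  # toy list

def is_malayalam(word):
    return word.lower() in malayalam_words

print([w for w in 'enthayi ninakk sugamanno hello'.split() if is_malayalam(w)])
# → ['enthayi', 'ninakk', 'sugamanno']
```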

A better approach for language detection is to use letter trigrams: Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. I had good results with "cosine similarity" as a measure of distance between the sample text and the reference data. In this question you'll see how to calculate cosine similarity, but for unigram counts; use trigrams for language identification.
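A bare-bones sketch of the trigram-plus-cosine idea (the reference "profiles" here are built from a few sample words, purely for illustration; a real system would compute them from large corpora):

```python
from collections import Counter
import math

def trigram_profile(text):
    # Pad with spaces so word boundaries also form trigrams.
    text = ' ' + text.lower() + ' '
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    # Cosine similarity between two trigram count vectors.
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy reference profiles; real ones come from large reference corpora.
ref_ml = trigram_profile('enthayi ninakk sugamanno aliya eda sughamano')
ref_en = trigram_profile('how are you doing today my friend hello world')

sample = trigram_profile('ninakk enthayi')
print('Malayalam' if cosine(sample, ref_ml) > cosine(sample, ref_en) else 'English')
```

Note that the sample is scored against every language profile and the closest one wins, which is what lets this scale to many languages.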

Two benefits of the trigram approach: You are not dependent on familiar words, or on coming up with clever features, and you can apply it to stretches of text longer than a single word (even after filtering out English), which will give you more reliable results. The nltk's langid corpus provides trigram counts for hundreds of common languages, but it's also easy enough to compile your own statistics. (See also nltk.util.trigrams().)

alexis
  • I just added some alternatives but still the fact is that the system shows the same output as Malayalam, even if the input data is different. – Ajay Victor Sep 24 '17 at 19:41
  • I suggest you read the documentation. You are not initializing your classifier correctly, I'm surprised it even runs. You should just create it without arguments (`cl = NaiveBayesClassifier()`), then train it with `cl.train(data)` with data in the *appropriate* format. Where did you see the set-up you are using? – alexis Sep 24 '17 at 20:41
  • http://stevenloria.com/how-to-build-a-text-classification-system-with-python-and-textblob/ From here I got the syntax... – Ajay Victor Sep 26 '17 at 12:22
  • Oh that's the `textblob` interface! I've no idea what textblob does with that input -- that's what you get for leaving the `import` statements out of the code in your question. But it classifies phrases, not words, and from that link it looks like it's doing straight bag-of-words classification, i.e. dictionary lookup. If it's not working for you, use the `nltk` directly and read the docs. Or (my advice) use letter trigrams. – alexis Sep 26 '17 at 14:35
  • Can you explain a simple example of the naive Bayes algorithm with train and test data? – Ajay Victor Sep 28 '17 at 15:11
  • It's all in the nltk book, take a look and come back with a simple question. Or look at the code in [this question](https://stackoverflow.com/questions/20827741/nltk-naivebayesclassifier-training-for-sentiment-analysis), but be warned that the answer is pretty limited compared to the kind of features you can actually include. (It only does bag-of-words classification, like `textblob`.) Read the book. – alexis Sep 28 '17 at 16:10
  • But I don't think nltk's langid can be used here, as the language is mixed with English -- some words appear only in English. For example, a user may comment "suganno" or "sugamanno", which means "are you fine?". Can this be categorized using nltk's langid corpus? – Ajay Victor Sep 30 '17 at 04:22
  • You haven't said what you want to do in cases of mixed text. Do you _want_ your mixed example to be tagged Malayalam, or don't you? Actually you haven't said _anything_ about the specifics of your task. How large _are_ the units of classification (single sentences or tweets? paragraphs? documents...), and what do you want to do? Identify the major language? Detect any use of Malayalam, even single words in a paragraph of English? Without these details there is no answer to your question (and no question, really). – alexis Sep 30 '17 at 10:49
  • The input is an XML paragraph taken from Facebook, which has been split into words. After eliminating the English words (I used a spell checker, which returns all the misspelled words), the remaining mixed-text words come back as a list. What I simply want to check is whether those mixed-text words are Malayalam or not, that is, whether the author has used any Malayalam in that paragraph. – Ajay Victor Sep 30 '17 at 11:43
  • So you want to check individual words and report if even one of them _could_ be malayalam. You'd get more reliable results from working with the entire paragraph (or the entire set of words minus the English, if you prefer). But it's your project. You can still do trigram analysis in single words, and nothing stops you from doing word lookup in a Malayalam dictionary and _then_ doing a trigram analysis of words that were not in the dictionary. – alexis Sep 30 '17 at 11:57
  • If you go the classifier route, you'll have to figure out what makes a malayalam word (endings? length? syllable structure?) and manually turn these properties into features so that the classifier can decide how important they are. If your "features" are whole words, you are wasting your time with a classifier; just use a dictionary. – alexis Sep 30 '17 at 11:58
  • Actually, in Malayalam "eda" is എടാ. So in a Malayalam dictionary this word ('eda') is not present, whereas എടാ is present; that makes me unsure whether I can use a trigram or not. – Ajay Victor Sep 30 '17 at 13:49
  • You have to create your own Python dict, or trigram table, based on romanized Malayalam text... this discussion is not getting anywhere. – alexis Sep 30 '17 at 15:31