Encoding unique features

Question

I have an Excel sheet with 2 columns:

1. Words
2. Language

There is only one word on each row, and it is directly linked to a language.

How would I format those words and languages into data that a machine learning algorithm can accept?

I'm using scikit-learn and thought about bag of words, but it seemed to me that indexing every word wouldn't convey the characteristics of each word.

What is your classification task? What do you want to be the input and output of the trained system? – Hossein, Apr 10 '17 at 18:16
@Hossein The task would be to classify a given word as either English or Dutch. – JrProgrammer, Apr 10 '17 at 18:26

score 2 · Accepted Answer · edited May 23 '17 at 10:30

From your question, I think you are asking how to extract features from words so that you can train a classifier to determine each word's language. The length of the word and the character bigrams in the word are good features to start with. Take a look at this post for extracting character bigrams. In addition, the NLTK classifiers may be a suitable choice. For example,

from nltk.classify import NaiveBayesClassifier
nb = NaiveBayesClassifier.train(train_set)

where train_set should be a list of tuples of the form [(features, label)], where features is a dict of the form {feature_name: feature_value}.
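As a minimal sketch of how such a train_set could be built, the snippet below turns each word into a dict holding its length and its character bigrams, then trains the NLTK classifier on it. The helper name word_features, the feature names, and the tiny word list are all made up for illustration; real data would come from the spreadsheet.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(word):
    """Map a word to a feature dict: its length plus its character bigrams."""
    word = word.lower()
    features = {"length": len(word)}
    # Pair each character with its successor to get the bigrams.
    for first, second in zip(word, word[1:]):
        features["bigram(%s%s)" % (first, second)] = True
    return features


# Hypothetical toy data standing in for the (word, language) rows.
labeled_words = [
    ("the", "english"), ("house", "english"), ("water", "english"),
    ("thought", "english"), ("which", "english"),
    ("het", "dutch"), ("huis", "dutch"), ("zijn", "dutch"),
    ("schrijven", "dutch"), ("vrijheid", "dutch"),
]

# A list of (features, label) tuples, as the classifier expects.
train_set = [(word_features(word), lang) for word, lang in labeled_words]

nb = NaiveBayesClassifier.train(train_set)
print(nb.classify(word_features("school")))
```

With so few examples the prediction is not meaningful; the point is only the shape of the features and of train_set.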

answered Apr 10 '17 at 18:43

Hossein

This works, thanks! Do you also know how to implement this in scikit-learn? Or should these types of classification problems only be done with NLTK? – JrProgrammer Apr 10 '17 at 19:21
@JesseVermeulen You can also use scikit-learn. Look at [this link](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). – Hossein Apr 10 '17 at 19:35
Thanks for the help! – JrProgrammer Apr 10 '17 at 19:37
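
For the scikit-learn route mentioned in the comments, one possible sketch is to let CountVectorizer count character bigrams and feed those counts to MultinomialNB. This is only an illustration under assumptions: the word list is invented, and the pipeline layout is one reasonable choice rather than something prescribed in the thread.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the two spreadsheet columns.
words = ["the", "house", "water", "thought", "which",
         "het", "huis", "zijn", "schrijven", "vrijheid"]
labels = ["english"] * 5 + ["dutch"] * 5

# analyzer="char" with ngram_range=(2, 2) counts character bigrams per word.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    MultinomialNB(),
)
model.fit(words, labels)

print(model.predict(["school"]))
```

Unlike the NLTK version, the input here is the raw word strings; the vectorizer builds the bigram features itself.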
