
I have an Excel sheet with 2 columns:

  1. Words
  2. Language

There is only one word on each row, and each word is labeled with exactly one language.

How would I turn those words and languages into data that a machine learning model can accept?

I'm using scikit-learn and thought about bag of words, but it seemed to me that simply assigning an index to every word wouldn't convey the characteristics of each word.

JrProgrammer

1 Answer


From your question, I take it you are asking how to extract features from words in order to train a classifier that determines each word's language. The length of the word and the character bigrams in the word (pairs of adjacent characters) are good features to start with. Take a look at this post for extracting character bigrams. The NLTK classifiers may also be a good fit. For example,

    from nltk.classify import NaiveBayesClassifier
    nb = NaiveBayesClassifier.train(train_set)

where train_set should be a list of (features, label) tuples: features is a dict of the form {feature_name: feature_value}, and label is the word's language.
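To make the shape of train_set concrete, here is a minimal sketch that builds it from word length and character bigrams. The helper `word_features`, the feature names, and the tiny `rows` list are all hypothetical stand-ins; in practice the rows would come from the two spreadsheet columns.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(word):
    """Map a word to a feature dict: its length plus its character bigrams."""
    w = word.lower()
    features = {"length": len(w)}
    # One boolean "present" feature per pair of adjacent characters.
    for a, b in zip(w, w[1:]):
        features["bigram(%s%s)" % (a, b)] = True
    return features


# Hypothetical (word, language) rows standing in for the spreadsheet.
rows = [
    ("the", "english"), ("that", "english"),
    ("with", "english"), ("think", "english"),
    ("der", "german"), ("ich", "german"),
    ("nicht", "german"), ("schon", "german"),
]

# A list of (features, label) tuples, as described above.
train_set = [(word_features(word), language) for word, language in rows]
nb = NaiveBayesClassifier.train(train_set)

# Classify an unseen word; it needs the same feature extraction.
print(nb.classify(word_features("thing")))
```

With only eight training words the prediction is not meaningful; the point is the data format. A real run would use the full sheet and hold out some rows for evaluation.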

Hossein
  • This works, thanks! Do you also know how to implement this in scikit-learn? Or should these types of classification problems only be done with NLTK? – JrProgrammer Apr 10 '17 at 19:21
  • @JesseVermeulen You can also use scikit-learn. Look at [this link](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). – Hossein Apr 10 '17 at 19:35
  • Thanks for the help! – JrProgrammer Apr 10 '17 at 19:37
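
For the scikit-learn route raised in the comments, one possible sketch pairs MultinomialNB with a CountVectorizer that counts character bigrams. This particular pipeline is an assumption rather than something spelled out in the thread, and the `words` and `languages` lists are hypothetical stand-ins for the two spreadsheet columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical data standing in for the two spreadsheet columns.
words = ["the", "that", "with", "think", "der", "ich", "nicht", "schon"]
languages = ["english"] * 4 + ["german"] * 4

# analyzer="char" with ngram_range=(2, 2) turns each word into counts of
# its character bigrams, which MultinomialNB can consume directly.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 2)),
    MultinomialNB(),
)
model.fit(words, languages)

# Predict the language of unseen words.
print(model.predict(["thing", "licht"]))
```

Unlike a word-level bag of words, the character n-gram counts capture spelling patterns inside each word, which is what the original question was after.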