
I am able to do some simple machine learning with the scikit-learn and NLTK modules in Python. But I run into problems when training with multiple features that have different value types (number, list of strings, yes/no, etc.). In the data below, I extract information from the word/phrase column and create the other feature columns (for example, the length column is the character length of 'word/phrase'). The Label column is the label.

Word/phrase   Length  2-letter substring                                First letter  With space?  Label
take action   10      ['ta', 'ak', 'ke', 'ac', 'ct', 'ti', 'io', 'on']  t             Yes          A
sure          4       ['su', 'ur', 're']                                s             No           A
That wasn't   10      ['th', 'ha', 'at', 'wa', 'as', 'sn', 'nt']        t             Yes          B
simply        6       ['si', 'im', 'mp', 'pl', 'ly']                    s             No           C
a lot of      6       ['lo', 'ot', 'of']                                a             Yes          D
said          4       ['sa', 'ai', 'id']                                s             No           B

Should I make each row into one dictionary and then use sklearn's DictVectorizer to hold them in working memory? And then treat these features as a single X matrix when training the ML algorithms?

KubiK888
  • Did you solve this problem? I have a very similar one and am still trying to find a solution. I also wanted to convert my features into a dictionary – dumbchild Oct 14 '20 at 09:26

1 Answer


The majority of machine learning algorithms work with numbers, so you need to transform your categorical and string values into numbers.

The popular Python machine-learning library scikit-learn has a whole chapter dedicated to preprocessing data. With 'yes/no' everything is easy - just put 0/1 instead of it.
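A minimal sketch of the yes/no replacement, using the 'With space?' column from the question as an example:

```python
# Map a Yes/No column to 1/0 before feeding it to a learner.
with_space = ["Yes", "No", "Yes", "No", "Yes", "No"]
encoded = [1 if v == "Yes" else 0 for v in with_space]
print(encoded)  # [1, 0, 1, 0, 1, 0]
```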

Among many other important things, it explains how to preprocess categorical data using their OneHotEncoder.
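For a single-valued categorical column like 'First letter', the encoding could look like this (a sketch assuming a modern scikit-learn, which accepts string categories directly):

```python
from sklearn.preprocessing import OneHotEncoder

# One row per sample; values taken from the 'First letter' column above.
first_letters = [["t"], ["s"], ["t"], ["s"], ["a"], ["s"]]
enc = OneHotEncoder()
X = enc.fit_transform(first_letters).toarray()
print(X.shape)  # (6, 3): one binary column per distinct letter
```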

When you work with text, you also have to transform your data in a suitable way. One common feature extraction strategy for text is the tf-idf score, and I wrote a tutorial here.
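As one possible illustration (not from the original answer): scikit-learn's TfidfVectorizer can compute tf-idf scores over character bigrams, which roughly corresponds to the 2-letter substring column in the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

phrases = ["take action", "sure", "That wasn't", "simply", "a lot of", "said"]

# char_wb with ngram_range=(2, 2) extracts 2-letter substrings within
# word boundaries and weights them by tf-idf.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 2))
X = vec.fit_transform(phrases)
print(X.shape)  # one row per phrase, one column per distinct bigram
```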

Salvador Dali
  • I have actually read the scikit-learn tutorial, but I am still a bit confused about how to deal with my data. In the tutorial examples I have done with the OneHotEncoder, each row contains a single textual category value per column, but in my substring column each row contains multiple textual entities. Can I still use the OneHotEncoder to transform these substrings? – KubiK888 Sep 21 '15 at 03:25
  • And since most of the tutorial examples I have gone through only deal with one or two features, I would like to know: after I transform and standardize all the features, do I group them into a dictionary and treat it as a single X vector before plugging it into the training algorithm? Is creating this big dictionary efficient, or does it slow down the computation? – KubiK888 Sep 21 '15 at 03:28
  • @KubiK888 Yes, you can use OneHotEncoder. But the real answer heavily depends on what you want to do with your data afterwards. For example, sometimes you can get along without any transformation. Take a look at [this example](http://stackoverflow.com/a/32646528/1090562) that uses `nltk.NaiveBayesClassifier`. I recommend you take a step back from this question. Decide what you want to do with your data, try to do it using scikit-learn, and if you have problems, write a new question explaining what you want to do, what you have done, and what you do not like. – Salvador Dali Sep 21 '15 at 03:33
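One possible way to handle the multi-valued substring column raised in the comments above (my own suggestion, not from the original answer) is scikit-learn's MultiLabelBinarizer, which turns a list of sets of values into one binary column per distinct value:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each row is the list of 2-letter substrings for one word/phrase.
substrings = [
    ["ta", "ak", "ke", "ac", "ct", "ti", "io", "on"],
    ["su", "ur", "re"],
    ["sa", "ai", "id"],
]

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(substrings)
print(X.shape)  # (3, 14): one binary column per distinct substring
```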