Collecting Machine learning training data

Question

I am very new to machine learning and need a couple of things clarified. I am trying to predict the probability of someone liking an activity based on their Facebook likes. I am using the Naive Bayes classifier, but am unsure about a couple of things:

1. What would my labels/inputs be?
2. What info do I need to collect for training data?

My guess is to create a survey with questions on whether the person would enjoy an activity (on a scale from 1 to 10).

score 2 · Accepted Answer · answered Feb 08 '17 at 13:20

In supervised classification, every classifier must be trained on data whose labels are already known; this is the training data. Each row of your data should be a vector of features followed by a special value called the class. In your problem, the class is whether or not the person enjoyed the activity.

Once you have trained the classifier, you should test its behavior on a separate dataset so that the evaluation is not biased. This test dataset must include the class, just like the training data. If you train and test on the same dataset, your classifier's predictions may look really good, but the result is unfair: it does not tell you how the classifier performs on data it has not seen.
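
As a rough illustration of holding out a test set, here is a minimal sketch in plain Python. The helper name `split_rows`, the 25% test fraction, and the example rows are assumptions made for illustration; they are not part of the original answer.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle the labeled rows and hold out a fraction of them for testing."""
    shuffled = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled rows: survey answers followed by the class (0 or 1).
rows = [
    [2, 2, 7, 10, 0, 1],
    [9, 1, 3, 4, 5, 0],
    [6, 8, 7, 2, 1, 1],
    [1, 1, 2, 3, 9, 0],
]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 3 1
```

The classifier would be fitted on `train_rows` only and scored on `test_rows`, so no row is used for both.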

I suggest taking a look at evaluation techniques such as K-fold cross-validation.
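
To make the idea concrete, here is a small sketch of how K-fold cross-validation partitions the data. The function `k_fold_indices` is a hypothetical helper written for this example; in practice you would usually shuffle the rows first or use a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    # Train on train_idx, evaluate on test_idx, then average the k scores.
    print(test_idx)
```

Every row ends up in the test fold exactly once, so each prediction that is scored comes from a model that never saw that row during training.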

Another thing you should know is that the common Naïve Bayes setup predicts a binary class, so your class should be 0 or 1, meaning that the person you surveyed did not or did enjoy the activity. (Naïve Bayes can also handle more than two classes, but a binary class is the simplest fit for this problem.) It is implemented in packages like Weka (Java) and scikit-learn (Python).

If you are really interested in Bayesian classifiers, I should say that Naïve Bayes is in fact not the best choice for binary classification: Minsky showed in 1961 that its decision boundaries are hyperplanes. Its Brier score also tends to be poor, and the classifier is said to be poorly calibrated. Still, it makes good predictions after all.
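
For reference, the Brier score is the mean squared difference between the predicted probability and the actual 0/1 outcome, so lower is better. The sketch below uses made-up numbers (not results from any real classifier) to show how two sets of predictions with the same accuracy can have different Brier scores.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y_true = [1, 0, 1, 0]

# Both sets of (hypothetical) predictions pick the same class at a 0.5
# threshold, so their accuracy is identical: 3 out of 4 correct.
overconfident = [0.99, 0.99, 0.99, 0.01]   # probabilities pushed toward 0 and 1
moderate = [0.70, 0.60, 0.70, 0.30]

print(brier_score(y_true, overconfident))
print(brier_score(y_true, moderate))
```

The overconfident predictions score worse because the one wrong answer was given with near certainty, which is the kind of behavior a poorly calibrated classifier shows.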

Hope it helps.

What would the features be? I'm still a little confused on how the data will be structured. — joethemow, Feb 09 '17 at 01:53
If you look at one case, formally called an *individual*, it should look like: 2,2,7,10,0,**1**. This means the answer to the first question is 2, the answer to the second question is 2, and so on. The last number (in bold) is the *class* feature; since it is 1, the person is satisfied with the activity. Note that you are not using just one instance; you have a matrix in which each row corresponds to an instance. — ancalotoru, Feb 09 '17 at 09:42
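
The layout described in this comment can be sketched in Python as follows. Only the first row comes from the comment; the other two rows are invented so that the matrix has more than one instance.

```python
# Each row is one survey respondent (one instance).
# The first five numbers are the survey answers (the features);
# the last number is the class: 1 = satisfied with the activity, 0 = not.
rows = [
    [2, 2, 7, 10, 0, 1],   # the instance from the comment
    [9, 1, 3, 4, 5, 0],    # made-up instance
    [6, 8, 7, 2, 1, 1],    # made-up instance
]

# Split every row into its feature vector and its class label.
X = [row[:-1] for row in rows]
y = [row[-1] for row in rows]

print(X[0])  # [2, 2, 7, 10, 0]
print(y)     # [1, 0, 1]
```

Most libraries expect exactly this split: a feature matrix `X` with one row per instance and a separate list `y` of class labels.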

score 0 · Answer 2 · answered Feb 08 '17 at 01:11

This may be fairly difficult with Naive Bayes. You'll need to collect (or calculate) samples of whether or not a person likes activity X, and also details on their Facebook likes (organized in some consistent way).
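
One possible way to organize the likes consistently is to turn each person's set of liked pages into a fixed-length vector of 0s and 1s. This is only a sketch: the page names, the labels, and the `encode` helper are all hypothetical.

```python
# Hypothetical data: each person's liked pages and whether they liked activity X.
people = [
    ({"Trail Running", "Camping Gear"}, 1),
    ({"Video Games", "Movie Nights"}, 0),
    ({"Camping Gear", "Movie Nights"}, 1),
]

# A fixed, sorted vocabulary so every person is encoded in the same column order.
vocabulary = sorted(set().union(*(likes for likes, _ in people)))

def encode(likes):
    """Return one 0/1 entry per page in the vocabulary."""
    return [1 if page in likes else 0 for page in vocabulary]

X = [encode(likes) for likes, _ in people]
y = [label for _, label in people]

print(vocabulary)
print(X)
```

New people must be encoded with the same `vocabulary`, otherwise the columns no longer mean the same thing at prediction time.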

Basically, for Naive Bayes, your training data should have the same format as your testing data: the same features, encoded the same way.

The survey approach may work, if you have access to each person's Facebook like history.

answered by igoldthwaite

Is there another classifier that could make this easier? – joethemow Feb 08 '17 at 01:15
Ideally, say I did have access to the person's Facebook likes; I'm still a little confused about how to set up the training input stage – joethemow Feb 08 '17 at 01:16
I would look into understanding Bayes' theorem / Bayes' rule to get a solid grasp of how to train from your data. http://stackoverflow.com/a/20556654/7531811 does a great job outlining this! – igoldthwaite Feb 08 '17 at 03:43
A strong understanding of conditional probability and Bayes' rule, which underlie Naive Bayes, is definitely important to understanding how to train and test using this method. – igoldthwaite Feb 08 '17 at 03:45
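
To show how Bayes' rule turns into training and prediction, here is a minimal from-scratch sketch of Naive Bayes for binary (0/1) features. It assumes Laplace smoothing with `alpha=1.0` and uses made-up data; the function names are hypothetical, and a library implementation would be preferable for real work.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(class) and P(feature = 1 | class) with Laplace smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Smoothed fraction of class-c rows in which each feature equals 1.
        likelihood = [(sum(col) + alpha) / (len(rows) + 2 * alpha) for col in zip(*rows)]
        model[c] = (prior, likelihood)
    return model

def predict_proba(model, x):
    """Bayes' rule: P(class | x) is proportional to P(class) * product of P(x_i | class)."""
    log_scores = {}
    for c, (prior, likelihood) in model.items():
        log_p = math.log(prior)
        for xi, p in zip(x, likelihood):
            log_p += math.log(p if xi == 1 else 1.0 - p)
        log_scores[c] = log_p
    # Normalize so that the class probabilities sum to 1.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical training data: each column is a binary "likes page k" feature.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
y = [1, 1, 0, 0]

model = train_bernoulli_nb(X, y)
probs = predict_proba(model, [1, 0, 0])
print(probs)  # class 1 comes out at about 0.75 for this toy data
```

Training is just counting: how often each class occurs and how often each feature is 1 within each class. Prediction multiplies those estimates together and normalizes, which gives the probability the question asks for.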
