
I have 9164 points, where 4303 are labeled as the class I want to predict and 4861 are labeled as not that class. There are no duplicate points.

Following How to split into train, test and evaluation sets in sklearn?, and since each point in my dataset is a tuple of 3 items (id, vector, label), I do:

df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

train_labels = construct_labels(train)
train_data = construct_data(train)

test_labels = construct_labels(test)
test_data = construct_data(test)

def predict_labels(test_data, classifier):
    labels = []
    for test_d in test_data:
        labels.append(classifier.predict([test_d]))
    return np.array(labels)
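(Side note: SVC.predict also accepts a 2-D array of samples, so the per-row loop above is not strictly necessary; a minimal equivalent sketch, which returns a flat 1-D array of labels instead of one-element arrays:)

def predict_labels(test_data, classifier):
    # predict() classifies the whole test set in a single call
    return classifier.predict(test_data)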

def construct_labels(df):
    labels = []
    for index, row in df.iterrows():
        if row[2] == 'Trump':
            labels.append('Trump')
        else:
            labels.append('Not Trump')
    return np.array(labels)

def construct_data(df):
    # seed the array with the first row's vector, then append every other
    # row's vector (rows are identified by the id in column 0)
    first_row = df.iloc[0]
    data = np.array([first_row[1]])
    for index, row in df.iterrows():
        if first_row[0] != row[0]:
            data = np.concatenate((data, np.array([row[1]])), axis=0)
    return data
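For reference, and assuming column 1 of the DataFrame holds the feature vector and column 2 the label, the two helpers above can be written without explicit loops; a rough equivalent sketch:

def construct_labels(df):
    # map the label column straight to the two class names
    return np.where(df[2] == 'Trump', 'Trump', 'Not Trump')

def construct_data(df):
    # stack the per-row feature vectors into one 2-D array
    return np.vstack(df[1].values)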

and then:

>>> classifier = SVC(verbose=True)
>>> classifier.fit(train_data, train_labels)
[LibSVM].......*..*
optimization finished, #iter = 9565
obj = -2718.376533, rho = 0.132062
nSV = 5497, nBSV = 2550
Total nSV = 5497
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=True)
>>> predicted_labels = predict_labels(test_data, classifier)
>>> correct = 0
>>> for p, t in zip(predicted_labels, test_labels):
...     if p == t:
...             correct = correct + 1

and I get only 943 correct labels out of 1833 (= len(test_labels)), i.e. 943*100/1833 = 51.4% accuracy.
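(The same number can also be computed by scikit-learn instead of counting by hand; a minimal sketch, assuming predicted_labels is a flat 1-D array of labels, e.g. from classifier.predict(test_data):)

from sklearn.metrics import accuracy_score

# fraction of test points whose predicted label matches the true one
print(accuracy_score(test_labels, predicted_labels))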


I suspect I am missing something big here. Maybe I should set some parameter on the classifier so that it does more refined work?

Note: This is my first time using SVMs, so anything you might take for granted, I might not even have imagined...


Attempt:

I went ahead and decreased the number of negative examples to 4303 (the same number as the positive examples). This slightly improved accuracy.
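(An alternative to discarding negative examples, which the answer below also uses, is to let the SVM re-weight the classes itself; a minimal sketch:)

from sklearn.svm import SVC

# 'balanced' adjusts the per-class weight of C inversely to the class frequencies,
# so no training points have to be thrown away
classifier = SVC(class_weight='balanced')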


Edit after the answer:

>>> print(clf.best_estimator_)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> classifier = SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
...   decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
...   max_iter=-1, probability=False, random_state=None, shrinking=True,
...   tol=0.001, verbose=False)
>>> classifier.fit(train_data, train_labels)
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.0001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Also I tried clf.fit(train_data, train_labels), which performed the same.


Edit with data (the data are not random):

>>> train_data[0]
array([  20.21062112,   27.924016  ,  137.13815308,  130.97432804,
        ... # there are 256 coordinates in total
         67.76352596,   56.67798138,  104.89566517,   10.02616417])
>>> train_labels[0]
'Not Trump'
>>> train_labels[1]
'Trump'
  • SVMs need parameter-tuning which is very important (especially nonlinear kernels). You don't seem to tune these. It's also very important to standardize your data (mean and variance). Use scikit-learns *GridSearchCV* to automatically tune these with cross-validation. – sascha Sep 17 '16 at 20:20
  • @sascha could you please provide an example or something more? I am really a newbie here! And what you say sounds really right! – gsamaras Sep 17 '16 at 20:24
  • Just read scikit-learns [user-guide](http://scikit-learn.org/stable/modules/svm.html). These are very elemental steps and i'm puzzled why people are using such a theoretically complex concept like SVMs without even reading about the basic usage-rules. [Heres a GridSearch example](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html) which also shows how important parameter-tuning is -> accuracies between ~ [0.2, 0.95] – sascha Sep 17 '16 at 20:29
  • @sascha I read the documentation of SVMs before starting experimenting, but I didn't find what you say, sorry! Anyway, I just applied the answer, and got 49.8% accuracy, with my attempt, that's not better at all....If you have any input on that, please let me know. – gsamaras Sep 17 '16 at 21:03
  • If you did that, then show the code. It's very very likely (~100%) that trying other parameters will improve the score. You should also read some basic ML-course. – sascha Sep 17 '16 at 21:06
  • Yes @sascha good idea, I was about to do so... – gsamaras Sep 17 '16 at 21:09
  • Your edit also shows that you are not used to using sklearn. There are very good user-guides. Your edit would look much different if you had read one. Something different: I don't know your data and your data-processing is kind of ugly. You should show the first rows of train_data and train_labels. It's possible that you did a huge mistake there too. – sascha Sep 17 '16 at 21:13
  • Have you tried a simpler model (e.g. Naive Bayes) on your data? That would be a lot faster and would give you some lower bound to beat with a more complex model. Otherwise you might not even know whether the weeks and weeks of parameter tuning on the SVM are actually paying off. – tttthomasssss Sep 17 '16 at 21:21
  • If the data is not random and the preprocessing of the data is correct, not improving the score with grid-search has a chance of approximately zero percent. He should check his inputs to the classifiers and he also did not show the usage of gridsearch, only some output. While SVM might not be the best approach here (we don't know as we don't know the data), the chance he is doing something very wrong is high. – sascha Sep 17 '16 at 21:24
  • @tttthomasssss no I haven't, the goal is to do SVMs only..sascha what do you mean? I told you to expand with an example on how to use that, and you didn't post an answer with that. How should I do it? Please post an answer if you think that how I did it is wrong, which I hope it is! – gsamaras Sep 17 '16 at 21:28
  • I'm not doing your work :-). I only give hints. I told you about some **very very basic rules** including standardization of data. Your output of train_data rows show, that you ignored that. I also expected someone with such a high SO-score to know how to ask questions. The ```Minimal, Complete, and Verifiable example``` should ring a bell. This is especially important in ML, as it's always about the data. Why showing you how to tune some digit-recognization example (one of sklearns basic datasets) when your data is unlearnable (which is theoretically possible). – sascha Sep 17 '16 at 21:36
  • @sascha this is the closest I could get to a minimal example, thank you. Also, please notice that I didn't ask you to do my work, I asked for help. I posted an answer with what I did eventually, would you like to take a look please, and point me to anything silly I may have done? :/ – gsamaras Sep 20 '16 at 00:47
  • Yes, i did only see the comment, not your answer. I already deleted my comment. You seem to get a nice accuracy now. I guarantee you, that the most important step was **standardizing** which i told you about in the first comment. I don't know why you calculate your score by hand. There should be a valid loss available in sklearn. – sascha Sep 20 '16 at 00:56
  • @sascha thank you for taking the time to look into this, and for all the great comments so far. So, it is fair that I `scaler.fit_transform(test_data)`, right? Because, without it, accuracy would drop dramatically. – gsamaras Sep 20 '16 at 00:58
  • Yes. Like with NNs, PCA and some other stuff, standardizing is the most important step for some classifiers (the SVM user-guide in sklearn also talks about it). Sometimes not doing it hurts the model (svm), sometimes it hurts the learning-process (NNs). But please take these sentences with a grain of salt. One cosmetic improvement: gridsearchCV automatically keeps the best classifer internally. You don't need to copy/imitate the parameters. Maybe you saw that already, but some earlier edit did not look like it. Look at the attribute ```best_estimator_```. – sascha Sep 20 '16 at 01:01

2 Answers


Most estimators in scikit-learn, such as SVC, are initialized with a number of input parameters, also known as hyperparameters. Depending on your data, you will have to figure out what to pass as inputs to the estimator during initialization. If you look at the SVC documentation in scikit-learn, you will see that it can be initialized with several different input parameters.
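For instance, an initialization that sets three of them explicitly could look like this (the particular values are only placeholders):

from sklearn.svm import SVC

# kernel, C and gamma are the hyperparameters most often tuned for SVC
classifier = SVC(kernel='rbf', C=10, gamma=0.001)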

For simplicity, let's consider kernel, which can be 'rbf' or 'linear' (among a few other choices), and C, which is a penalty parameter; say you want to try the values 0.01, 0.1, 1, 10, 100 for C. That leads to 10 different possible models to create and evaluate.

One simple solution is to write two nested for loops, one for kernel and the other for C, create the 10 possible models, and see which one is best (a sketch of this manual approach is shown below). However, if you have several hyperparameters to tune, you would have to write several nested for loops, which quickly gets tedious.
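A rough sketch of what that manual search could look like (the use of cross_val_score here is only illustrative; it lives in sklearn.model_selection in recent releases and sklearn.cross_validation in older ones):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

best_score, best_params = 0, None
for kernel in ['rbf', 'linear']:
    for C in [0.01, 0.1, 1, 10, 100]:
        # average cross-validated accuracy for this parameter combination
        score = cross_val_score(SVC(kernel=kernel, C=C), train_data, train_labels).mean()
        if score > best_score:
            best_score, best_params = score, {'kernel': kernel, 'C': C}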

Luckily, scikit-learn has a better way to create different models based on different combinations of hyperparameter values and to choose the best one. For that, you use GridSearchCV. GridSearchCV is initialized with two things: an instance of an estimator, and a dictionary of hyperparameters with the candidate values to examine. It then builds and evaluates all possible models given those choices and finds the best one, so you do not need to write any nested for loops. Here is an example:

from sklearn.grid_search import GridSearchCV  # from sklearn.model_selection import GridSearchCV in newer scikit-learn releases
print("Fitting the classifier to the training set")
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'kernel': ['rbf', 'linear']}
clf = GridSearchCV(SVC(class_weight='balanced'), param_grid)
clf = clf.fit(train_data, train_labels)
print("Best estimator found by grid search:")
print(clf.best_estimator_)
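Once fitted, the grid search object keeps the best (refitted) model internally, so you can predict with it directly instead of copying the parameters by hand; a minimal sketch:

# clf delegates predict() to the best estimator found during the search
predicted_labels = clf.predict(test_data)
# or access the refitted best model explicitly
best_model = clf.best_estimator_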

You will need to use something similar to this example, and play with different hyperparameters. If you have a good variety of values for your hyperparameters, there is a very good chance you will find a much better model this way.

It is, however, possible for GridSearchCV to take a very long time to build all of these models and find the best one. A more practical approach is to use RandomizedSearchCV instead, which samples a random subset of the possible hyperparameter combinations. It should run much faster if you have a lot of hyperparameters, and its best model is usually still quite good.
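A rough sketch of the randomized variant (the distributions and n_iter below are only illustrative):

from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search in older releases
from scipy.stats import expon

# sample 20 random (C, gamma) combinations instead of trying a full grid
param_distributions = {'C': expon(scale=100), 'gamma': expon(scale=0.01), 'kernel': ['rbf']}
clf = RandomizedSearchCV(SVC(class_weight='balanced'), param_distributions, n_iter=20)
clf = clf.fit(train_data, train_labels)
print(clf.best_estimator_)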

  • The answer is lacking some explanations but it should help the OP. Just some remarks: GridSearchCV does not try all possible models, but the cartesian product (ordered combinations) of the parameter candidates defined by shahins here. Also: C and gamma are the most important params when using SVM + rbf-kernel, so these are tuned here (there are more). Another remark: the concept of cross-validation is independent from SVMs. This is something which should be read about too (there are also many parameters/concepts). – sascha Sep 17 '16 at 20:39
  • shahins, what are you fitting there? Didn't you mean `clf.fit(train_data, train_labels)`? I just applied the answer, and got 49.8% accuracy, with my attempt, that's not better at all.. :/ Any other thought? I also think that I might be doing something silly still, check my updated question please! :) – gsamaras Sep 17 '16 at 21:10
  • I added more explanation in my answer. I hope it helps. – happyhuman Sep 17 '16 at 22:03

After the comments of sascha and the answer of shahins, I did this eventually:

df = pd.DataFrame(dataset)
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

train_labels = construct_labels(train)
train_data = construct_data(train)

test_labels = construct_labels(test)
test_data = construct_data(test)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)

from sklearn.svm import SVC
# Classifier found with shahins' answer
classifier = SVC(C=10, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
classifier = classifier.fit(train_data, train_labels)

# note: the more common convention is scaler.transform(test_data), i.e.
# re-using the scaler already fitted on the training data
test_data = scaler.fit_transform(test_data)
predicted_labels = predict_labels(test_data, classifier)

and got:

>>> correct_labels = count_correct_labels(predicted_labels, test_labels)
>>> print_stats(correct_labels, len(test_labels))
Correct labels = 1624
Accuracy = 88.5979268958

with these methods:

def count_correct_labels(predicted_labels, test_labels):
    correct = 0
    for p, t in zip(predicted_labels, test_labels):
        if p[0] == t:
            correct = correct + 1
    return correct

def print_stats(correct_labels, len_test_labels):
    print("Correct labels = " + str(correct_labels))
    print("Accuracy = " + str(correct_labels * 100 / float(len_test_labels)))

I was able to improve the accuracy further with more hyperparameter tuning!

Helpful link: RBF SVM parameters


Note: If I don't transform the test_data, accuracy is 52.7%.
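A tidier way to guarantee that the scaler is fitted on the training data only and then re-used on the test data is to chain the two steps in a Pipeline; a minimal sketch of the equivalent structure (not the exact code that produced the numbers above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# the pipeline fits the scaler on the training data and applies the same
# (already fitted) transformation automatically at prediction time
model = make_pipeline(StandardScaler(), SVC(C=10, class_weight='balanced', kernel='rbf'))
model = model.fit(train_data, train_labels)
predicted_labels = model.predict(test_data)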

  • Thank you @shahins, your help was precious! I also plotted the [confusion matrix](http://stackoverflow.com/questions/19233771/sklearn-plot-confusion-matrix-with-labels). Hopefully my answer will help a bit too the future user, despite its *0* score! :) – gsamaras Sep 21 '16 at 00:40