
I have a large data set (approximately 35000 x 27). I am running sklearn's SVC with linear and polynomial kernels. My run times are sometimes 30 minutes or more. Is there a more efficient way to run my SVM?

I have tried removing unnecessary displays of data and trying different mixes of test and train sizes, but the duration is always about the same. The Gaussian ("RBF") kernel runs in about 6 minutes, but with much lower accuracy.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics

proteindata = pd.read_csv("data.csv")
print(np.any(np.isnan(proteindata)))  # check for NaNs

print(proteindata.shape)
print(proteindata.columns)
print(proteindata.head())

X = proteindata.drop("Class", axis=1)
y = proteindata["Class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

Classifier = svm.SVC(kernel='poly')
Classifier.fit(X_train, y_train)

y_pred = Classifier.predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

I am not getting any errors, apart from a warning telling me to set gamma manually.

  • Wait, are you regressing or classifying? FYI, an SVM is inherently a binary classifier. If you have multiple target values, sklearn trains a model for each pair of targets (one-vs-one). If you have 5 different categories, that's 10 different classifiers to train. That might explain it. – Nicolas Gervais Oct 31 '19 at 21:41
  • It's a 2-category classification, 1 or 0. – Meadowlion Oct 31 '19 at 23:46

1 Answer


Take a look at this answer, which covers the idea of using an ensemble of smaller trained models to decide on a best classifier. The idea is essentially to train over many smaller subsets of the data and aggregate the resulting models. The aggregate model still incorporates information from all of the data without ever training on all of it at once (though it won't be exactly equivalent to a single SVM fit on the full set). Since SVM training time scales at least quadratically with the number of samples, training on subsets of the data should be much quicker.
