Loading a Dataset for Linear SVM Classification from a CSV file

Question

I have a CSV file called train.csv:

   25.3, 12.4, 2.35, 4.89, 1, 2.35, 5.65, 7, 6.24, 5.52, M
   20, 15.34, 8.55, 12.43, 23.5, 3, 7.6, 8.11, 4.23, 9.56, B
   4.5, 2.5, 2, 5, 10, 15, 20.25, 43, 9.55, 10.34, B
   1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, M

I am trying to separate this dataset into samples and classifications like this (this is the output I want):

    [[25.3, 12.4, 2.35, 4.89. 1, 2.35, 5.65, 7, 6.24, 5.52], 
    [20, 15.34, 8.55, 12.43, 23.5, 3, 7.6, 8.11, 4.23, 9.56], 
    [4.5, 2.5, 2, 5, 10, 15, 20.25, 43, 9.55, 10.34], 
    [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]], 
    [M, B, B, M]

The nested list starting with "[[" is x (the sample data) and "[M, B, B, M]" is y (the classification that matches each row of data).

I am trying to write Python code that loads the file and prints the data separated into the samples and their classifications. It's related to linear SVM.

y_list = []
x_list = []
for W in range(0, 100):
    X = data_train.readline()
    y = X.split(",")
    y_list.append(y[10][0])
    print(y_list)
    z_list = []
    for Z in range(0, 10):
        z_list.append(y[Z])
    x_list.append(z_list)
    dataSet = (x_list, y_list)
    print(dataSet)

Note: I know my range is completely wrong. I'm not sure how to choose the range for this type of example. Could anyone please explain how the range would work in this situation?

Note: I know the append line with "y[10][0]" is wrong as well. Could someone explain how these indexes work?

Overall I want the output to be the output I stated above. Thanks for the help.

As far as I understand, you want to predict the classification based on the other data, right? Have you looked at the `train_test_split` function from `sklearn.model_selection`? — Snusifer, Nov 20 '19 at 22:17

score 3 · Accepted Answer · answered Nov 20 '19 at 22:27

First, I think you have an error in your CSV in the first row:

25.3, 12.4, 2.35, 4.89. 1, 2.35, 5.65, 7, 6.24, 5.52, M

I just assumed it should be 4.89, 1, and not 4.89. 1.

Second, I recommend using pandas to read the CSV, and then doing this:

import pandas as pd

# header=None: the file has no header row, so the columns are numbered 0..10.
# skipinitialspace=True drops the space after each comma (labels become "M", not " M").
data = pd.read_csv('train.csv', header=None, usecols=list(range(11)),
                   skipinitialspace=True)
feature_cols = list(range(10))  # columns 0-9 hold the measurements
X_train = data[feature_cols]
y_train = data[10]              # column 10 holds the class label

This is the easiest way to get your data ready for any machine learning algorithm in scikit-learn.
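For the range and index questions raised in the post, here is a minimal sketch of the same split without pandas. It assumes every row has ten numeric fields followed by one label; the helper name `load_rows` is made up for illustration. Looping over the lines directly removes the need for a fixed range, and index 10 is the eleventh field, i.e. the label. In the question's code, `y[10][0]` takes the first character of that field, which is a space because each comma is followed by one.

```python
import io


def load_rows(lines):
    """Split comma-separated lines into (features, labels)."""
    x_list, y_list = [], []
    for line in lines:          # one iteration per line, no fixed range needed
        line = line.strip()
        if not line:            # skip blank lines
            continue
        fields = [f.strip() for f in line.split(",")]
        x_list.append([float(v) for v in fields[:10]])  # indexes 0..9: the numbers
        y_list.append(fields[10])                       # index 10: the label
    return x_list, y_list


# The four rows from the question, held in memory in place of train.csv.
sample = io.StringIO(
    "25.3, 12.4, 2.35, 4.89, 1, 2.35, 5.65, 7, 6.24, 5.52, M\n"
    "20, 15.34, 8.55, 12.43, 23.5, 3, 7.6, 8.11, 4.23, 9.56, B\n"
    "4.5, 2.5, 2, 5, 10, 15, 20.25, 43, 9.55, 10.34, B\n"
    "1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, M\n"
)
x_list, y_list = load_rows(sample)
print(x_list)
print(y_list)
```

With a real file, the same helper could be called on an open file object, for example inside `with open("train.csv") as f:`.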

Holy moly, I think I can use that. Actually, for the problem I'm working on I need to use KFold from sklearn.model_selection. Would that work with pandas? — user20304030, Nov 20 '19 at 22:55
Yes, of course. X_train and y_train in the example above are ready to be used with KFold — victor noriega, Nov 20 '19 at 23:00
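A minimal sketch of how a pandas split like the one in this answer might be fed to KFold. The four rows are the question's sample data read from memory, and `n_splits=2` and `random_state=0` are arbitrary choices made only because the sample is so small.

```python
import io

import pandas as pd
from sklearn.model_selection import KFold

# The question's four rows, held in memory in place of train.csv.
csv_text = (
    "25.3, 12.4, 2.35, 4.89, 1, 2.35, 5.65, 7, 6.24, 5.52, M\n"
    "20, 15.34, 8.55, 12.43, 23.5, 3, 7.6, 8.11, 4.23, 9.56, B\n"
    "4.5, 2.5, 2, 5, 10, 15, 20.25, 43, 9.55, 10.34, B\n"
    "1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5, M\n"
)
data = pd.read_csv(io.StringIO(csv_text), header=None, skipinitialspace=True)
X = data[list(range(10))]   # feature columns 0-9
y = data[10]                # label column

kf = KFold(n_splits=2, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # KFold yields positional row indexes, so select rows with .iloc
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    fold_sizes.append((len(X_train), len(X_test)))
    print(len(X_train), len(X_test))
```

Inside the loop, a classifier would be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test` for each fold.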

score 0 · Answer 2 · answered Nov 20 '19 at 22:25

import pandas as pd

df = pd.read_csv('/path/to/csv', header=None, index_col=False)
x = df.iloc[:, :-1].values  # every column except the last: the features
y = df.iloc[:, -1].values   # the last column as a 1-D array: the labels

Answered by loki

score 0 · Answer 3 · answered Nov 20 '19 at 22:30

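I think you should use pandas, which is a library that helps you with reading CSV files:

import pandas as pd

# header=None: the file has no header row
dataset = pd.read_csv('train.csv', header=None)
X = dataset.iloc[:, :-1]  # feature columns
y = dataset.iloc[:, -1]   # class labels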

Second, you can use train_test_split to automatically split the data:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, test_size=0.2)

This splits the data so that X_train and y_train hold 80% of the rows and X_test and y_test hold the remaining 20%. You can change this by adjusting test_size. stratify=y keeps the ratio of class counts (M, B) roughly the same in the train and test sets, which is generally considered good practice in machine learning. The split is random each time; if you want the same split on every run, pass random_state=(SEED) as a keyword argument.
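A small sketch of what stratify and random_state do, using made-up data (15 samples labelled M and 5 labelled B, not the question's file). The seed value 42 is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with two features each, 15 "M" and 5 "B".
X = [[i, i * 2] for i in range(20)]
y = ["M"] * 15 + ["B"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))   # sizes of the 80% and 20% parts
print(Counter(y_train))            # class counts in the training part
print(Counter(y_test))             # class counts in the test part
```

Because the classes are in a 3:1 ratio here, both parts keep that ratio, and rerunning with the same random_state reproduces the same split.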

After that you can continue on with the machine learning:

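from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Important to scale: fit the scaler on the training data only,
# then apply the same transformation to the test data
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# SVC defaults to an RBF kernel; kernel='linear' gives the linear SVM the question asks about
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))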
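The scaling and fitting steps above can also be bundled into a scikit-learn pipeline, so that the scaler is refit on each training fold when cross-validating. This is only a sketch: the data comes from make_classification rather than the question's file, and the sample count, fold count, and seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 100 samples with 10 features, two classes.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# The pipeline scales first, then fits a linear-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)  # one accuracy value per fold
print(scores)
print(scores.mean())
```

Keeping the scaler inside the pipeline means each test fold is transformed with statistics computed from its own training fold only.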

Could KFold be applied instead of train_test_split with pandas? That's what I need to use for the problem — user20304030, Nov 20 '19 at 22:57
Nvm, I found a good answer for you: https://stackoverflow.com/questions/45115964/separate-pandas-dataframe-using-sklearns-kfold — Snusifer, Nov 20 '19 at 23:31
