
I have a dataset as follows:

X_data =

BankNum  | ID
00987772 | AB123
00987772 | AB123
00987772 | AB123
00987772 | ED245
00982123 | GH564

And another one as:

y_data =

ID    | Labels
AB123 | High
ED245 | Low
GH564 | Low

I'm doing the following:

from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np

clf = svm.SVC(gamma=0.001, C=100., probability=True)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=42)
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)

But how do I transform X_data to float before calling clf.fit()? Can I use DictVectorizer in this case, and if so, how do I use it?

Also, I'm passing X_data and y_data through train_test_split to measure the prediction accuracy, but will it split them correctly? That is, will it take the correct label from y_data for each ID in X_data?

UPDATE:

Can someone please tell me if I'm doing the following correctly?

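import pandas as pd
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Attach each row's label by merging on the shared ID column
new_df = pd.merge(df, df3, on="ID")
columns = ['BankNum', 'ID']
labels = new_df['Labels']
le = LabelEncoder()
labels = le.fit_transform(labels)
X_train, X_test, y_train, y_test = train_test_split(new_df[columns], labels, test_size=0.25, random_state=42)
# Assign the result instead of using inplace=True on a slice of new_df
X_train = X_train.fillna('NA')
X_test = X_test.fillna('NA')
# One dict per row, so DictVectorizer can one-hot encode the string columns
x_cat_train = X_train.to_dict(orient='records')
x_cat_test = X_test.to_dict(orient='records')
vectorizer = DictVectorizer(sparse=False)
# Fit on the training rows only, then reuse the same mapping for the test rows
vec_x_cat_train = vectorizer.fit_transform(x_cat_train)
vec_x_cat_test = vectorizer.transform(x_cat_test)
x_train = vec_x_cat_train
x_test = vec_x_cat_test
clf = svm.SVC(gamma=0.001, C=100., probability=True)
clf.fit(x_train, y_train)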
Espoir Murhabazi
Xavier
  • Are `X_data` and `y_data` dataframes? Do they come from a file? You can parse them as float when you read them the first time. – Antimony Sep 12 '17 at 23:16
  • @Antimony yes they are dataframes. I'm getting X_data from a database. – Xavier Sep 12 '17 at 23:22
  • How exactly do you want to represent the features as float btw? Seems like `ID` is not of type float. Also seems that `X_data`'s first 3 rows just repeat the same thing. – Antimony Sep 12 '17 at 23:26
  • Please see the UPDATE in my question. As for the first 3 rows of X_data being duplicates: yes, because I'm extracting only certain columns from the database, and the other columns have different values for those rows. But could we merge the two dataframes together and then use only df['Labels'] in place of y_data? – Xavier Sep 12 '17 at 23:37
  • Yes, merging is a good suggestion. You will then need to transform the labels to float using a technique such as LabelBinarizer or LabelEncoder; find more about that in [this](https://stackoverflow.com/a/45365714/4683950) answer. I'm not sure that DictVectorizer is exactly what you need. – Espoir Murhabazi Sep 13 '17 at 05:49
  • No, the question here is not how to transform the labels, which I can do with LabelEncoder, but how to transform X_train, which is a data frame of the BankNum and ID columns. – Xavier Sep 13 '17 at 05:52
  • @EspoirMurhabazi – Xavier Sep 13 '17 at 05:53
  • What is the type of the BankNum column? Are the values ints that you need to transform to float? – Espoir Murhabazi Sep 13 '17 at 06:06
  • @EspoirMurhabazi The ID column is a string, and BankNum should be a string as well. Can you see my UPDATE in the question and tell me if I'm doing it correctly? – Xavier Sep 13 '17 at 06:26

1 Answer


My suggestion, based on what we discussed in the comments, is to first merge the x_data and y_data datasets on the ID column:

dataset = pd.merge(left=x_data, right=y_data, on='index')
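To make the effect of the merge concrete, here is a minimal sketch using the sample rows from the question. It assumes the shared key column is called "ID" as in the question (my code above calls it index), and it uses a left merge, which is my choice for illustration, so that every row of X_data is kept. After the merge each row carries its own label, so features and labels have the same number of rows and stay aligned when they are split; the original X_data (5 rows) and y_data (3 rows) could not be passed to train_test_split together because their lengths differ.

```python
import pandas as pd

# Toy frames copied from the question; "ID" is the shared key column.
X_data = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982123"],
    "ID": ["AB123", "AB123", "AB123", "ED245", "GH564"],
})
y_data = pd.DataFrame({
    "ID": ["AB123", "ED245", "GH564"],
    "Labels": ["High", "Low", "Low"],
})

# A left merge keeps every row of X_data and attaches the label for its ID.
merged = pd.merge(X_data, y_data, on="ID", how="left")

print(merged)
print(len(X_data), len(y_data), len(merged))
```

If some IDs in X_data had no entry in y_data, the left merge would leave NaN in Labels for those rows, so it is worth checking merged["Labels"].isna().sum() before training.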

You can then convert the Bank_Num column to float using astype:

# Only works if every value is purely numeric; np.float128 is not available on all platforms
dataset['Bank_Num'] = dataset.Bank_Num.astype(np.float128)

NB (update): LabelEncoder also works for Bank_Num if it contains plain string values:

from sklearn.preprocessing import LabelEncoder
dataset['Bank_Num'] = LabelEncoder().fit_transform(dataset.Bank_Num)

Encode the ID column with LabelEncoder to get an integer representation of it:

from sklearn.preprocessing import LabelEncoder, LabelBinarizer
le = LabelEncoder()
# Use bracket access: dataset.index is the DataFrame's row index, not the 'index' column
dataset['index'] = le.fit_transform(dataset['index'])
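As a small illustration of what the encoding produces, here is a sketch on made-up rows. The value "123-645-272" is hypothetical and only mimics the dashed bank numbers mentioned in the comments, and keeping one encoder per column in a dict named encoders is my own choice so that each mapping can be inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; "123-645-272" imitates a bank number that cannot be cast to float.
dataset = pd.DataFrame({
    "Bank_Num": ["00987772", "123-645-272", "00987772", "00982123"],
    "index": ["AB123", "ED245", "AB123", "GH564"],
})

# One encoder per column, so every integer code can be mapped back to its string.
encoders = {}
for col in ["Bank_Num", "index"]:
    encoders[col] = LabelEncoder()
    dataset[col] = encoders[col].fit_transform(dataset[col])

print(dataset)
print(encoders["Bank_Num"].classes_)

# Recover the original strings from the integer codes.
original_bank = encoders["Bank_Num"].inverse_transform(dataset["Bank_Num"])
print(list(original_bank))
```

Keep in mind that the integer codes follow the sorted order of the strings, so they suggest an ordering and a distance between IDs that does not really exist. An SVM treats them as ordinary numbers; one-hot encoding, as DictVectorizer does in the question, avoids that at the cost of many more columns.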

Then encode the y label using LabelBinarizer:

lb = LabelBinarizer()
dataset['label'] = lb.fit_transform(dataset.label)

Now you have a full dataset of ints and floats that your SVC can work with, but first you need to split it.

It is a good idea to keep the test set smaller than the training set, so a test_size below 0.5 is preferable. You can find more about training set and test set sizes here.

Like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset[['index','Bank_Num']], dataset.label, test_size=0.25, random_state=42)

With this you can now train your classifier without any problems:

clf.fit(X_train, y_train)
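Putting the steps together, here is a minimal end-to-end sketch. The eight rows are invented for illustration, the labels are encoded with LabelEncoder rather than LabelBinarizer, and stratify=y is my addition so that both classes are guaranteed to appear in such a tiny training set. I also left out probability=True to keep the toy example small.

```python
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical, already merged data using the column names from this answer.
dataset = pd.DataFrame({
    "index": ["AB123", "AB123", "ED245", "GH564",
              "AB123", "ED245", "GH564", "AB123"],
    "Bank_Num": ["00987772", "00987772", "00987772", "00982123",
                 "00987772", "00987772", "00982123", "00987772"],
    "label": ["High", "High", "Low", "Low", "High", "Low", "Low", "High"],
})

# Turn the string feature columns into integer codes.
for col in ["index", "Bank_Num"]:
    dataset[col] = LabelEncoder().fit_transform(dataset[col])

# Encode the target as 0/1 (High -> 0, Low -> 1, in sorted order).
y = LabelEncoder().fit_transform(dataset["label"])

# Features and labels come from the same frame, so rows stay aligned in the split.
X_train, X_test, y_train, y_test = train_test_split(
    dataset[["index", "Bank_Num"]], y,
    test_size=0.25, random_state=42, stratify=y,
)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print(X_train.shape, X_test.shape, predicted.shape)
```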

NB: in my code, index is equivalent to your ID.

Let me know if this helps and how I can improve my answer.

Espoir Murhabazi
  • The test size can be in (0, 1.0). Your comment about test_size is incorrect. – piman314 Sep 13 '17 at 08:56
  • Thanks, I edited it. Check and approve if I'm correct; otherwise tell me what else I can edit. – Espoir Murhabazi Sep 13 '17 at 09:09
  • Yeah, that's more reasonable. I just didn't like that someone new to ML might get confused by your previous version. – piman314 Sep 13 '17 at 09:48
  • @EspoirMurhabazi Great! But is there something wrong in the code I put in the question above? – Xavier Sep 13 '17 at 18:00
  • Yes, I can see your update uses a test_size of 0.75; let me run it again and check. – Espoir Murhabazi Sep 13 '17 at 18:02
  • @EspoirMurhabazi Also I'm getting error: ValueError: invalid literal for long double on dataset['Bank_Num'] = dataset.Bank_Num.astype(np.float128) line. – Xavier Sep 13 '17 at 18:17
  • @EspoirMurhabazi Sorry that was meant to be 0.25 instead of 0.75, that's a typo. I get the above error when I run your code. When I run mine, I keep getting Memory error before vec_x_cat_train = vectorizer.fit_transform( x_cat_train ) line. – Xavier Sep 13 '17 at 18:18
  • @EspoirMurhabazi Yes there seems to be a bank num like 123-645-272 something like this. – Xavier Sep 13 '17 at 18:25
  • Another alternative is to remove the '-' in the bank number and convert it to float; that may be a good idea. – Espoir Murhabazi Sep 13 '17 at 18:29
  • I am still getting memory error. :( I am using r4.4xlarge Deep Learning AMI EC2 instance. How much bigger do I need :( There are about 600,000 rows in the dataset – Xavier Sep 13 '17 at 18:31
  • @EspoirMurhabazi I've put the trace back of the error in the question. – Xavier Sep 13 '17 at 18:50
  • Sorry, I caused some confusion: you should use label_encoder instead of label_binarizer. Check my edit. – Espoir Murhabazi Sep 13 '17 at 19:39
  • @EspoirMurhabazi Thank you! This seems to work :) However, could you also tell me if instead of Bank Number, I want to use a column "Name", do I use Label Encoder for it too? – Xavier Sep 13 '17 at 20:45
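
On the MemoryError mentioned in the comments: one plausible cause (an assumption, since the traceback is not shown here) is that DictVectorizer(sparse=False) builds a dense one-hot matrix with one column per distinct BankNum and ID value, which grows very quickly with about 600,000 rows. The sketch below, on made-up records, shows the sparse alternative: DictVectorizer returns a SciPy sparse matrix by default, and svm.SVC accepts it directly.

```python
import scipy.sparse
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records in the shape produced by X_train.to_dict(orient='records').
records = [
    {"BankNum": "00987772", "ID": "AB123"},
    {"BankNum": "00987772", "ID": "ED245"},
    {"BankNum": "00982123", "ID": "GH564"},
    {"BankNum": "00987772", "ID": "AB123"},
]
labels = [0, 1, 1, 0]

# sparse=True is the default: only the non-zero entries are stored.
vectorizer = DictVectorizer(sparse=True)
X_sparse = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)
print(X_sparse.shape, X_sparse.nnz, scipy.sparse.issparse(X_sparse))

# SVC can be fitted on the sparse matrix without converting it to a dense array.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_sparse, labels)
```

Each row stores only two non-zero entries here, however many distinct values the columns contain. Training time of an SVC on several hundred thousand rows is a separate concern that this sketch does not address.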