Let's say people email me with problems they are experiencing with a program. I would like to teach the machine to classify these emails into "issue type" classes based on the words used in each email.
I have created two CSV files which respectively contain:
- the word contents of each email
- the class each email would be labeled as
Here is an image showing the two CSV files
I'm attempting to feed these data into Scikit-Learn's SVC algorithm in Python 3. But, as far as I can tell, the CSV file with email contents can’t be directly passed into SVC; it seems to only accept floats.
I tried running the following code:
import pandas as pd
import os
from sklearn import svm
from pandas import DataFrame
data_file = "data.csv"
data_df = pd.read_csv(data_file, encoding='ISO-8859-1')
classes_file = "classes.csv"
classes_df = pd.read_csv(classes_file, encoding='ISO-8859-1')
X = data_df.values[:-1]  # training data
y = classes_df.values[:-1]  # training labels
# The SVM classifier requires the specific variables X and y:
# an array X of size [n_samples, n_features] holding the training samples,
# and an array y of class labels (strings or integers), size [n_samples]
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X, y)
When I run this, I receive a "ValueError" on the final line, stating "could not convert string to float", followed by the contents of the first email in the "data.csv" file. Do I need to convert these email contents to floats in order to feed them into the SVC algorithm? If so, how would I go about doing that?
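For reference, here is a minimal sketch that I believe reproduces the same failure without my CSV files. The two email texts, the labels, and the column names are made up for illustration; they only stand in for the contents of "data.csv" and "classes.csv".

```python
import pandas as pd
from sklearn import svm

# Hypothetical stand-ins for data.csv and classes.csv (not my real data).
data_df = pd.DataFrame({"email": ["the program crashes on startup",
                                  "i cannot log in to my account"]})
classes_df = pd.DataFrame({"issue_type": ["crash", "login"]})

X = data_df.values           # shape (2, 1), holding raw strings
y = classes_df.values.ravel()  # shape (2,), class labels as strings

clf = svm.SVC(gamma=0.001, C=100)

error = None
try:
    clf.fit(X, y)  # SVC tries to treat X as numeric and fails
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```

String labels in y seem to be accepted; it is only the raw text in X that triggers the error.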
I've been reading http://scikit-learn.org/stable/datasets/index.html#external-datasets, which states:
Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to integers, and integer categorical variables may be best exploited when encoded as one-hot variables
That led me to their documentation on preprocessing data, but I'm afraid I've become a bit lost as to where to go next. I'm not sure what exactly I need to do with my email contents to make them work with the SVC algorithm.
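My best guess so far is that free text calls for a bag-of-words style vectorizer rather than one-hot encoding, since each email is a document and not a single category. Below is a rough sketch of what I think that might look like, using scikit-learn's TfidfVectorizer chained with SVC in a Pipeline. The emails and labels are again invented placeholders, and I have not confirmed that this is the recommended approach for my data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical training emails and labels (placeholders for my CSV contents).
emails = [
    "the program crashes on startup",
    "application crashes when i open a file",
    "i cannot log in to my account",
    "login fails with wrong password error",
]
labels = ["crash", "crash", "login", "login"]

# TfidfVectorizer turns each email into a row of float word weights,
# which is the numeric [n_samples, n_features] input that SVC expects.
model = make_pipeline(TfidfVectorizer(), SVC(gamma=0.001, C=100))
model.fit(emails, labels)  # note: a flat list of strings, not a 2-D array

new_emails = ["it crashes every time", "my password does not work"]
predictions = model.predict(new_emails)
print(list(predictions))
```

Keeping the vectorizer and the classifier in one pipeline should mean that new emails are transformed with the same vocabulary learned during training.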
I'd greatly appreciate any insights anyone could offer on how to approach this problem.