Processing Word Data For Input into Scikit-Learn's SVC Algorithm

Question

Let's say people email me with problems they are experiencing with a program. I would like to teach the machine to classify these emails into "issue type" classes based on the words used in each email.

I have created two CSV files which respectively contain:

the word contents of each email
the class each email would be labeled as

Here is an image showing the two CSV files

I'm attempting to feed these data into Scikit-Learn's SVC algorithm in Python 3. But, as far as I can tell, the CSV file with email contents can’t be directly passed into SVC; it seems to only accept floats.

I try to run the following code:

import pandas as pd 
import os 
from sklearn import svm 
from pandas import DataFrame 


data_file = "data.csv" 
data_df = pd.read_csv(data_file, encoding='ISO-8859-1')

classes_file = "classes.csv" 
classes_df = pd.read_csv(classes_file, encoding='ISO-8859-1')


X = data_df.values[:-1] #training data
y = classes_df.values[:-1] #training labels
     #The SVM classifier requires the specific variables X and y
         #an array X of size [n_samples, n_features] holding the training samples, 
         #and an array y of class labels (strings or integers), size [n_samples]

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X, y)

When I run this, I receive a "ValueError" on the final line, stating "could not convert string to float", followed by the contents of the first email in the "data.csv" file. Do I need to convert these email contents to floats in order to feed them into the SVC algorithm? If so, how would I go about doing that?

I've been reading at http://scikit-learn.org/stable/datasets/index.html#external-datasets and it states

Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to integers, and integer categorical variables may be best exploited when encoded as one-hot variables

Which then leads me to their documentation on PreProcessing Data, but I'm afraid I've become a bit lost as to where to go next. I'm not entirely sure what, exactly, I need to do with my email contents in order for it to work with the SVC algorithm.

I'd greatly appreciate any insights anyone could offer on how to approach this problem.

seralouk · Accepted Answer · 2017-08-01T19:57:15.573

Yes, you need to encode the categorical features first and then pass the encoded values to SVC.

You can use DictVectorizer for the data_df features and then LabelEncoder for the classes_df.
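To make the effect of these two encoders concrete, here is a minimal sketch on made-up rows. The toy emails and labels are assumptions for illustration only; "Description" is the column name that appears later in this answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for data.csv (not the real data)
rows = [
    {"Description": "cannot log in"},
    {"Description": "app crashes on start"},
    {"Description": "cannot log in"},
]
# Hypothetical labels standing in for classes.csv
labels = ["login", "crash", "login"]

# Each distinct string value becomes one 0/1 column named "Description=<text>"
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)
feature_names = sorted(vec.vocabulary_)

# Each distinct class string becomes an integer (classes are sorted alphabetically)
enc = LabelEncoder()
y = enc.fit_transform(labels)

print(feature_names)
print(X)
print(y)
```

Note that identical strings map to the same column, so rows 1 and 3 get the same encoding here.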

This is the sample data that I used : https://www.dropbox.com/sh/kne5wopgzeuah0u/AABKTuc3_1czzI0hIpZWPkLwa?dl=0

Using your exact same data the following works fine:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn import preprocessing
from sklearn import svm 

data_file = "data.csv" 
data_df = pd.read_csv(data_file, encoding='ISO-8859-1')

classes_file = "classes.csv" 
classes_df = pd.read_csv(classes_file, encoding='ISO-8859-1')

# label encoding
lab_enc = preprocessing.LabelEncoder()
labels_new = lab_enc.fit_transform(classes_df) 

# vectorize training data
# Series.items() is used here because iteritems() was removed in pandas 2.0
train_as_dicts = [dict(r.items()) for _, r in data_df.iterrows()]
train_new = DictVectorizer(sparse=False).fit_transform(train_as_dicts)

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(train_new, labels_new)

This works fine.
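One caveat: DictVectorizer treats each complete email string as a single categorical value, so two emails share a feature only when their text is identical. A different technique, not the one used in this answer, is a bag-of-words representation such as TF-IDF, which builds features from the individual words. A minimal sketch, assuming made-up emails and labels and a linear kernel chosen purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical emails and issue types (not the asker's data)
emails = [
    "cannot log in password rejected",
    "password reset link does not work",
    "app crashes on start",
    "crash when opening the app",
]
issue_types = ["login", "login", "crash", "crash"]

# TfidfVectorizer turns each email into a vector of per-word weights;
# SVC then trains on those numeric vectors. String labels are accepted as-is.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=100))
model.fit(emails, issue_types)

# Classify a new, unseen email
predicted = model.predict(["the app keeps crashing"])
print(predicted)
```

With this representation, a new email that merely shares words with the training emails can still be classified, which is closer to the "based on the words used" goal in the question.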

Hope this helps

EDIT

I used the following text, found on the internet, as a feature in data.csv.

The following is the first element of the Description column.

But shortly after that first report, it was shown the initial statement was misleading. The Times reported that Trump Jr. accepted the meeting in hopes that it would yield damaging information on Hillary Clinton, and Trump Jr. said it had not. After the Times obtained an email chain showing an acquaintance, Rob Goldstone, offered Trump Jr. a meeting where he could obtain information as part of a Russian government effort to help his father's campaign, Trump Jr. posted the emails online.But shortly after that first report, it was shown the initial statement was misleading. The Times reported that Trump Jr. accepted the meeting in hopes that it would yield damaging information on Hillary Clinton, and Trump Jr. said it had not. After the Times obtained an email chain showing an acquaintance, Rob Goldstone, offered Trump Jr. a meeting where he could obtain information as part of a Russian government effort to help his father's campaign, Trump Jr. posted the emails online.

The length is:

len(data_df['Description'][0])

982

The code worked fine again.

EDIT 2

I am using:

sklearn.__version__
'0.18.2'

pandas.__version__
u'0.20.3'

edited Aug 01 '17 at 19:57

answered Aug 01 '17 at 15:35


Thanks for the quick response! When I run this code, it now states "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')." I assume the value may be too large, since the contents of any given email can be fairly lengthy. I'm not sure how I would go about diagnosing exactly what the problem is in this case. – Rudy Aug 01 '17 at 16:32
First, can you tell me which versions of pandas and sklearn you are using? Also, yes, very long strings could be a problem. I added a link to the sample data that I used, which worked with this code. Can you upload all your data? – seralouk Aug 01 '17 at 17:01
sklearn version 0.18.1 and pandas version 0.20.1. Unfortunately I wouldn't be able to upload all of my data, since it contains sensitive information. I did strip down my data set to just the first 10 samples, with shortened email contents. This appeared to work, but did give one warning: site-packages\sklearn\preprocessing\label.py:129: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True) – Rudy Aug 01 '17 at 18:09
I don't know if this will solve the problem but can you upgrade to sklearn 0.18.2 ? Do not worry about the warning – seralouk Aug 01 '17 at 18:10
Also try to run the code from the terminal instead of jupyter – seralouk Aug 01 '17 at 18:28
I'm attempting to update sklearn using conda right now, but it seems to be throwing HTTP errors: _conda update scikit-learn Fetching package metadata ... CondaHTTPError: HTTP None None for url _ – Rudy Aug 01 '17 at 18:29
I'm running the "conda" update command in a command prompt in Windows. Screenshot here: http://i.imgur.com/4qUjuEF.png – Rudy Aug 01 '17 at 18:40
If you run your code inside the jupyter then you have to do this https://stackoverflow.com/a/41778267/5025009 – seralouk Aug 01 '17 at 18:42
When I try to run the conda update command in Jupyter, I receive this error: http://i.imgur.com/xcPsAvo.png – Rudy Aug 01 '17 at 18:48
I haven't used jupyter. The best case would be to run the code from the terminal. So just update sklearn from terminal. So 1) open the terminal as admin 2) no need to cd inside anaconda, when the terminal opens type: conda and see if it recognizes the command – seralouk Aug 01 '17 at 18:52
Yep, it recognizes the command. I presume it's because Anaconda and its Library/bin and Scripts directories are part of my Windows PATH variable. The update command is currently throwing an HTTP error when I attempt to update sklearn: http://i.imgur.com/9cSJoUz.png – Rudy Aug 01 '17 at 18:56
Whoops! Looks like my work's internal network was preventing me from being able to access the repositories. I've just updated conda and scikit-learn. – Rudy Aug 01 '17 at 19:08
ohh. could not have guessed that ! let me know when you run my code. – seralouk Aug 01 '17 at 19:19
I'm afraid I'm still getting the same error, "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')" when running the code. – Rudy Aug 01 '17 at 19:21
Okay so now we are sure that the length of the string is huge. Cannot see a way to solve this.. maybe use the first 5 lines for each email ? However, if the answer helped you can upvote or accept it. Finally, can you post only one email so that I can verify that it does not work on my system ? – seralouk Aug 01 '17 at 19:23
Thanks for the input! I'll start brainstorming ways to deal with the length issue. The longest email contents appears to be 1070 characters long. Unfortunately I can't share the contents of the actual email, though. – Rudy Aug 01 '17 at 19:36
@Rudy the thing is that I just used a 982 length text as input and it worked fine – seralouk Aug 01 '17 at 19:46
Hmmm. I'm not entirely sure what the problem might be, in that case. I haven't leafed through every email to look for odd characters, but could it possibly be some kind of special character throwing off the parser, or something like that? – Rudy Aug 01 '17 at 21:59
@Rudy if you have spaces between paragraphs, maybe yes. If you use something like the paragraph that I added in my answer, it should work fine. See also the versions that I use. Try to update pandas as well. Finally, if the answer is useful you can upvote/accept it – seralouk Aug 01 '17 at 22:09
Ah. Thanks. I've upvoted the answer but I don't think it shows since I don't have 10 reputation yet. Also I'm leafing through some of the cells in the CSV file and they appear to be empty; that might be causing them to appear as NaN when parsed. I'll clean up the CSV file and I believe your code should suit my purposes. Thanks a bunch for the help! – Rudy Aug 01 '17 at 22:12
I just got through cleaning up the CSV file, and the code ran without the NaN error. Thanks a ton! It is still giving the "DataConversionWarning" though. I'm not entirely sure what that is about; I'm still new to Python so I'm still learning the data types and such. – Rudy Aug 01 '17 at 22:51
Great news. Do not worry about the warning. You can avoid it if you reshape the labels variable, but there is no need – seralouk Aug 01 '17 at 22:52
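The two fixes that came out of the comments (empty CSV cells being parsed as NaN, and the DataConversionWarning about a column-vector y) can be sketched roughly as follows. The small DataFrames and the "Class" column name are hypothetical stand-ins, since the real data was not shared.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for data.csv and classes.csv
data_df = pd.DataFrame({"Description": ["cannot log in", np.nan, "app crashes"]})
classes_df = pd.DataFrame({"Class": ["login", "crash", "crash"]})

# read_csv parses empty cells as NaN by default; flag rows containing any NaN
has_nan = data_df.isnull().any(axis=1)
print("rows with empty cells:", int(has_nan.sum()))

# Drop the same rows from both frames so samples and labels stay aligned
clean_data = data_df[~has_nan]
clean_classes = classes_df[~has_nan]

# Flatten the (n_samples, 1) label column to shape (n_samples,),
# which is the shape the warning message asks for
y = clean_classes.values.ravel()
print(y.shape)
```

Passing the flattened y to LabelEncoder or SVC should avoid the warning, though as noted above it is harmless either way.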
