I tried to get a very simple scikit-learn OneVsRest classifier working, but I am running into a strange issue.
Here is the code:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

input_file = "small.csv"
df = pd.read_csv(input_file, sep=',', quotechar='"', encoding='utf-8')
codes = df.loc[:, 'act_code1':'act_code33']
y = []
for index, row in codes.iterrows():
    row = row[np.logical_not(np.isnan(row))].astype(str)
    row = row.tolist()
    y.append(row)
lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(df['text'], Y)
predicted = classifier.predict(["BASIC SOCIAL SERVICES AID IN ARARATECA VALLEY"])
all_labels = lb.inverse_transform(predicted)
print(all_labels)
The contents of small.csv are here:
https://drive.google.com/file/d/0Bzt48lX3efsQTnYySFdaTlZhZGc/view?usp=sharing
When it attempts to classify, I get the following warning, and no label is predicted:
UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name)
[()]
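For what it's worth, the `[()]` on its own seems to mean only that no label was selected: as far as I understand it, `MultiLabelBinarizer.inverse_transform` returns an empty tuple for an indicator row that is all zeros. A minimal sketch of that behaviour, using made-up label lists (the code `11220.01` is hypothetical and not taken from small.csv):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists; only '15150.07' appears in the question.
lb = MultiLabelBinarizer()
lb.fit([["15150.07"], ["11220.01", "15150.07"]])

# One sample with no label switched on (an all-zero indicator row).
no_labels = np.zeros((1, len(lb.classes_)), dtype=int)
empty_result = lb.inverse_transform(no_labels)
print(empty_result)
```

So the empty tuple would be consistent with the classifier predicting zero for every label, which says nothing yet about why that happens.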
However, if I remove the line of small.csv that begins as follows (line #6):
61821559,LEATHER PROJECT SKILLS TRAININ
then the code works as it should and produces the correct classification output ([('15150.07',)]). Removing the last line of the file instead also 'fixes' it. What is going on here?
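One way to narrow this down might be to look at the raw per-label scores rather than only the final prediction. With `LinearSVC` inside `OneVsRestClassifier`, a label is only predicted when its decision score is positive, so a row of all-negative scores would come back as an empty result. The sketch below assumes a tiny made-up corpus (the texts and the labels `A`, `B`, `C` are hypothetical, not from small.csv) and the same pipeline layout as above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, only here to make the sketch runnable.
texts = [
    "social services aid",
    "leather skills training",
    "basic social services",
    "skills training project",
    "water supply project",
    "rural water supply",
]
labels = [["A"], ["B"], ["A"], ["B"], ["C"], ["C"]]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(labels)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
classifier.fit(texts, Y)

query = ["basic social services aid"]
scores = classifier.decision_function(query)  # one score per label
predicted = classifier.predict(query)         # 0/1 indicator row

# Print each label next to its score, then the decoded prediction.
for label, score in zip(lb.classes_, scores[0]):
    print(label, round(float(score), 3))
print(lb.inverse_transform(predicted))
```

Running the same two calls on the real data, with and without line #6, should at least show whether the scores themselves flip sign or whether something goes wrong earlier, when the labels are binarized.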
EDIT: Just to make sure I communicated the problem correctly: this is a text label classification problem, not a numeric regression curve fit. The 'numbers' in the labels are meant to be treated as text strings (which they are). This is a multi-label classification problem.
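To make the label handling concrete, this is roughly how the code columns are meant to end up as string labels. It is a sketch on a toy frame with two hypothetical columns standing in for `act_code1` to `act_code33` (the value `11220.01` is made up), and it uses `dropna()` in place of the `np.isnan` mask from the code above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the act_code columns of small.csv.
codes = pd.DataFrame({
    "act_code1": [15150.07, 11220.01],
    "act_code2": [np.nan, 15150.07],
})

# Drop the empty (NaN) cells in each row and keep the rest as strings.
y = [row.dropna().astype(str).tolist() for _, row in codes.iterrows()]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(y)

print(y)            # list of string labels per row
print(lb.classes_)  # the distinct string labels
print(Y)            # 0/1 indicator matrix, one column per label
```

Each row becomes a list of strings, and the binarizer turns those lists into one indicator column per distinct label, so the classifier never sees the codes as numbers.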