
I tried to get a very simple scikit-learn OneVsRest classifier working, but I am running into a strange issue.

Here is the code:

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing

input_file = "small.csv"

df = pd.read_csv(input_file, sep=',', quotechar='"', encoding='utf-8')  

codes = df.loc[:, 'act_code1':'act_code33']

y = []

for index, row in codes.iterrows():
  row = row[np.logical_not(np.isnan(row))].astype(str)
  row = row.tolist()
  y.append(row)

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y)

classifier = Pipeline([
   ('vectorizer', CountVectorizer()),
   ('tfidf', TfidfTransformer()),
   ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(df['text'], Y)

predicted = classifier.predict(["BASIC SOCIAL SERVICES AID IN ARARATECA VALLEY"])

all_labels = lb.inverse_transform(predicted)

print(all_labels)

The contents of small.csv are here:

https://drive.google.com/file/d/0Bzt48lX3efsQTnYySFdaTlZhZGc/view?usp=sharing

When it attempts to classify, I get the following warning, and no classification happens:

UserWarning: indices array has non-integer dtype (float64)
  % self.indices.dtype.name)
[()]

However, if you remove the CSV line (line #6 of the file) that begins:

61821559,LEATHER PROJECT SKILLS TRAININ

the code works as it should and produces the correct classification output ([('15150.07',)]). You can also 'fix' this by removing the last line of the file. What is going on here?

EDIT: Just to make sure I communicated the problem correctly: this is a text label classification problem, not a numeric regression curve fit. The 'numbers' in the labels are meant to be treated as text strings (which they are). This is a multi label classification problem.
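To illustrate the intent, here is a minimal sketch (with made-up codes, not the actual small.csv data): MultiLabelBinarizer treats numeric-looking strings as opaque class labels, so string codes like '15150.07' are perfectly valid multi-label targets.

```python
# Minimal sketch (made-up codes, not small.csv): numeric-looking strings
# are valid multi-label class targets for MultiLabelBinarizer.
from sklearn.preprocessing import MultiLabelBinarizer

y = [['15150.07', '11105.01'], ['99810.01'], ['15150.07']]
lb = MultiLabelBinarizer()
Y = lb.fit_transform(y)

print(lb.classes_)  # the classes are the string codes themselves
print(Y)            # one indicator column per code
```

The binarizer sorts the distinct codes and emits one indicator column per code, so the labels are never interpreted numerically.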


1 Answer


The problem is the following part of your code:

y = []

for index, row in codes.iterrows():
  row = row[np.logical_not(np.isnan(row))].astype(str)
  row = row.tolist()
  y.append(row)

print(y)

[['12105.01', '15150.07', '15130.06', '11105.01', '16010.07', '16020.05'], ['99810.01'], ['11430.02', '15140.01'], ['16010.05', '15150.07'], ['32120.08', '32181.01', '16010.01'], ['99810.01'], ['72020.01'], ['72010.01']]

The numeric values of act_code are not the labels... the act_code column names themselves are. By the way, you are doing a classification task, right? If I understand you correctly, based on the text input, you are trying to classify it into one or more of act_code 1:33. If your true purpose is to predict some numeric value (the output ([('15150.07',)]) in your post really confuses me), then you have to reformulate your entire project, because it is then a regression problem rather than a classification problem.

You should instead use:

y = [row.index[row.notnull()].tolist() for _, row in y_codes.iterrows()]

[[u'act_code1', u'act_code2', u'act_code3', u'act_code4', u'act_code5', u'act_code6'], [u'act_code1'], [u'act_code1', u'act_code2'], [u'act_code1', u'act_code2'], [u'act_code1', u'act_code2', u'act_code3'], [u'act_code1'], [u'act_code1'], [u'act_code1']]

Full working code:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing
import pandas as pd

input_file = '/home/Jian/Downloads/small.csv'
df = pd.read_csv(input_file, sep=',', quotechar='"', encoding='utf-8')
y_codes = df.loc[:, 'act_code1':'act_code33']

# process your y-label
# ==============================
y = [row.index[row.notnull()].tolist() for _, row in y_codes.iterrows()]

lb = preprocessing.MultiLabelBinarizer()
Y = lb.fit_transform(y)

print(Y)

# standard text classification with multi-label classes
# ======================================================
# CountVectorizer + TfidfTransformer is equivalent to TfidfVectorizer
classifier = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))

X = df.text.values
# this gives a warning: "Label 0 is present in all training examples."
# that's fine here, since this is just a very small sample;
# in reality it's unlikely that all your observations belong to class 0
classifier.fit(X, Y)

y_pred = classifier.predict(["BASIC SOCIAL SERVICES AID IN ARARATECA VALLEY"])

all_labels = lb.inverse_transform(y_pred)

print(all_labels)

[(u'act_code1',)]
  • Thanks Jian, but the values in the act_code fields are the labels; it's not guaranteed that the same values will be in act_code1 consistently. act_code1 could be 99810.01 in the first row, then 71109.90 in the next row. Is there no way to make it work using the numeric values as labels? I don't want the classification to answer 'act_code1', but rather the numeric values. – Scott Stewart Jul 11 '15 at 22:35
  • @ScottStewart Classification treats each label as a categorical variable, and it assumes that you cannot compare different labels; for example, apple and orange are two labels, and we cannot say apple is better than orange. But within the apple label, different apples may have different levels of sweetness, so one apple could be better than another by comparing its `numeric` sweetness. Similar logic applies to your task: you first classify which `act_code` a particular `text` belongs to, and then run a regression within that label to predict the value. – Jianxun Li Jul 11 '15 at 22:42
  • @ScottStewart all the code in my post deals with the classification part. You need to add a further regression part if you want any numeric predictions. – Jianxun Li Jul 11 '15 at 22:44
  • My labels look like this: [['12105.01', '15150.07', '15130.06', '11105.01', '16010.07', '16020.05'], ['99810.01'], ['11430.02', '15140.01'], ['16010.05', '15150.07'], ['32120.08', '32181.01', '16010.01'], ['99810.01'], ['72020.01'], ['72010.01']]. Those are the categories, not the columns in which they occur. It is not a numeric prediction; those numbers are text labels. If you print y in my example, you will see they are text labels, since they are cast to str. – Scott Stewart Jul 11 '15 at 22:48
  • @ScottStewart I don't think these numeric values are labels... Let's wait for some answers from others. – Jianxun Li Jul 11 '15 at 22:54
  • OK, just FYI, this example is what I used as the basis of mine: http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories – Scott Stewart Jul 11 '15 at 23:06
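Editor's note, a hedged sketch rather than part of the accepted answer: if, as the comments above conclude, the cell values themselves are the intended labels, the non-null values of each row can be collected with pandas' dropna() instead of np.isnan (np.isnan does not handle object-dtype columns that contain stray string values). The DataFrame below is hypothetical illustration data, not small.csv.

```python
# Sketch with hypothetical data (not small.csv): extract the non-null cell
# values of each row as string labels, without relying on np.isnan.
import pandas as pd

codes = pd.DataFrame({
    'act_code1': ['15150.07', '99810.01'],
    'act_code2': ['11105.01', None],
})

# dropna() works regardless of dtype, unlike np.isnan on object columns
y = [row.dropna().astype(str).tolist() for _, row in codes.iterrows()]
print(y)  # [['15150.07', '11105.01'], ['99810.01']]
```

The resulting y can be fed to MultiLabelBinarizer exactly as in the answer's code, so only the label-extraction line changes.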