-2

I'm trying to make a prediction on my own input data. I'm using the Python scikit-learn library and Isolation Forest as the algorithm. I do not know what I am doing wrong, but when I want to transform my input data I always get an error. I get an error on this line:

    input_par = encoder.transform(val)#ERROR

This is the error: `Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.`

And I have tried this, but I always get an error:

    input_par = encoder.transform([val])#ERROR

This is the error: `ValueError: Specifying the columns using strings is only supported for pandas DataFrames`

What am I doing wrong, and how can I fix this error? Also, should I use OneHotEncoder, LabelEncoder, or CountVectorizer?

This is my code:

    import pandas as pd

    from sklearn.ensemble import IsolationForest
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder

    textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
    num_data = [4, 1, 3, 2, 65, 3, 3]

    df = pd.DataFrame({'my text': textual_data,
                       'num data': num_data})
    x = df

    # Transform the features
    encoder = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['my text'])], remainder='passthrough')
    #encoder = ColumnTransformer(transformers=[('lab', LabelEncoder(), ['my text'])])

    x = encoder.fit_transform(x)

    isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
    model = isolation_forest.fit(x)

    list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

    for val in list_of_val:

        input_par = encoder.transform(val)#ERROR

        outlier = model.predict(input_par)
        #print(outlier)

        if outlier[0] == -1:
            print('Values', val, 'are outliers')

        else:
            print('Values', val, 'are not outliers')

EDIT:

I have also tried this:

    list_of_val = [['good work',2], ['you are wrong',54], ['this was amazing',1]]

    for val in list_of_val:

        input_par = encoder.transform(pd.DataFrame({'my text': val[0],
                                                    'num data': val[1]}))

But I get this error:

    ValueError: If using all scalar values, you must pass an index
taga
  • Post the error also. – abheet22 Oct 03 '19 at 09:41
  • I have updated the question and added the errors. – taga Oct 03 '19 at 09:44
  • The above-mentioned code simply one-hot encodes the sentences. Are you sure you want to encode the sentences, or do you want to encode the tokens contained in the sentences? – abheet22 Oct 03 '19 at 13:35
  • I want to find outliers, to check whether my input text is an outlier or not. Is it possible to do this with text data? Also, what should I use for encoding? – taga Oct 03 '19 at 13:47
  • So I think your problem statement is: based on the context of the sentence, you want to find the outlier. How did you determine that a -1 prediction is an outlier? – abheet22 Oct 03 '19 at 13:55
  • I'm learning about that, do you have any advice? Also, can you please help me fix the error on this line: `input_par = encoder.transform(val)#ERROR` – taga Oct 03 '19 at 14:11
  • Have you considered printing out the variable `val` right before the error happens? It seems to me that it is one-dimensional instead of two-dimensional. – Joseph Budin Oct 07 '19 at 07:45
  • I have done that; read the question, I have written what I tried and what errors I get. – taga Oct 07 '19 at 09:24

5 Answers

3

I will try to make a list of observations that you may find useful:

  • LabelEncoder can be used, for example, to transform non-numerical data into numerical labels; in scikit-learn it is intended for the target labels (the classes of a supervised learning problem). OneHotEncoder takes numerical or non-numerical categorical data and converts it into, well, one-hot encodings, and it is the one meant for input features (see the small sketch at the end of this answer).
  • As I understand it, you are trying to predict outliers (anomaly detection). It is not clear to me whether the connection between the utterances and the integers is only hardcoded or whether you want to generate this kind of connection somehow. If the latter, you cannot achieve it with the encoders as used here, because you fit them on some data and then try to transform new, unrelated data (`ValueError: y contains previously unseen labels`). However, this can be fixed by setting the `handle_unknown` parameter of OneHotEncoder to `'ignore'` (from the documentation: "Whether to raise an error or ignore if an unknown categorical feature is present during transform"). Even if you can achieve what you want with one of these encoders, keep in mind that this is not their main purpose.
  • I assume you are giving a high value to "negative" utterances (even if "wrong" does not correspond to 65 in your training data) and a small value to "positive" ones. If you assume you already know the integer for every utterance, you can train the model on what are considered "positive" examples and feed it "negative" examples (outliers) only at test time. You don't train an IsolationForest on both "positive" and "negative" examples - that would just be basic binary classification, which could be modelled with a Decision Tree, for example. An intuitive example of IsolationForest can be seen here. Below is the code for your problem:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', ...]
    integer_connection = [1, 1, 2, 3, 2, 2, 3, 1, 3, 4, 1, 2, 1, 2, 1, 2, 1, 1]
    integer_connection = np.array([[n] for n in integer_connection])

    isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
    isolation_forest.fit(integer_connection)

    list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]

    text_vals = [d[0] for d in list_of_val]
    numeric_vals = np.array([[d[1]] for d in list_of_val])

    print(integer_connection, numeric_vals)

    outliers = isolation_forest.predict(numeric_vals)
    print(outliers)
    
  • In general, I don't think your approach to outlier prediction for natural-language utterances is right. For what you are trying to do in this specific example, I can recommend using word vector similarity from, for example, spaCy, or maybe a simple bag-of-words approach.

  • If you don't care about any of these points and you only want working code, here is my version of what you are trying to do:

    import numpy as np
    
    from sklearn.ensemble import IsolationForest
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder
    
    
    textual_data = ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']
    
    
    encodings = {}
    
    num_data = [4, 1, 3, 2, 65, 3, 3]
    
    
    onehot_encoder = OneHotEncoder(handle_unknown='ignore')
    onehots = onehot_encoder.fit_transform(np.array([[utt, no] for utt, no in zip(textual_data, num_data)]))
    
    for i, l in enumerate(onehots):
        original_label = (textual_data[i], num_data[i])
        encodings[original_label] = l
    
    print(encodings)
    
    isolation_forest = IsolationForest(contamination = 'auto', behaviour = 'new')
    model = isolation_forest.fit(onehots)
    
    list_of_val = [['good work', 2], ['you are wrong', 54], ['this was amazing', 1]]
    
    
    test_encoded = onehot_encoder.transform(np.array(list_of_val))
    print(test_encoded)
    
    outliers = isolation_forest.predict(test_encoded)
    print(outliers)
    
    for i, outlier in enumerate(outliers):
        if outlier == -1:
            print('Values', list_of_val[i], 'are outliers')
    
        else:
            print('Values', list_of_val[i], 'are not outliers')
    
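  • To make the first point about the two encoders concrete, here is a small sketch on a toy column (not the question's data) showing what each one produces; with `handle_unknown='ignore'`, unseen categories simply become an all-zero row at transform time:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    # Toy categorical column (not the question's data).
    colors = np.array(['red', 'green', 'red', 'blue'])

    # LabelEncoder: one integer per distinct value, meant for target labels.
    label_enc = LabelEncoder()
    print(label_enc.fit_transform(colors))               # [2 1 2 0]

    # OneHotEncoder: one indicator column per category, meant for input features.
    # It expects a 2-D array, hence the reshape(-1, 1).
    onehot_enc = OneHotEncoder(handle_unknown='ignore')
    onehot = onehot_enc.fit_transform(colors.reshape(-1, 1))
    print(onehot.toarray())                              # 4 rows x 3 columns of 0/1

    # An unseen category becomes an all-zero row instead of raising an error.
    print(onehot_enc.transform([['purple']]).toarray())  # [[0. 0. 0.]]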
Sorin Dragan
1

Are you sure that what you are doing makes sense? Your OneHotEncoder() encodes your categorical variable ('my text') using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. Think of it as a mapping between your labels and a numeric representation.

In your textual_data you have 7 different labels: ['i love you', 'I love your dress', 'i like that', 'thats good', 'amazing', 'wrong', 'hi, how are you, are you doing good']. Each of these will be encoded. This happens during your:

    >>> x = encoder.fit_transform(x)
    >>> print(x)
    <7x8 sparse matrix of type '<class 'numpy.float64'>'
        with 14 stored elements in Compressed Sparse Row format>

Here your encoder creates a mapping for all 7 labels.
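If you want to inspect that mapping, the fitted ColumnTransformer exposes it; a small sketch reusing the `encoder` from the question (`named_transformers_` and `categories_` are standard scikit-learn attributes):

    # Peek at the categories the OneHotEncoder learned during fit_transform;
    # there is one array per encoded column.
    print(encoder.named_transformers_['onehot'].categories_)
    # -> a single array with the 7 distinct 'my text' values, in sorted order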

When you continue with your script and want to use that same encoder to transform a new label, it fails:

    >>> to_predict = pd.DataFrame({'my text': ['good work', 'you are wrong', 'this was amazing'],
                                   'num data': [2, 54, 1]})
    >>> encoder.transform(to_predict)
    ValueError: Found unknown categories ['this was amazing', 'good work', 'you are wrong'] in column 0 during transform

It can't find those labels in its mapping. However, if you have new observations whose labels are part of the mapping, it will be able to transform them:

    >>> to_predict = pd.DataFrame({'my text': ['i like that', 'i love you', 'i love you'],
                                   'num data': [2, 54, 1]})
    >>> encoder.transform(to_predict)
    <3x8 sparse matrix of type '<class 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>

What you could do instead is add those new observations with new labels to your original df and run them again through your pipeline so they become part of your mapping.
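A rough sketch of that idea, reusing `df`, `encoder`, and the column names from the question:

    import pandas as pd

    # New observations whose labels were not part of the original fit.
    new_rows = pd.DataFrame({'my text': ['good work', 'you are wrong', 'this was amazing'],
                             'num data': [2, 54, 1]})

    # Append them to the original df and refit, so they become part of the mapping.
    full_df = pd.concat([df, new_rows], ignore_index=True)
    encoder.fit(full_df)

    # Both the original rows and the new rows can now be transformed.
    x = encoder.transform(df)
    to_predict = encoder.transform(new_rows)

Note that after refitting the encoder the number of one-hot columns changes, so the IsolationForest would have to be refit on the re-encoded training matrix as well.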

I must admit that I'm not experienced at all with this so please correct me if I'm wrong, but that's the way it looks to me. Good luck with your project.

  • Check my question, I have updated it. At the end of the question I put what I have tried. I have tried to pass a DataFrame into `encoder.transform(...)` but I get the same error. When I made a regression model with categorical data, I passed the DataFrame into `encoder.transform(...)` and it worked; I do not know why it is not working now. I have done the same thing, I'm just using a different algorithm. – taga Oct 09 '19 at 13:51
  • Use `{'my text': [val[0]], 'num data': [val[1]]}` to avoid the `ValueError: If using all scalar values, you must pass an index`. – Arno Maeckelberghe Oct 09 '19 at 13:57
  • I have done that, and I get this error: `ValueError: Found unknown categories ['good work'] in column 0 during transform` – taga Oct 10 '19 at 11:23
  • This answer raises a good point: your test data contains categories not present in training, so it will never work. Try converting `list_of_val` to a df first, concatenate it with `x` row-wise, call `encoder.fit()` on this new df, then individually `transform` both dfs. – Shihab Shahriar Khan Oct 11 '19 at 11:14
0

You have a very similar problem to

AttributeError when using ColumnTransformer into a pipeline

As described there, it is recommended to use pandas for your encoding (there is also an example for one-hot-encoding). I hope that helps!
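If you go the pandas route, a minimal sketch of that idea could look like the following (I'm using `pd.get_dummies` here as one possible pandas-side one-hot encoding; the linked post may do it slightly differently):

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    df = pd.DataFrame({'my text': ['i love you', 'I love your dress', 'i like that'],
                       'num data': [4, 1, 3]})

    # One-hot encode the text column with pandas instead of a ColumnTransformer.
    x = pd.get_dummies(df, columns=['my text'])

    model = IsolationForest(contamination='auto').fit(x)

    # New data must end up with the same columns; reindex adds missing dummy
    # columns as 0 and drops categories the model has never seen.
    new = pd.DataFrame({'my text': ['good work'], 'num data': [2]})
    new_encoded = pd.get_dummies(new, columns=['my text']).reindex(columns=x.columns, fill_value=0)
    print(model.predict(new_encoded))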

Lazloo Xp
  • I have tried that solution but it does not work. That's why I asked a question and put a bounty on it. – taga Oct 09 '19 at 06:31
0

Try to convert your list `list_of_val` into a NumPy array by running:

    import numpy as np
    list_of_val = np.asarray(list_of_val)
secretive
  • Still does not work. I get an error: `ValueError: Specifying the columns using strings is only supported for pandas DataFrames` – taga Oct 08 '19 at 09:20
0

I received this message when the data was a single variable, also known as a time series :)

`PDF` is a pandas DataFrame:

    import numpy as np

    ### pick real data
    X_train = PDF.y          # single dimension, or time series
    y_train = PDF.isAnomaly  # validation variable

    ### reshape for isolation forest
    X_train = np.array(X_train).reshape(-1, 1)
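For context, a tiny end-to-end sketch of the same reshape, with the question's numeric values standing in for `PDF.y`:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Stand-in for PDF.y: a single numeric feature.
    X_train = np.array([4, 1, 3, 2, 65, 3, 3]).reshape(-1, 1)

    model = IsolationForest(contamination='auto').fit(X_train)

    # New single values need the same reshape before predict.
    print(model.predict(np.array([2, 54, 1]).reshape(-1, 1)))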
Curious Watcher