
I'm new to machine learning and am working on a project using Python (3.6), pandas, NumPy and scikit-learn. Classification and reshaping work, but prediction fails with the error "contamination must be in (0, 0.5]".

Here's what I have tried:

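from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import accuracy_score, classification_report

# Determine the number of fraud cases in the dataset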
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# calculate percentages for Fraud & Valid 
outlier_fraction = len(Fraud) / float(len(Valid))
print(outlier_fraction)

print('Fraud Cases : {}'.format(len(Fraud)))
print('Valid Cases : {}'.format(len(Valid)))
# Get all the columns from dataframe
columns = data.columns.tolist()

# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"] ]

# Store the target column name and the features we want to predict on
target = "Class"
X = data.drop(target, axis=1)
Y = data[target]

# Print the shapes of X & Y
print(X.shape)
print(Y.shape)

# define a random state
state = 1

# define the outlier detection method
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                       contamination=outlier_fraction,
                                       random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        contamination=outlier_fraction)
}
# fit the model
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):

    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # run classification metrics 
    print('{}:{}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred ))
    print(classification_report(Y, y_pred ))

Here's what it returns:

ValueError: contamination must be in (0, 0.5]

The traceback points to the line y_pred = clf.predict(X).

I'm new to machine learning and don't know much about contamination, so where did I go wrong?

Help me, please!

Thanks in advance!

Abdul Rehman

2 Answers


ValueError: contamination must be in (0, 0.5]

This means that contamination must be strictly greater than 0.0 and less than or equal to 0.5. (The question "What does this square bracket and parenthesis bracket notation mean [first1,last1)?" is a good explanation of the bracket notation.) As you noted in the comments, print(outlier_fraction) outputs 0.0, so the problem lies in the first six lines of the code you posted.
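To make the interval concrete, here is a minimal sketch of the range check in plain Python. The helper name in_valid_range is hypothetical and is not part of scikit-learn; it only mirrors the condition stated in the error message.

```python
def in_valid_range(contamination):
    """Return True if the value lies in the half-open interval (0, 0.5].

    Hypothetical helper for illustration: 0 is excluded, 0.5 is included.
    """
    return 0.0 < contamination <= 0.5


# The value reported in the question (0.0) falls outside the interval.
print(in_valid_range(0.0))
# A small positive fraction and the upper bound itself are both accepted.
print(in_valid_range(0.0017))
print(in_valid_range(0.5))
# Anything above one half is rejected.
print(in_valid_range(0.6))
```

Running a check like this right after computing outlier_fraction would surface the problem before either model is fitted.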

versatile parsley

LocalOutlierFactor is an unsupervised outlier detection algorithm, introduced in this paper. Each algorithm has its own parameters, and they can substantially change its behavior. You should always study those parameters and their effects before applying the method, or you may get lost among the many parameter options.

LocalOutlierFactor assumes that outliers make up no more than half of your dataset. In practice, I'd say that once outliers account for as much as 30% of your dataset, they're not outliers anymore. They're simply a different type, or class, of data.

On the other hand, you cannot expect the outlier detection algorithm to work if you tell it that you have 0 outliers, which may be the case for you if the outlier_fraction is actually 0.
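As a rough sketch of both points, the example below guards the contamination value before fitting either detector. It runs on a small synthetic array rather than the asker's data, so the sample sizes, the variable names and the two-dimensional features are all assumptions made for illustration. It also computes the value as the share of outliers in the whole dataset, which is how scikit-learn documents contamination, rather than as the ratio of fraud rows to valid rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(1)

# Synthetic stand-in for the real data (an assumption for this sketch):
# 190 "valid" rows near the origin and 10 "fraud" rows far from it.
X_valid = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
X_fraud = rng.normal(loc=6.0, scale=1.0, size=(10, 2))
X = np.vstack([X_valid, X_fraud])
y = np.array([0] * 190 + [1] * 10)

# Proportion of outliers in the whole dataset: 10 / 200 = 0.05.
contamination = float(y.sum()) / len(y)

# Fail early with a clear message instead of deep inside the estimator.
if not 0.0 < contamination <= 0.5:
    raise ValueError(
        "contamination is {}; check that the label column "
        "really contains outlier rows".format(contamination)
    )

detectors = {
    "Isolation Forest": IsolationForest(contamination=contamination,
                                        random_state=1),
    "Local Outlier Factor": LocalOutlierFactor(contamination=contamination),
}

predictions = {}
for name, detector in detectors.items():
    raw = detector.fit_predict(X)  # +1 for inliers, -1 for outliers
    # Map to the question's convention: 0 for valid, 1 for fraud.
    predictions[name] = np.where(raw == -1, 1, 0)
    n_errors = int((predictions[name] != y).sum())
    print("{}: {} errors".format(name, n_errors))
```

If the guard raises on the real data, the next step is to inspect the label column, for example by counting its distinct values, to see why no rows match the fraud label.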

adrin