I am currently working on detecting outliers in my dataset using Isolation Forest in Python, and I did not completely understand the example and explanation given in the scikit-learn documentation.
Is it possible to use Isolation Forest to detect outliers in my dataset that has 258 rows and 10 columns?
Do I need a separate dataset to train the model? If yes, is it necessary to have that training dataset free from outliers?
This is my code:
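import numpy as np
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
# Training data: two Gaussian clusters centered at (2, 2) and (-2, -2)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Held-out test data drawn from the same two clusters
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)  # 1 = inlier, -1 = outlier
y_pred_test = clf.predict(X_test)
print(len(y_pred_train))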
I tried loading my own dataset into X_train, but that does not seem to work.
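
For illustration, here is a minimal sketch of fitting on a dataset of that shape. The array is a synthetic stand-in (an assumption, not my real data), and the sketch assumes all 10 columns are numeric with no missing values:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for the real dataset (assumption): 258 rows, 10 numeric columns.
rng = np.random.RandomState(42)
X_train = rng.randn(258, 10)

# Fit and predict on the same array; no separate training set is used here.
clf = IsolationForest(max_samples=100, random_state=42, contamination='auto')
labels = clf.fit_predict(X_train)        # 1 = inlier, -1 = outlier
scores = clf.decision_function(X_train)  # lower score = more anomalous

print(labels.shape, np.unique(labels))
```

With a real dataset, the stand-in array would be replaced by the loaded data converted to a 2-D numeric array of shape (n_samples, n_features).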