
I am using Isolation Forest in R to perform Anomaly Detection on multivariate data.

I tried calculating the anomaly scores along with the contribution of each individual metric to that score. I am able to get the anomaly scores, but I am having trouble calculating the importance of the metrics.

I am able to get the desired result through BigML (an online platform), but not through R.

R code:

> library(solitude) # tried 'IsolationForest' and 'h2o' but not getting desired result
> mo = isolation_forest(data)
> final_scores <- predict(mo,data)
> summary(mo)
     Length Class  Mode
forest 14     ranger list

> head(final_scores,5)
[1] 0.4156554 0.3923926 0.4262782 0.4595296 0.4174865

Output from BigML: (screenshot showing per-instance importance values for each metric, not reproduced here)

I want to get the importance values for every metric (a, b, c, d) through R code, just like what I am getting in BigML.

I think I am missing some basic parameters. I am new to R, so I am not able to figure it out.

I have thought of an approach to get the feature importance at the observation level, but I am having trouble implementing it.

Here is a sketch of what I am planning.

The dots in the figure are individual observations, while the lines are splits based on specific variables.

I am able to trace individual trees of the forest, but there are 500 trees, and tracing each tree and accessing its importance values by hand is impractical. The example below is purely based on dummy data.

(diagram of observations and splits on the dummy data, not reproduced here)

Output of individual tree:

> x = treeInfo(mo$forest,tree=3)
> x
   nodeID leftChild rightChild splitvarID splitvarName  splitval terminal prediction
1       0         1          2          2            c 0.6975663    FALSE         NA
2       1         3          4          1            b 0.3455875    FALSE         NA
3       2         5          6          0            a 0.2620023    FALSE         NA
4       3         7          8          0            a 0.1425075    FALSE         NA
5       4         9         10          0            a 0.6611566    FALSE         NA
6       5        NA         NA         NA         <NA>        NA     TRUE         10
7       6        NA         NA         NA         <NA>        NA     TRUE          2
8       7        NA         NA         NA         <NA>        NA     TRUE          6
9       8        NA         NA         NA         <NA>        NA     TRUE          1
10      9        NA         NA         NA         <NA>        NA     TRUE          3
11     10        NA         NA         NA         <NA>        NA     TRUE          5

Any kind of help is appreciated.


1 Answer


Local feature importance can be estimated with the package lime.

library(solitude)
library(lime)

First, some toy data:

set.seed(1234)
data <- data.frame(rnorm(20, 0, 1), rnorm(20, 0, 0.5))
colnames(data) <- c("x", "y")
row.names(data) <- seq(1, nrow(data), 1)

Have a look at the toy data:

plot(data)
text(data-0.05,row.names(data))

These cases appear to be outliers:

outliers<-c(4,20) 

Grow isolation forest:

model<-isolation_forest(data, importance="impurity")
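With importance = "impurity", the underlying ranger forest also records a model-level importance value for each variable. As a quick check, these can be read off the wrapped ranger object (a sketch, assuming this version of solitude exposes the ranger fit as model$forest, as the summary output in the question suggests):

# model-level importance, one value per variable (not per observation)
model$forest$variable.importance

Note that this is a single value per feature for the whole model; the per-observation values come from lime below.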

As solitude is not supported by lime out of the box, we need to define two functions so that lime can handle solitude objects. The model_type function tells lime what kind of model we have, and the predict_model function enables lime to predict with solitude objects.

model_type.solitude <- function(x, ...) {
  return("regression")
}

predict_model.solitude <- function(x, newdata, ...) {
  pred <- predict(x, newdata)
  return(as.data.frame(pred))
}

Then we can generate the lime explainer and estimate observation-level feature importance (n_permutations could be set higher for more reliable results):

lime1 <- lime(data, model)
importance <- data.frame(explain(data, lime1,
                                 n_features = 2, n_permutations = 500))

Feature importance is in importance$feature_weight. Casewise inspection of results:

importance[importance$case %in% outliers,c("case","feature","feature_weight")]

Plot:

plot_features(importance[importance$case %in% outliers,] , ncol = 2)

Hope that's helpful!

Of course, read up on lime as it is based on certain assumptions.

Vriko
  • I think ranger is the implementation of Random Forest, but I am using Isolation Forest. Also, I tried using it but it didn't work – Sidharth Agarwal Mar 11 '19 at 18:02
  • I didn't mean to suggest that you use another package. Have you tried to insert the importance="impurity" option into your code? – Vriko Mar 12 '19 at 07:05
  • Edited the answer and added toy example. – Vriko Mar 12 '19 at 07:18
  • In your example there are 10 instances of a, b and c. I even get 10 anomaly scores for it, but why only 1 row of importance values? It should give 10 sets of importance values, right? score = predict(a, data). [1] 0.4873994 0.4873994 0.5412548 0.4873994 0.5833292 0.4873994 0.5174812 0.4323802 0.5833292 0.4873994 – Sidharth Agarwal Mar 12 '19 at 07:56
  • The importance measure tells you how much a given feature (a, b, c in the example) influences the result of your prediction. Thus, the same importance values apply to every row in your prediction. – Vriko Mar 12 '19 at 10:14
  • Ok, thanks, but I am looking for something like the output from BigML, which returns the importance values for every instance separately, as I want to study every anomalous instance – Sidharth Agarwal Mar 12 '19 at 10:19
  • I'm not familiar with BigML, but from the output you posted it seems that there is also just one row of importance measures. I suppose the ":" in rows 2 to 5 means that the importance values in row 1 are identical for all subsequent rows. Do you actually get different importance values for each row? – Vriko Mar 12 '19 at 10:26
  • Yes, sorry for the incomplete output, but I am getting different importance values for every row – Sidharth Agarwal Mar 12 '19 at 10:30
  • After some digging I found out that there is a distinction between feature importance at the model level vs. feature importance at the observation level. The model-level importance is what you get with the above code. Importance at the observation level doesn't seem to be implemented in solitude / ranger. You could think about analysing the resulting tree object yourself, but I think that's a rather big & complex undertaking. Sorry. – Vriko Mar 12 '19 at 10:39
  • Yeah anyways Thanks for your help !! – Sidharth Agarwal Mar 12 '19 at 10:43
  • Maybe you could rephrase your question to something like: "Is there a way to calculate feature importance at observation level in ranger / isolation forest?" With an accurate question it becomes more likely that someone can help you. Also, maybe update the BigML output to show what you actually want. – Vriko Mar 13 '19 at 06:48
  • Did you see the changes in the question, I have thought of something but its difficult to implement. Can you think of something ? – Sidharth Agarwal Mar 13 '19 at 10:40
  • As for how to implement this, you would have to trace each observation back through all the splits and record the number of times each predictor is used in a split. Do this for each observation across all the trees, and then you can calculate the average importance of a given feature for each observation. Again, I think it's a rather complex task. To get the structure of a tree, look at the treeInfo function from the ranger package. With the above example, try library(ranger); treeInfo(a$forest, tree = 1). There you get the structure of the tree, and maybe you can build on that (a rough sketch of this idea follows after these comments). Good luck! – Vriko Mar 14 '19 at 08:29
  • The importance values for the first instance are: [1] 0.004336022 0.007968072 -0.004297028. Even if I sum them up they don't add to a whole (1 or 100), so are these values exactly the importance values? – Sidharth Agarwal Mar 15 '19 at 20:16
  • LIME operates by permuting the features of an observation and recording the distance of the new prediction to the one with the original feature values (please see the documentation for details). I suppose that the importance values are somehow derived from these distance measures. With an isolation forest that would mean that a feature becomes more important the more its importance value deviates from zero. To build a score that adds up to one: scores <- c(0.0043, 0.0079, -0.0042) and then std.scores <- abs(scores)/sum(abs(scores)). This could help distinguish relevant from irrelevant features. – Vriko Mar 19 '19 at 06:03
  • Wondering if you may consider updating this example with the latest implementation of solitude? Additionally, the way I understand this example is that I build the explainer object on my training data set and use it on my scoring data set to generate local explanations. Please clarify. Thanks! – FlyingPickle Oct 03 '20 at 00:02
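Following up on the per-observation idea described in the comments above (trace each observation through every tree and count how often each variable is used in a split along its path), here is a rough sketch. It assumes the ranger object is stored in mo$forest as in the question, relies on the treeInfo() column layout shown there, and on ranger's convention that numeric values less than or equal to splitval go to the left child. The function name count_splits_per_obs is made up for illustration; it simply counts splits (ignoring path depth) and will be slow for large data sets with many trees.

library(ranger)

count_splits_per_obs <- function(forest, data) {
  vars <- colnames(data)
  # counts[i, j]: how often variable j is used on the path of observation i
  counts <- matrix(0, nrow = nrow(data), ncol = length(vars),
                   dimnames = list(rownames(data), vars))
  for (t in seq_len(forest$num.trees)) {
    tree <- treeInfo(forest, tree = t)
    for (i in seq_len(nrow(data))) {
      node <- 0
      repeat {
        node_row <- tree[tree$nodeID == node, ]
        if (node_row$terminal) break
        v <- as.character(node_row$splitvarName)
        counts[i, v] <- counts[i, v] + 1
        # ranger sends values <= splitval to the left child, larger values to the right
        node <- if (data[i, v] <= node_row$splitval) node_row$leftChild else node_row$rightChild
      }
    }
  }
  counts / rowSums(counts)  # normalise so each observation's counts sum to 1
}

obs_importance <- count_splits_per_obs(mo$forest, data)
head(obs_importance)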