On evaluation of outlier rankings and outlier scores.
Schubert, E., Wojdanowski, R., Zimek, A., & Kriegel, H. P. (2012, April).
In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 1047-1058). Society for Industrial and Applied Mathematics.
In this publication, we do not "just normalize" outlier scores; we also suggest an unsupervised ensemble member selection strategy called the "greedy ensemble".
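To give a flavor of what such a greedy member selection can look like, here is a heavily simplified sketch. It only illustrates the general idea (build a consensus target from the normalized scores of all detectors, then greedily add the member that most improves agreement with that target); the exact procedure in the paper differs in detail.

```python
import numpy as np

def greedy_select(scores, target, k):
    """Simplified greedy forward selection of ensemble members.

    scores : (m, n) array of m detectors' normalized scores on n objects
    target : (n,) consensus vector, e.g. the mean of all normalized scores
    k      : number of members to select

    Illustration of the general idea only; not the exact procedure
    from the SDM 2012 paper.
    """
    m = scores.shape[0]
    k = min(k, m)
    # start with the detector most correlated with the consensus target
    corr_to_target = [np.corrcoef(s, target)[0, 1] for s in scores]
    selected = [int(np.argmax(corr_to_target))]
    while len(selected) < k:
        ensemble = scores[selected].mean(axis=0)
        base = np.corrcoef(ensemble, target)[0, 1]
        best, best_gain = None, -np.inf
        for i in range(m):
            if i in selected:
                continue
            candidate = scores[selected + [i]].mean(axis=0)
            # gain: how much adding detector i improves agreement with the target
            gain = np.corrcoef(candidate, target)[0, 1] - base
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected
```

Note that combining members by averaging, as in this sketch, only makes sense if the individual scores are comparable in the first place.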
However, normalization is crucial, and difficult. We published some of our earlier progress on score normalization as
Interpreting and unifying outlier scores.
Kriegel, H. P., Kröger, P., Schubert, E., & Zimek, A. (2011, April).
In Proceedings of the 2011 SIAM International Conference on Data Mining (pp. 13-24). Society for Industrial and Applied Mathematics.
If you don't normalize your scores (and min-max scaling is not enough), you will usually not be able to combine them in a meaningful way, except under very strong preconditions. Even two different subspaces will usually yield incomparable values, because they have different numbers of features and different feature scales.
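For illustration, one of the simpler normalizations discussed in that line of work is Gaussian scaling: standardize each detector's scores and push them through the Gaussian error function, so every detector yields values in [0, 1] that can be read roughly as "how outlying is this object". This is only a sketch of one variant, not the full framework of the paper:

```python
import numpy as np
from scipy.special import erf

def gaussian_scaling(scores):
    """Map raw outlier scores to [0, 1] via Gaussian scaling.

    Standardize the scores, then apply the Gaussian error function;
    the inlier half is clipped to 0. One of the simpler normalizations
    in the spirit of the SDM 2011 paper.
    """
    mu, sigma = np.mean(scores), np.std(scores)
    if sigma == 0:
        return np.zeros_like(scores, dtype=float)
    z = (scores - mu) / (sigma * np.sqrt(2.0))
    return np.maximum(0.0, erf(z))

# Scores from two detectors (e.g., run in different subspaces) become
# comparable after scaling and can then be combined, for example:
# combined = 0.5 * (gaussian_scaling(knn_scores) + gaussian_scaling(lof_scores))
```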
There is also some work on semi-supervised ensembles, e.g.
Learning Outlier Ensembles: The Best of Both Worlds—Supervised and Unsupervised.
Micenková, B., McWilliams, B., & Assent, I. (2014).
In Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity (ODD2). New York, NY, USA (pp. 51-54).
Also beware of overfitting. It's quite easy to arrive at a single good result by tweaking parameters and evaluating repeatedly. But this leaks evaluation information into your experiment, i.e. you tend to overfit. Performing well across a large range of parameters and data sets is very hard. One of the key observations of the following study was that for every algorithm you'll find at least one data set and parameter setting where it 'outperforms' the others; but if you change the parameters a little, or use a different data set, the benefits of the "superior" new method are not reproducible.
On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study.
Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016).
Data Mining and Knowledge Discovery, 30(4), 891-927.
So you will have to work really hard to do a reliable evaluation. Be careful about how you choose parameters.
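As a concrete illustration: instead of reporting only the best result found after tuning, evaluate over a whole grid of parameter values (and, ideally, several data sets) and report the distribution. A minimal sketch using scikit-learn's LocalOutlierFactor, assuming binary ground-truth labels y that are used for evaluation only:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

def auc_over_parameter_range(X, y, ks=range(5, 101, 5)):
    """Evaluate LOF over a range of neighborhood sizes k.

    X : (n, d) data matrix
    y : (n,) binary ground-truth labels (1 = outlier), evaluation only

    Returns a dict mapping k to ROC AUC. Report the whole curve (or its
    mean and spread), not just max(aucs.values()) -- picking the best k
    after the fact is exactly the kind of leakage discussed above.
    """
    aucs = {}
    for k in ks:
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(X)
        # negative_outlier_factor_ is higher for inliers,
        # so negate it to obtain an "outlierness" score
        scores = -lof.negative_outlier_factor_
        aucs[k] = roc_auc_score(y, scores)
    return aucs
```

If a method only looks good for one or two values of k, that is exactly the kind of fragile "superiority" described above.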