How to choose a proper distance function

Question

This may be a trivial question. How can we choose a good distance function for a particular high-dimensional dataset? I have read that some distance functions, such as Euclidean distance, do not work well on high-dimensional data. If Euclidean distance cannot give us a good distance measure, which function can?

therefore Dimensionality_Reduction techniques are preferable beforehand — JeeyCi, Aug 04 '23 at 06:55
Euclidean distance works for Clustering - showing similarity of data_points. Thus the choice of proper distance-metric depends on the goal of your research — JeeyCi, Aug 04 '23 at 06:58

score 3 · Accepted Answer · answered Aug 28 '12 at 20:27

The problem comes from the curse of dimensionality: for a fixed number of points, space becomes exponentially emptier as dimensionality increases, so all points end up far away from each other.
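One way to see this effect is to measure how much the nearest and farthest neighbours of a query point differ as the dimension grows. The snippet below is only a rough sketch: the function name `relative_contrast`, the uniform random data in the unit cube, and the sample size are all assumptions made for illustration, not something taken from a particular dataset.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in the unit cube [0, 1]^dim.
    The data is synthetic and uniform, which is an assumption of this sketch."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# Print the contrast for a few dimensions; smaller values mean the nearest
# and farthest points are at nearly the same distance from the query.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

On data like this the contrast should shrink as `dim` grows, which is why "nearest" becomes less meaningful under Euclidean distance in high dimensions.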

The best distance measure is highly data dependent, but I recommend running a cross-validation with low values of p for the Minkowski distance:

minkowski_distance(u, v) = (sum_i |u_i - v_i|^p)^(1/p)
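As a minimal sketch, the formula can be written in plain Python as follows. The function name and the explicit handling of `p = inf` (largest coordinate difference) and `p = -inf` (smallest coordinate difference) are choices made for this example; other negative or zero values of p are rejected here simply to avoid division by zero when two coordinates match.

```python
import math

def minkowski_distance(u, v, p):
    """Minkowski distance (sum_i |u_i - v_i|^p)^(1/p) between two
    equal-length sequences of numbers."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)   # limit p -> +inf: largest coordinate difference
    if p == -math.inf:
        return min(diffs)   # limit p -> -inf: smallest coordinate difference
    if p <= 0:
        raise ValueError("this sketch only supports p > 0 or p = +/-inf")
    return sum(d ** p for d in diffs) ** (1.0 / p)

u, v = [0, 0], [3, 4]
print(minkowski_distance(u, v, 1))          # Manhattan: 7.0
print(minkowski_distance(u, v, 2))          # Euclidean: 5.0
print(minkowski_distance(u, v, 0.25))       # fractional p
print(minkowski_distance(u, v, -math.inf))  # min difference: 3
```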

p=1, which is the Manhattan distance (L1), is in most higher-dimensional cases better than Euclidean distance (L2) and is really easy to test. Also try smaller values such as p=1/4 and see what happens (for p < 1 the triangle inequality no longer holds, so this is not a true metric, but it can still be used as a dissimilarity measure). You can also try the limit p -> -inf, which is the min-distance min_i(|u_i-v_i|). Lower values of p give the dimensions where the two points are most similar much more weight than the dimensions where they match less well.
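A cross-validation over p could look roughly like the sketch below. Everything here is an assumption made for illustration: the synthetic two-class data from `make_toy_data`, the leave-one-out 1-nearest-neighbour classifier used as the scoring task, and the particular list of p values. With real data you would substitute your own dataset and evaluation task.

```python
import math
import random

def minkowski_distance(u, v, p):
    """(sum_i |u_i - v_i|^p)^(1/p) for p > 0, or min_i |u_i - v_i| for p = -inf."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == -math.inf:
        return min(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def make_toy_data(n_per_class=40, dim=50, informative=5, seed=0):
    """Hypothetical data: Gaussian noise in every dimension, with only the
    first `informative` dimensions shifted for class 1."""
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for j in range(informative):
                x[j] += 2.0 * label
            data.append((x, label))
    return data

def loo_accuracy(data, p):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier under distance p."""
    correct = 0
    for i, (x, label) in enumerate(data):
        # Nearest other point; ties on distance are broken by the label value.
        _, predicted = min(
            (minkowski_distance(x, other, p), other_label)
            for j, (other, other_label) in enumerate(data)
            if j != i
        )
        correct += predicted == label
    return correct / len(data)

data = make_toy_data()
scores = {p: loo_accuracy(data, p) for p in (0.25, 1, 2, -math.inf)}
for p, acc in scores.items():
    print(f"p={p}: leave-one-out accuracy {acc:.3f}")
```

Which p scores best depends entirely on the data, so the numbers from this toy set should not be read as a general ranking.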

I recommend reading the paper

http://www-users.cs.umn.edu/~kumar/papers/siam_hd_snn_cluster.pdf

which touches on the subject.
