2

This may be a trivial question. How can we choose a good distance function for a particular high-dimensional dataset? I have read that some distance functions, such as Euclidean distance, do not work well on high-dimensional data. If Euclidean distance cannot give us a good distance measure, then which function can?

Amir Zadeh
  • Therefore, dimensionality-reduction techniques are preferable beforehand. – JeeyCi Aug 04 '23 at 06:55
  • Euclidean distance works for clustering, where it shows the similarity of data points. Thus the choice of a proper distance metric depends on the goal of your research. – JeeyCi Aug 04 '23 at 06:58

1 Answer

3

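This comes from the curse of dimensionality: space becomes exponentially emptier as the dimensionality increases, so the distances from a point to its nearest and farthest neighbours tend to become nearly equal, and a measure like Euclidean distance loses much of its ability to discriminate.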
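To see the effect for yourself, here is a minimal sketch that measures how much farther the farthest point is than the nearest one as the dimension grows. It assumes points drawn uniformly from the unit hypercube, which is only an illustration and not your data; `relative_contrast` is a name I made up for this example.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    # Assumption: uniform random points in the unit hypercube [0, 1]^dim.
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast should shrink as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

When the printed contrast gets close to zero, "nearest neighbour" stops meaning much under that distance.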

The best distance measure is highly data dependent, but I recommend cross-validating with low values of p for the Minkowski distance:

minkowski_distance(u, v) = (sum_i |u_i - v_i|^p)^(1/p)
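As a rough sketch, the formula translates to Python like this. The function name and the example vectors are just illustrative choices, and the function is restricted to p > 0.

```python
def minkowski_distance(u, v, p):
    """(sum_i |u_i - v_i|^p)^(1/p) for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

# Illustrative vectors, not taken from any real dataset.
u, v = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(u, v, 1))    # Manhattan (L1)
print(minkowski_distance(u, v, 2))    # Euclidean (L2)
print(minkowski_distance(u, v, 0.5))  # fractional p
```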

p = 1, the Manhattan distance (L1), is in most high-dimensional cases better than Euclidean distance (L2, p = 2) and is really easy to test. Also try smaller values such as p = 1/4 and see what happens (for p < 1 the triangle inequality no longer holds, so the result is not a true metric, but it can still be used to rank neighbours). You can also try the limit p -> -inf, which is the min-distance min_i |u_i - v_i|. Lower values of p give the dimensions with the most similarity much more weight than the less well-matched dimensions.
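One way to run such a comparison is leave-one-out cross-validation of a 1-nearest-neighbour classifier, once per candidate distance. The sketch below is only that: the toy data (a few informative dimensions plus many noise dimensions) and all function names are my own assumptions, so swap in your own points and labels. Whatever wins on the toy data says nothing about which p is best for your dataset.

```python
import random

def minkowski_distance(u, v, p):
    """(sum_i |u_i - v_i|^p)^(1/p) for p > 0."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def min_distance(u, v):
    """The p -> -inf limit: the smallest per-dimension difference."""
    return min(abs(a - b) for a, b in zip(u, v))

def loo_accuracy(points, labels, dist):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, x in enumerate(points):
        # Nearest neighbour among all other points.
        best_j = min((j for j in range(len(points)) if j != i),
                     key=lambda j: dist(x, points[j]))
        correct += labels[best_j] == labels[i]
    return correct / len(points)

def make_toy_data(n_per_class=40, n_informative=5, n_noise=45, seed=0):
    """Hypothetical data: two Gaussian classes that differ only in a few dimensions."""
    rng = random.Random(seed)
    points, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            informative = [rng.gauss(2.0 * label, 1.0) for _ in range(n_informative)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            points.append(informative + noise)
            labels.append(label)
    return points, labels

points, labels = make_toy_data()
candidates = {f"p={p}": (lambda u, v, p=p: minkowski_distance(u, v, p))
              for p in (0.25, 0.5, 1, 2)}
candidates["min"] = min_distance

scores = {name: loo_accuracy(points, labels, d) for name, d in candidates.items()}
for name, acc in scores.items():
    print(name, round(acc, 3))
best = max(scores, key=scores.get)
print("best on this toy data:", best)
```

With a real dataset you would keep the candidate with the best cross-validated score, ideally checking it again on data that was not used for the selection.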

I recommend reading the paper

http://www-users.cs.umn.edu/~kumar/papers/siam_hd_snn_cluster.pdf

which touches on the subject.

SlimJim