This may be a trivial question. How can we choose a good distance function for a particular high-dimensional dataset? I have read that some distance functions, such as Euclidean distance, do not work well on high-dimensional data. If Euclidean distance cannot give us a good distance measure, which function can?
-
Therefore, dimensionality reduction techniques are preferable beforehand. – JeeyCi Aug 04 '23 at 06:55
-
Euclidean distance works for clustering, where it shows the similarity of data points. Thus the choice of a proper distance metric depends on the goal of your research. – JeeyCi Aug 04 '23 at 06:58
1 Answer
The problem comes from the curse of dimensionality: as the number of dimensions increases, the volume of the space grows exponentially, so the data becomes increasingly sparse.
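To make the sparsity effect concrete, here is a minimal sketch that measures how the gap between the nearest and the farthest neighbour shrinks as the dimension grows. It assumes points drawn uniformly from the unit hypercube; the function name `relative_contrast`, the sample size, and the seed are hypothetical choices for illustration only.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# A small contrast means the nearest and farthest points are almost
# equally far away, so "nearest" carries little information.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

On uniform data like this the contrast tends to drop sharply with the dimension; real datasets with structure may behave less badly.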
The best distance measure is highly data dependent, but I recommend cross-validating with low values of p for the Minkowski distance:
minkowski_distance(u, v) = (sum_i |u_i - v_i|^p)^(1/p)
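The formula above can be sketched in plain Python as follows. The function name `minkowski_distance` and the handling of the infinite limits are my own assumptions for illustration, not part of any particular library.

```python
import math

def minkowski_distance(u, v, p):
    """Minkowski-style distance between two equal-length sequences.

    p = math.inf returns the largest coordinate difference and
    p = -math.inf the smallest one (the limiting cases).
    For 0 < p < 1 the result is not a true metric, because the
    triangle inequality can fail.
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)
    if p == -math.inf:
        return min(diffs)
    if p <= 0:
        raise ValueError("finite p must be positive in this sketch")
    return sum(d ** p for d in diffs) ** (1.0 / p)

u, v = (0, 0), (3, 4)
for p in (0.25, 1, 2, math.inf, -math.inf):
    print(p, minkowski_distance(u, v, p))
```

Finite non-positive values of p are rejected here only to keep the sketch short; the limits at plus and minus infinity are handled as special cases.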
p = 1, the Manhattan distance (L1), is in most higher-dimensional cases better than the Euclidean distance (p = 2, L2) and is really easy to test. Also try smaller values such as p = 1/4 and see what happens. You can also try the limit p -> -inf, which is the minimum distance min_i(|u_i - v_i|). Lower values of p give the dimensions with the most similarity much more weight compared to the less well-matching dimensions.
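One way to run the comparison suggested above is leave-one-out cross-validation of a 1-nearest-neighbour classifier for a few values of p. This is only a sketch under stated assumptions: the toy data generator `make_toy_data`, its parameters, and the choice of a 1-NN classifier are hypothetical, and the result on your own data may look quite different.

```python
import random

def minkowski(u, v, p):
    # Minkowski-style distance for p > 0 (not a true metric when p < 1).
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)

def make_toy_data(n_per_class=30, dim=50, informative=5, seed=0):
    """Hypothetical toy data: two classes that differ only in the first
    `informative` coordinates; all other coordinates are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for j in range(informative):
                row[j] += 2.0 * label  # shift informative coordinates for class 1
            X.append(row)
            y.append(label)
    return X, y

def loo_accuracy(X, y, p):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Nearest neighbour among all points except the held-out one.
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: minkowski(xi, X[j], p))
        correct += (y[nearest] == yi)
    return correct / len(X)

X, y = make_toy_data()
scores = {p: loo_accuracy(X, y, p) for p in (0.25, 0.5, 1, 2)}
for p, acc in scores.items():
    print(f"p={p}: leave-one-out accuracy = {acc:.2f}")
best_p = max(scores, key=scores.get)
print("best p on this toy data:", best_p)
```

Replace the toy data with your own features and labels, and pick the p that scores best on held-out data rather than trusting a general rule.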
I recommend reading the paper
http://www-users.cs.umn.edu/~kumar/papers/siam_hd_snn_cluster.pdf
which touches on the subject.
