Is there any function to calculate the distance between records in a mixed-attribute dataset? For example, how do I calculate the distance D = d1 - d2, where d1 = (100, TCP, 1480) and d2 = (200, ICMP, 1650)?

2 Answers
If you happen to be using the dreaded KDDCup 1999 data set, please read this answer: https://stackoverflow.com/a/22522174/1060350 - the data set is useless, so don't use it anymore.
You can try distances such as Gower's distance. But most likely, they won't be of any use on netflow data. You should try to incorporate domain knowledge instead: answer the question "when are two netflows similar?", then put this into an equation, instead of trying to find an equation that magically works.
One of the reasons why Gower or any other stock distance function will not work is that network data has very skewed distributions, and usually no negative values. It just is not a real Euclidean space.
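For reference, here is a minimal sketch of what Gower's distance would compute for the two records in the question. Numeric attributes contribute their absolute difference divided by the attribute's range, categorical attributes contribute 0 on a match and 1 on a mismatch, and the result is the unweighted mean. The function name `gower_distance` and the ranges (1000 and 2000) are assumptions made up for illustration; real ranges would have to come from your data, and the caveats above about skewed network data still apply.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    ranges: Sequence[Optional[float]],
) -> float:
    """Unweighted Gower distance between two mixed-attribute records.

    ranges[k] is the (max - min) of numeric attribute k over the data set,
    or None to mark attribute k as categorical.
    """
    if not (len(a) == len(b) == len(ranges)):
        raise ValueError("records and ranges must have the same length")
    if len(a) == 0:
        raise ValueError("records must have at least one attribute")

    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Categorical attribute: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
        else:
            if r <= 0:
                raise ValueError("numeric ranges must be positive")
            # Numeric attribute: range-normalised absolute difference.
            total += abs(x - y) / r
    return total / len(a)


# Records from the question.
d1 = (100, "TCP", 1480)
d2 = (200, "ICMP", 1650)

# ASSUMPTION: hypothetical attribute ranges, not taken from any real data set.
ranges = (1000.0, None, 2000.0)

print(gower_distance(d1, d2, ranges))
```

With these assumed ranges the three contributions are 0.1, 1 and 0.085, so the distance is about 0.395. Note how the single categorical mismatch dominates the result, which is one reason a stock formula like this may not reflect what "similar" means for network traffic.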

In engineering and science we make use of dimensionless numbers to describe situations, and use relevant characteristic scales to create those dimensionless numbers. For example, if you were examining turbulent fluid flow you might well be bewildered by the apparently numerous variables. But turbulent fluid flow is dominated by the interplay of momentum acting against viscosity. It can be shown that there are actually only a few key characteristic measures of a system, and the interplay can be expressed as a ratio. The ratio is dimensionless (it is called the Reynolds number). A large value means turbulent flow; a low value means laminar (smooth) flow. This number is therefore a kind of distance function, indicating how distant we are from imperturbable smooth flow. In relativity, distances in space and time can be expressed as a single distance by converting the time difference to a length (multiplying it by the speed of light), then treating that length alongside the three space dimensions, because the speed of light is a characteristic velocity scale for the situation.
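Written out, the two examples look roughly like this. The symbols are the conventional textbook ones rather than anything defined in this answer: ρ is the fluid density, u a characteristic velocity, L a characteristic length, μ the dynamic viscosity, and c the speed of light.

$$
\mathrm{Re} = \frac{\rho\, u\, L}{\mu}
$$

$$
s^2 = -(c\,\Delta t)^2 + \Delta x^2 + \Delta y^2 + \Delta z^2
$$

In both cases a characteristic scale turns quantities with different units into something comparable. In the second formula the time term enters with the opposite sign to the space terms (the overall sign is a matter of convention), so it is not a plain Euclidean sum.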
So, you ought to use some domain knowledge to do likewise.
However, you should also stop to ask yourself whether distance is even a meaningful concept. Distance is a measure on a proportional scale: we can speak meaningfully of one distance being twice another distance. If the attributes you are considering are not measured on proportional scales, to talk about distance is nonsense. I note that your data includes "TCP" and "ICMP", which are unordered, discrete values. Distance might simply be a meaningless concept for your data set.
