
I'm computing the cosine similarity between pairs of vectors. Each vector has n elements, V = {v[0], v[1], ...}, such as {age, height, ...}.

Currently, I do not normalize the individual elements. In other words, elements with larger absolute values dominate the similarity computation: e.g. for a person who is 180 cm tall and only 10 years old, height will affect the similarity far more than age.
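To make the problem concrete, here is a minimal sketch of the unnormalized computation (the `cosine_similarity` helper and the (age, height) example vectors are my own illustration, not code from the question):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical people as (age, height_cm) vectors
child = [10, 180]
adult = [40, 180]

# Height dominates: despite a 30-year age gap, the similarity stays close to 1
print(cosine_similarity(child, adult))  # ~0.987
```

Because the raw height values are an order of magnitude larger than the age values, the dot product and norms are driven almost entirely by the height component.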

I'm considering three variations of feature scaling, borrowed from wiki (http://en.wikipedia.org/wiki/Feature_scaling):

  1. Rescaling (subtract the min and divide by the range)
  2. Standardization (subtracting the mean and dividing by standard deviation)
  3. Using percentiles (take the distribution of all values for a specific element and replace each value with the percentile it falls into)
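
The three methods above can be sketched as follows; each function maps one feature column to its normalized form (the function names and the empirical-percentile formulation are my own assumptions, not from the question):

```python
import statistics
from bisect import bisect_right

def rescale(xs):
    # 1. Rescaling: (x - min) / (max - min), maps values into [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # 2. Standardization: (x - mean) / stdev, gives zero mean and unit variance
    mu = statistics.mean(xs)
    sigma = statistics.stdev(xs)
    return [(x - mu) / sigma for x in xs]

def percentile_rank(xs):
    # 3. Percentiles: fraction of observed values <= x (empirical CDF)
    s = sorted(xs)
    n = len(s)
    return [bisect_right(s, x) / n for x in xs]

heights = [150, 160, 170, 180]
print(rescale(heights))          # [0.0, 0.333..., 0.666..., 1.0]
print(percentile_rank(heights))  # [0.25, 0.5, 0.75, 1.0]
```

Each column would be normalized independently before the cosine similarity is computed, so no single feature dominates purely because of its units.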

It would be helpful if someone could explain the benefits of each and how I would go about determining which normalization method is the right one to use. Having done all three, the sample results I get are, for instance:

none: 1.0
standardized: 0.963
scaled: 0.981
quantile: 0.878
    [This question and its answers](http://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization) might help. – Ioanna Mar 21 '17 at 12:31
  • Possible duplicate of [Linear Regression :: Normalization (Vs) Standardization](https://stackoverflow.com/questions/32108179/linear-regression-normalization-vs-standardization) – Shihe Zhang Aug 24 '17 at 03:03
