
I'm trying to apply L2 normalization to a double vector in Java.

double[] vector = {0.00423823948, 0.00000000000823285934, 0.0000342523505342, 0.000040240234023423, 0, 0};

Now if I apply the L2 normalization:

double squareVectorSum = 0.0;
for (double i : vector) {
    squareVectorSum += i * i;
}

double normalizationFactor = Math.sqrt(squareVectorSum);

double[] vector_result = new double[vector.length];
for (int i = 0; i < vector.length; i++) {
    vector_result[i] = vector[i] / normalizationFactor;
}

My normalized vector comes out like this:

Normalized vector (l2 normalization)
0.9999222784309146 1.9423676996312713E-9 0.008081112110203743 0.009493825603572155 0.0 0.0

Now if I take the squared sum of all the normalized-vector components, I should get a sum equal to one; instead my squared sum is

double sum = 0.0;
for (double i : vector_result) {
    sum += i * i;
}
Squared sum of the normalized-vector
1.0000000000000004

Why is my sum not equal to one? Is there a problem in the code? Or is it just that my numbers are small and there is some approximation error with doubles?
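For reference, you can measure how far off the sum actually is in units of machine precision using the standard `Math.ulp`, which returns the spacing between a double and its next representable neighbor. A minimal sketch (the class name is just for illustration):

```java
public class UlpDemo {
    public static void main(String[] args) {
        double sum = 1.0000000000000004; // the squared sum from the question

        // Math.ulp(1.0) is the gap between 1.0 and the next
        // representable double, about 2.22e-16.
        double ulp = Math.ulp(1.0);

        // How many representable doubles away from 1.0 the sum is.
        double offBy = (sum - 1.0) / ulp;
        System.out.println("off by " + offBy + " ulp(s)"); // off by 2.0 ulp(s)
    }
}
```

The result is only two ulps away from 1.0, i.e. the second-closest double to the exact answer, which is about as good as an iterated floating point sum can be expected to get.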

Jacopo Terrinoni
  • possible duplicate of [Why do I see a double variable initialized to some value like 21.4 as 21.399999618530273?](http://stackoverflow.com/questions/177506/why-do-i-see-a-double-variable-initialized-to-some-value-like-21-4-as-21-3999996) – JackWhiteIII Jun 25 '15 at 14:37
  • This is a common situation. Floating point arithmetic has rounding errors which aggregate during each operation. You are observing the results of that behavior. – andand Jun 25 '15 at 14:39
  • Consider looking at http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – andand Jun 25 '15 at 14:42
  • Is there a way to avoid this problem without losing precision? I have to use this kind of normalization for text retrieval and text mining, with term frequency and inverse document frequency, and I don't want to lose precision in my calculations. – Jacopo Terrinoni Jun 25 '15 at 14:53
  • Such a loss of precision normally won't matter much for applications like text retrieval. If you really need exact results, I think the only way is to keep each number as a fraction, and all the calculations will become nasty. Note that your samples are just a small approximate subset of real-world texts: the difference between your TF-IDF and real-world TF-IDF is greater than `0.0000000000000004`. Let alone that no model is perfect. – Morrissss Jun 25 '15 at 15:02

1 Answer


As indicated in the comments, this is a common issue, and one you will have to deal with whenever you use binary floating point arithmetic. The problem mostly crops up when you compare two floating point numbers for equality: since the operations applied to arrive at the two values may not be identical, neither may their binary representations be.

There are at least a couple of strategies you can consider to deal with this situation. The first compares the absolute difference between two floating point numbers x and y against some small value ϵ > 0, rather than testing strict equality. That would look something like

if (Math.abs(y-x) < epsilon) {
    // Assume x == y
} else {
    // Assume x != y
}

This works well when the possible values of x and y have relatively tight bounds on their exponents. When this is not the case, the values of x and y may be such that the difference always dominates the ϵ you choose (if the exponents are too large) or ϵ always dominates the difference (if the exponents are too small). To get around this, instead of comparing the absolute difference, you can compare the ratio of x to y against 1.0 and check whether that ratio differs from 1.0 by less than ϵ. That would look like:

if (Math.abs(x/y-1.0) < epsilon) {
    // Assume x == y
} else {
    // Assume x != y
}

You will likely need to add another check to ensure y!=0 to avoid division by zero, but that's the general idea.
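Putting the two tests together, a hypothetical helper (the name `approxEquals` is mine, not a standard API) might look like the sketch below. It falls back to the absolute test near zero and otherwise uses a relative test; multiplying by the larger magnitude instead of dividing sidesteps the y == 0 concern entirely:

```java
public class ApproxEquals {
    // Combined comparison: absolute test near zero, relative test
    // elsewhere. Epsilon is chosen by the caller.
    static boolean approxEquals(double x, double y, double epsilon) {
        double diff = Math.abs(x - y);
        if (diff < epsilon) {
            return true; // covers values near zero, including y == 0
        }
        // Relative test without division: diff is small compared to
        // the larger of the two magnitudes.
        return diff < epsilon * Math.max(Math.abs(x), Math.abs(y));
    }

    public static void main(String[] args) {
        System.out.println(approxEquals(1.0000000000000004, 1.0, 1e-9)); // true
        System.out.println(approxEquals(1.1, 1.0, 1e-9));                // false
    }
}
```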

Other options include using a fixed point library for Java or a rational number library for Java. I have no recommendations for that, though.
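One standard-library option worth noting: `java.math.BigDecimal` does exact decimal arithmetic, at the cost of speed and of having to choose an explicit rounding mode for division (so it is not a drop-in replacement for the normalization above, which needs a square root). A small sketch contrasting it with double:

```java
import java.math.BigDecimal;

public class ExactSum {
    public static void main(String[] args) {
        // With double, ten additions of 0.1 do not give exactly 1.0,
        // because 0.1 has no exact binary representation.
        double d = 0.0;
        for (int i = 0; i < 10; i++) {
            d += 0.1;
        }
        System.out.println(d); // 0.9999999999999999

        // BigDecimal, constructed from a String, represents 0.1
        // exactly and sums it without any rounding error.
        BigDecimal b = BigDecimal.ZERO;
        BigDecimal tenth = new BigDecimal("0.1");
        for (int i = 0; i < 10; i++) {
            b = b.add(tenth);
        }
        System.out.println(b); // 1.0
    }
}
```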

andand