Context

I'm working on a Python 3.11 project and I'm having a difficult time understanding how the float type works.

More specifically, I'm working with distances between data points, and I also have thresholds for these distances. Let me explain in order. I have a value called threshold, a numpy.float32, which is the distance between two arbitrary data points. I use it as a threshold for other distances, but before using it I floor it to the 10th decimal place:

display(threshold)
threshold_floored = math.floor(threshold * 10000000000)/10000000000
display(threshold_floored)

>>> output:
    0.16666667
    0.1666666716

I then run a clustering algorithm that builds clusters based on distance and uses threshold_floored as the threshold. Points in cluster A should have a distance smaller than or equal to threshold_floored to points in cluster B. If for some reason the distance between a point in cluster A and a point in cluster B is bigger than threshold_floored, I print a sentence to notify me of this error.

Running my code I sometimes see the printed sentence, but when I check I get this:

display(threshold_floored)
display(distance_pointsAB)

>>> output:
    0.1666666716
    0.16666667

The distance is less than threshold_floored (but equal to threshold), but then why do I get the notification? BTW the notification code is this:

if distance_pointsAB > threshold_floored:
    print("Notification")

Problem

However I noticed the following things:

distance_pointsAB_floored = math.floor(distance_pointsAB * 10000000000)/10000000000

display(threshold)
display(threshold_floored)
display(distance_pointsAB)
display(distance_pointsAB_floored)

print("{0:.60f}".format(threshold))
print("{0:.60f}".format(threshold_floored))
print("{0:.60f}".format(distance_pointsAB))
print("{0:.60f}".format(distance_pointsAB_floored))

>>> output:
    0.16666667
    0.1666666716
    0.16666667
    0.1666666716

    0.166666671633720397949218750000000000000000000000000000000000 <---- threshold
    0.166666671600000010355913104831415694206953048706054687500000 <---- threshold_floored
    0.166666671633720397949218750000000000000000000000000000000000 <---- distance_pointsAB
    0.166666671600000010355913104831415694206953048706054687500000 <---- distance_pointsAB_floored

The notification now makes sense because, with the decimals extended, distance_pointsAB is indeed bigger than threshold_floored.

However, why doesn't math.floor round threshold or distance_pointsAB to 0.166666671600000000000000000000000000000000000000000000000000?
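A minimal sketch of what happens in the flooring step, assuming threshold is held as a plain Python float with the exact value printed above (the name floored_int is only for illustration). math.floor returns an exact integer, but the division by 10000000000 has to round to the nearest representable binary float, and 0.1666666716 is not one of them:

```python
import math
from decimal import Decimal

# The float32 value from the output above, stored here as a Python float.
# This literal is exactly representable, so no rounding happens on parsing.
threshold = 0.16666667163372039794921875

# math.floor returns an exact Python int.
floored_int = math.floor(threshold * 10_000_000_000)
print(floored_int)  # 1666666716

# The division must round to the nearest binary float, because
# 0.1666666716 has no finite base-2 representation.
threshold_floored = floored_int / 10_000_000_000

# Decimal(float) shows the exact value that is actually stored.
print(Decimal(threshold_floored))
print(Decimal(threshold_floored) == Decimal("0.1666666716"))  # False
```

The trailing digits after 0.1666666716 are therefore not a failure of math.floor; they are the closest value a binary float can hold.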

Also, since my clustering algorithm should separate points in cluster A and cluster B if their distance is less than my threshold, and I used threshold_floored as the criterion, why do I get points in A and in B whose distance is bigger than the threshold? It seems that my clustering algorithm used threshold instead of threshold_floored. Am I right?

Is there a way to work properly with floats?

EDIT

I found the problem. My threshold was a numpy.float32, and flooring it converted it into a Python float. My clustering algorithm then converted threshold_floored back to numpy.float32, while distance_pointsAB ended up as a float. The solution is to set the value types consistently.

Thanks, everybody, for your advice!
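The type mismatch described here can be reproduced with a small sketch. To keep it dependency-free, the float32 conversion is emulated with the standard struct module through a hypothetical helper to_float32; in NumPy the equivalent conversion would be np.float32(x). The values are the ones printed in the question, not the real clustering code:

```python
import math
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest float32 and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Exact float32 value of threshold from the question.
threshold = 0.16666667163372039794921875
threshold_floored = math.floor(threshold * 10_000_000_000) / 10_000_000_000
distance_pointsAB = threshold  # the distance printed in the question

# Converting the floored value back to float32 undoes the flooring,
# because the change is far smaller than the float32 spacing.
print(to_float32(threshold_floored) == threshold)  # True

# Compared against the double-precision floored threshold: notification fires.
print(distance_pointsAB > threshold_floored)  # True

# Compared against the float32 version, as the clustering did: it does not.
print(distance_pointsAB > to_float32(threshold_floored))  # False
```

Keeping both sides of the comparison in the same type makes the clustering step and the check agree.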

SuperFluo
  • Use [`Decimal`](https://docs.python.org/3/library/decimal.html) – roganjosh Jan 17 '23 at 17:35
  • Depending on exactly what you're trying to do and which discrepancy you're concerned about, the preceding two comments may not be what you care about. Distances are best modeled, I think, as real numbers, and floating-point provides a decent approximation of real numbers, albeit with finite precision. But the finite precision is a certain number of *bits* in base 2, not digits in base 10. When you print finite-precision base-2 fractions out in decimal, they look weird. – Steve Summit Jan 17 '23 at 17:55
  • `numpy.float32` values don't have 10 digits of precision, so you are trying to get a threshold precision that the datatype doesn't support. – John Coleman Jan 17 '23 at 17:57
  • @roganjosh: `Decimal` is not an appropriate type for working with distances between points. It may have some use with currencies that are measured in decimal units and with situations where humans want to work with decimal numerals, but it is not better than binary floating-point for working with points and geometry and is generally worse in performance and accuracy for a given precision. – Eric Postpischil Jan 17 '23 at 19:17
  • @EricPostpischil what do you propose as an alternative? – roganjosh Jan 17 '23 at 21:46
  • @roganjosh: I question OP’s need for quantizing the threshold at all. No reason for it is stated in the post; they merely abruptly state “But before using it [the threshold], I floor it to the 10th decimal number” without giving any reason why. A spatial clustering algorithm should not have any need of quantizing its threshold in decimal or otherwise. So the alternative is simple: Take the quantization step out, and use the threshold value directly. – Eric Postpischil Jan 17 '23 at 21:58
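John Coleman's point about float32 precision can be checked with a short sketch. It assumes the threshold value printed in the question and uses a hypothetical helper, float32_neighbors, that steps the float32 bit pattern up and down by one (valid for positive, finite values away from the ends of the range):

```python
import struct

def float32_neighbors(x: float) -> tuple[float, float]:
    """Return the float32 values just below and just above a positive float32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    below = struct.unpack("<f", struct.pack("<I", bits - 1))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return below, above

# Exact float32 value of threshold from the question.
threshold = 0.16666667163372039794921875
below, above = float32_neighbors(threshold)

# Spacing between adjacent float32 values near the threshold: 2**-26.
print(above - threshold)
print(threshold - below)

# The 10th decimal place is far finer than that spacing.
print((above - threshold) > 1e-10)  # True
```

Since adjacent float32 values near 0.1667 are about 1.5e-8 apart, a float32 cannot distinguish thresholds that differ only in the 10th decimal place.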
