I am building a system to read tables from heterogeneous documents and would like to know the best way of managing (columns of) floating point numbers. Where the column can be represented as real numbers I will use List<Double> (I'm using Java but experience from other languages would be useful.) I also wish to serialize the table as a CSV file. Thus a table might look like:

"material", "mass (g)", "volume (cm3)",
"iron", 7.8, 1.0,
"aluminium", 27.3, 9.9,

and column2 (1-based) would be represented by a List<Double>

{new Double(7.8), new Double(27.3)} 

I may also wish to compute the density (mass/volume) and derive a new column ("density (g.cm-3)") as a List<Double>

{new Double(7.8), new Double(2.76)} 

However, the input values are sometimes missing, unusual or represented by fuzzy concepts. Some transformations may throw exceptions (which I would catch and replace with one of the special values listed below). Examples include:

1.0E+10000
>10
10 / 0.0 (i.e. divide by zero)
Math.sqrt(-1.)
Math.tan(Math.PI/2.0)

I have the following options in Java for unusual values of a list element

  1. null reference
  2. Double.NaN
  3. Double.MAX_VALUE
  4. Double.POSITIVE_INFINITY

Are there protocols for when the Java unusual values above should be used? I have read this question on how they behave. (I would like to rely on chaining of their operations.) And if there are protocols, can the values be serialized and read back in? (E.g. does Java parse "0x7ff0000000000000L" to a number equal to Double.POSITIVE_INFINITY?)

I am prepared for some loss of precision in specification (there are often errors in OCR, missing digits etc. so this is a "good enough" exercise).

peter.murray.rust
  • It's difficult to answer this without knowing what you want to do with the data. But the answer is probably a toss-up between null and NaN in the general case. – Oliver Charlesworth Feb 26 '13 at 09:48
  • @Oli agreed. There is no definitive use. In the first instance I would wish to display it and offer it for capture by others. If I cannot serialize NaN and read back in that alters the balance – peter.murray.rust Feb 26 '13 at 09:50
  • This is not directly related to your question, but wouldn't it make more sense to represent each line by an object (then containing the null or NaN value) instead of a list for each column? And for your question I would instinctively go for NaN, but I don't think there's an established rule or good practice for that. – benzonico Feb 26 '13 at 09:54
  • @benzonico. Thanks. I would like to make the distinction between "null -> no value given" and "NaN -> value given but uninterpretable" – peter.murray.rust Feb 26 '13 at 09:57

1 Answer

You have three problems that you ought to separate to some extent:

  1. What representation should you use for table entries, which might be numbers, numbered quantities of some units, or other things?

  2. How might floating-point infinities and NaNs serve you?

  3. How can floating-point objects be serialized (written to a file and read from a file)?

Regarding these:

  1. You have not specified enough information here for good advice about how to represent table entries. From what you describe, there is no reason to use floating point at all. This is because you have not specified what operations you want to perform on the entries other than reading and writing them. If you do not need to do arithmetic, there is no reason to bother converting values to floating point, or to any other number-arithmetic system. You could simply maintain the entries as their original text. This makes serialization trivial.

  2. Floating-point infinities act like mathematical infinity, by design. Infinity plus a number other than infinity remains infinity, et cetera. You should use floating-point infinities to represent mathematical infinities. You should avoid using floating-point infinities to represent overflows, unless you do not care about losing the values that overflow. Floating-point NaNs are intended to represent “not a number”. They are often used to represent something like “An error occurred, so we do not have a number here to give you. You should do something else in this place.” Then it is up to the application to supply the something else, perhaps by having supplementary information from another source or in a parallel data structure. Errors include things such as taking the square root of a negative number or failing to initialize some data. (E.g., some underlying software initializes floating-point data to NaNs, so that, if you do not initialize it yourself, NaNs remain.) You should generally treat NaNs as “empty places” that you must not use rather than as tokens representing something.

  3. When writing and reading floating-point values, you should take care to convert the values exactly or ensure that the errors you introduce in conversion are tolerable. If you must convert to text (human-readable numerals) rather than writing in “binary” (bytes with arbitrary values), then it may be preferable to write in a notation that uses a numeric base compatible with the native radix of the floating-point system (e.g., hexadecimal floating-point numerals for binary floating-point representations, such as 0x3.4p-2 for .8125). If this is not feasible, then you need to produce enough digits (when converting to decimal) to represent the floating-point value accurately enough to recover the original value when reading it, and you need to ensure the conversion software converts without introducing additional errors. You must also handle special values such as infinities and NaNs.
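
As a minimal sketch of the serialization point, assuming standard Java (8 or later) and using hypothetical helper names (DoubleRoundTrip, toDecimal, toHex, fromText, sameBits) that are not part of the question, the following round-trips a few values through both decimal and hexadecimal floating-point text. It also tries the bit-pattern string from the question, which is a long literal suitable for Double.longBitsToDouble rather than text that Double.parseDouble accepts.

```java
class DoubleRoundTrip {
    /** Decimal text; Double.toString emits enough digits to identify the value uniquely. */
    static String toDecimal(double d) {
        return Double.toString(d);
    }

    /** Hexadecimal floating-point text, e.g. 0x1.ap-1 for 0.8125. */
    static String toHex(double d) {
        return Double.toHexString(d);
    }

    /** Reads either form back; parseDouble also accepts "NaN", "Infinity" and "-Infinity". */
    static double fromText(String s) {
        return Double.parseDouble(s);
    }

    /** True if both values have the same canonical bit pattern (so NaN matches NaN). */
    static boolean sameBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        double[] samples = {7.8, 0.8125, Double.MAX_VALUE, Double.POSITIVE_INFINITY, Double.NaN};
        for (double d : samples) {
            String dec = toDecimal(d);
            String hex = toHex(d);
            boolean ok = sameBits(d, fromText(dec)) && sameBits(d, fromText(hex));
            System.out.println(dec + " | " + hex + " | round trip ok: " + ok);
        }

        // The bit pattern from the question works as a long, not as parseable text.
        System.out.println(Double.longBitsToDouble(0x7ff0000000000000L) == Double.POSITIVE_INFINITY);
        try {
            Double.parseDouble("0x7ff0000000000000L");
        } catch (NumberFormatException e) {
            System.out.println("not parseable as text: " + e.getMessage());
        }
    }
}
```

In a CSV file the decimal form is probably the more readable choice; the hexadecimal form is only worth the loss of readability if the file is consumed by software that also understands it.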

(Note that Math.tan(Math.PI/2) is not infinity and does not cause an exception because Math.PI/2 is not exactly π/2, so its tangent is finite, not infinity.)
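
The examples listed in the question can be checked directly. The sketch below assumes standard Java (8 or later); the class SpecialValues and the helper parseOrNaN are hypothetical names introduced for illustration, and mapping unparseable text to NaN is only one possible policy, not a recommendation from the answer.

```java
class SpecialValues {
    /** Hypothetical helper: parse a cell, mapping unparseable text such as ">10" to NaN. */
    static double parseOrNaN(String text) {
        try {
            return Double.parseDouble(text);
        } catch (NumberFormatException e) {
            return Double.NaN;
        }
    }

    public static void main(String[] args) {
        double overflow = parseOrNaN("1.0E+10000");  // too large for a double: parses to Infinity
        double fuzzy = parseOrNaN(">10");            // not a numeral: NaN under this policy
        double divZero = 10 / 0.0;                   // Infinity, no exception
        double sqrtNeg = Math.sqrt(-1.0);            // NaN, no exception
        double tan = Math.tan(Math.PI / 2.0);        // very large but finite

        System.out.println("overflow infinite: " + Double.isInfinite(overflow));
        System.out.println("fuzzy is NaN:      " + Double.isNaN(fuzzy));
        System.out.println("10 / 0.0 infinite: " + Double.isInfinite(divZero));
        System.out.println("sqrt(-1) is NaN:   " + Double.isNaN(sqrtNeg));
        System.out.println("tan(PI/2) finite:  " + Double.isFinite(tan));

        // Chaining: infinities and NaNs propagate through later arithmetic.
        System.out.println(divZero + 1.0);           // still infinite
        System.out.println(divZero - divZero);       // NaN
        System.out.println(sqrtNeg == sqrtNeg);      // false: NaN never equals itself
    }
}
```

Note that the overflowing numeral silently becomes an infinity, which is the loss of information the answer warns about; keeping the original cell text alongside the parsed value avoids it.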

Eric Postpischil
  • Thanks, I agree the problem is somewhat undefined though it's a common one. Your analysis is very helpful. [NB I meant to write Math.PI/2.0 - of course it doesn't affect your answer's point]. – peter.murray.rust Feb 26 '13 at 18:02