0

I am writing a parser where I need to parse numbers.

Numbers are in the JSON format, i.e. simplified, something like:

[+|-] integral [. fractional] [e|E exponent]

So these are the 3 parts of the number I have. Each part is an integer.

For example:

0.4
Integral = 0
Fractional = 4
Exponent = 0

2.84e+6
Integral = 2
Fractional = 84
Exponent = 6

I know how to compute the number in Java in a very crude manner. Something like this:

long integral;
long fractional;
int exp;

double decimal = fractional;
while (decimal >= 1.0) {
  decimal *= 0.1;
}

var n = (integral + decimal) * Math.pow(10, exp);

But this has terrible properties like losing precision completely, overflowing etc. specially for very large numbers (e.g. 10e99, 2.7e-23).

Is there a way to create a number in Java from its constituent parts such that the number keeps the properties as close as possible to the floating point literal?

My current idea is to try to create the exact bits in memory that would represent the floating point number using the IEEE 754 representation Java (and many languages) uses.

This blog post helped me understand the floating point representation, but I am still kind of stuck.

My current try to do that uses a ByteBuffer which can do:

ByteBuffer buffer = ByteBuffer.allocate(8);
buffer.put(<bits representing the number exactly in floating point>);

buffer.flip();
double n = buffer.getDouble();

The "only" problem being, figuring out the right bits, which is not trivial (if someone knows a method to do that , please do share).

EDIT: Performance is important, so I am trying to avoid creating a String then throwing it at Double.parseDouble(str)... as that feels pretty expensive compared to writing the bits into memory... but I might be wrong and this is indeed the most "efficient" option? Really don't know at this point, hopefully I can benchmark when I have a few options.

Renato
  • 12,940
  • 3
  • 54
  • 85
  • The numbers are in *textual* format? That's usually what JSON has. Why aren't you parsing them using `Double.parseDouble()`? – RealSkeptic May 04 '20 at 11:51
  • The numbers are already parsed into `long`s. The Java semantics are different, at least slightly, so even though I could put the longs back together into a String, then call `Double.parseDouble`, I feel like that's far from optimal. – Renato May 04 '20 at 11:53
  • Then why are you parsing them into longs? – RealSkeptic May 04 '20 at 11:54
  • What do you think I should parse them as? – Renato May 04 '20 at 11:55
  • 3
    They come from the JSon as text, why parse them into parts anyway? The text is a representation that can be understood by `Double.parseDouble()`. Instead, you run three `Long.parseLong()` essentially. I don't see why you are doing this. – RealSkeptic May 04 '20 at 11:58
  • Converting decimal numerals into binary floating-point is easy to do using elementary school arithmetic (long division and multiplication) but is wasteful of computing resources. Converting efficiently is hard and is the subject of academic papers. Do not try to do it yourself, use the built-in routines in `parseDouble`. The authors have, hopefully, read those papers and implemented an efficient algorithm. – Eric Postpischil May 04 '20 at 12:27
  • The parser is parsing bytes, not text. If I want to feed that into `Long.parseLong` I still need to create a String from the bytes (which I am not doing currently). Going directly from bytes to a number sounds feasible, having to instantiate a String before I can do that feels very wasteful. Why shouldn't I try to do it myself? I have lots of time to do it, I want to learn the details of how numbers are represented by doing it myself..Please don't tell me what I should do with my time. – Renato May 04 '20 at 12:32
  • @Renato: The reason why you are being directed to use existing code is you are not yet at a level where this problem can be tackled reasonably. To start with, you have an error in the framing of the problem: You say the starting point for this question is that you have three integers representing the integer, fractional part, and exponent. But, judging by the samples shown, that representation is **incapable** of distinguishing 3.123e9 from 3.00123e9, because they have the same representation, (3, 123, 9), and therefore will produce the same result, so one must be incorrect. – Eric Postpischil May 04 '20 at 13:33
  • If you want to start on the path to this problem, then start by studying floating-point arithmetic. The 1985 version of the IEEE-754 standard is fairly readable, and you can study a textbook such as [Handbook of Floating-Point Arithmetic](https://smile.amazon.com/Handbook-Floating-Point-Arithmetic-Jean-Michel-Muller/dp/3319765256/ref=sr_1_1?dchild=1&keywords=floating-point+arithmetic&qid=1588599286&sr=8-1). – Eric Postpischil May 04 '20 at 13:35
  • 1
    In any case, proper discussion of efficient algorithms is beyond what is normally feasible in Stack Overflow. As noted, this is a subject for academic papers. – Eric Postpischil May 04 '20 at 13:42
  • The classic academic paper is William Clinger, [*How to read floating point numbers accurately*](https://dl.acm.org/doi/10.1145/93548.93557) – rici May 04 '20 at 14:18
  • Also see David Gay [“Correctly rounded binary-decimal and decimal-binary conversions”](https://ampl.com/REFS/rounding.pdf), based on Clinger's work, which firms the basis for many actual implementations. Both of these papers date from 1990; there are some newer algorithms for the correct printing of floating point numbers but afaik David Gay's reading algorithm is still state of the art. – rici May 04 '20 at 14:32
  • Thank you for the pointers, but you assume that because there's a small error in framing the problem, I am not at a level I can tackle the problem. That's not acceptable in my opinion. You can't know the level I am at, I made a mistake in framing the question, that doesn't mean I should therefore use existing code and stop working on this. I have been thinking and reading about the problem for a few days and I think I will solve the problem eventually much better than "allocate a String and be done with it by calling Long.parseLong". I will post the answer when I get there. – Renato May 04 '20 at 14:45
  • @EricPostpischil I think you unfairly closed the question as the questions you say are duplicates start from the premise I have allocated a String to parse it. I haven't, and I don't accept allocating anything to solve this problem as that 's unnecessary in principle. My challenge is to parse a stream of bytes into a number without allocating objects. I might have made that clearer in the question, but the premise is: I have the components numerically. Not as a String. It's reasonably understood what I mean. – Renato May 04 '20 at 14:50
  • @renato: study the papers I linked, then. You can easily find open source implementations of David Gay's work in various languages, including his original C implementation, which might prove helpful as a guide. (I would have made those an answer but the question was closed before I saw it.) – rici May 04 '20 at 15:01
  • What fraction integer would you get for each of 0.4, 0.04, and 0.004? – Patricia Shanahan May 04 '20 at 15:25
  • The actual values I have are bytes. So `0.04` has fractional part `[48,52]` as the bytes are UTF8-encoded. Sorry for misrepresenting that as `long`. – Renato May 04 '20 at 15:30
  • @Renato: The answers in the linked questions relate or refer to the necessary algorithms. You are free to implement those algorithms without allocating a new `String`. – Eric Postpischil May 04 '20 at 15:35
  • Here's the answer if anyone needs it one day... no need to read 10 papers or whatever... https://gist.github.com/renatoathaydes/e015019c4283d6a54903a49d2887f59f – Renato May 04 '20 at 17:57
  • For my initial conditions of having the number arrive as a stream of bytes, which are typical of any parser, the really simple code shown in this snippet is ~30x~ 10x faster than converting to a String first, then parsing with `Float.parseFloat`. That's expected because numeric operations are extremely fast, but allocating Strings is not. – Renato May 04 '20 at 18:15

0 Answers0