Background: I'm trying to incrementally parse expressions like "cos(1.2)". Now, to the actual question (note that the actual question is mostly in the next paragraph; the rest is rambling about solutions that seem to almost work):

Suppose I have a String in Java which might start with a floating point number, and then has some more "stuff" after it. For instance, I might have 52hi (which starts with "52", and ends with "hi"), or -1.2e1e9 (which starts with "-1.2e1", also known as "negative twelve" and ends with "e9"). I want to parse this number into a double.

It's tempting to use Double.parseDouble, but this method expects the string as a whole to be a valid number, and throws an exception if not. The obvious thing to do is write a regular expression to separate out the number from the other stuff, and then use parseDouble.
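To make the all-or-nothing behaviour concrete, here is a minimal sketch. The class and method names (`WholeStringOnly`, `parsesAsWhole`) are made up for illustration; only `Double.parseDouble` and `NumberFormatException` come from the standard library.

```java
class WholeStringOnly {
    // Hypothetical helper: true if Double.parseDouble accepts the entire string.
    static boolean parsesAsWhole(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            // parseDouble rejects the string as a whole; it does not report
            // how much of the prefix would have been a valid number.
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"52", "52hi", "-1.2e1", "-1.2e1e9"}) {
            System.out.println(s + " -> " + parsesAsWhole(s));
        }
    }
}
```

The exception carries a message but no parse position, which is why the prefix has to be found some other way.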

If I was parsing integers, this wouldn't be too bad, something like -?[0-9]+. (Even then, it's easy to forget an edge case and now your users are not able to enter +9 for symmetry with -9. So the preceding regex should have been [-+]?[0-9]+.) But for floats it's complicated; maybe something like this (ignore the fact that "." is not taken literally by default in most regex dialects):

[-+]?[0-9]*.?[0-9]*(e[-+]?[0-9]+)?

Except we just said that an empty string is a valid number. And so is ".e2". So probably something a bit more complicated. Or maybe I could have a "sloppy" regex like above that allows some non-numbers as long as it doesn't forbid any actual numbers. But at some point I start thinking to myself "isn't this supposed to be parseDouble's job?". It's doing most of the work needed to find out where in the string the number ends and other stuff begins, because otherwise it wouldn't be able to throw the exception. Why should I have to do it as well?
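A small sketch of the mismatch described above, with the "." escaped so that it is taken literally. The class and helper names are hypothetical; the pattern is the sloppy one from this question, not a recommended float regex.

```java
import java.util.regex.Pattern;

class SloppyRegex {
    // The sloppy pattern from above, with the dot escaped.
    static final Pattern SLOPPY =
            Pattern.compile("[-+]?[0-9]*\\.?[0-9]*(e[-+]?[0-9]+)?");

    // True if the sloppy regex matches the whole string.
    static boolean regexAccepts(String s) {
        return SLOPPY.matcher(s).matches();
    }

    // True if Double.parseDouble accepts the whole string.
    static boolean javaAccepts(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", ".e2", "-1.2e1", "52"}) {
            System.out.println("\"" + s + "\" regex=" + regexAccepts(s)
                    + " parseDouble=" + javaAccepts(s));
        }
    }
}
```

The empty string and ".e2" pass the regex but are rejected by `parseDouble`, so the regex alone cannot decide where the number ends.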

So I started looking to see whether there was anything else in the Java standard library that could help. My usual tool of choice is java.util.Scanner, which has a nice nextDouble() method. But Scanner works on "tokens", so nextDouble really means "get the next token and try to parse it as a double". Tokens are separated by delimiters, which by default is whitespace. So Scanner would have no trouble with "52 hi", but wouldn't work with "52hi". In theory, the delimiter can be any regular expression I choose, so all I have to do is concoct a regular expression that, when it matches, signifies the end of a number. But this seems even harder to do than directly writing a regular expression.
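The token behaviour can be seen in a short sketch. The class and method names are made up for illustration, and the locale is pinned to `Locale.US` as an assumption so that "." is the decimal separator.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerTokens {
    // Hypothetical helper: true if the first whitespace-delimited token
    // of s can be read as a double by Scanner.
    static boolean firstTokenIsDouble(String s) {
        try (Scanner sc = new Scanner(s)) {
            sc.useLocale(Locale.US); // assumption: '.' as decimal separator
            return sc.hasNextDouble();
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTokenIsDouble("52 hi")); // token is "52"
        System.out.println(firstTokenIsDouble("52hi"));  // token is "52hi"
    }
}
```

With the default delimiter, "52hi" is a single token, so Scanner never gets the chance to split the number from the rest.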

I was about to give up hope when I found java.text.DecimalFormat, which explicitly says "I'll parse as far as I can, and I'll tell you how far I got so you can continue doing something else from that point". But it seems that it was primarily designed to format things for human consumption, and maybe parse things written by machines, but not to parse things written by humans, and it shows up in a bunch of little ways. For example, it "supports" scientific notation like "1.2e1", but if you use it, it will insist that the number must be in scientific notation and fail the parse if you enter "12" instead. One could try working around this by checking the spot where it failed and parsing just the stuff before that as a number, but this is error-prone and even more annoying than just writing a regex for floats.
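For reference, a minimal sketch of the "parse as far as I can" interface, using `DecimalFormat.parse(String, ParsePosition)`. The class name, the helper name, the pattern "#.#" and the choice of `Locale.ROOT` symbols are all assumptions made for this illustration. It only covers plain decimals and does not try to demonstrate the scientific-notation behaviour described above.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;

class PartialParse {
    // Hypothetical helper: index just past the number at the start of s,
    // or -1 if s does not start with something DecimalFormat can parse.
    static int endOfNumber(String s) {
        DecimalFormat fmt =
                new DecimalFormat("#.#", DecimalFormatSymbols.getInstance(Locale.ROOT));
        ParsePosition pos = new ParsePosition(0);
        Number n = fmt.parse(s, pos); // returns null on failure instead of throwing
        return n == null ? -1 : pos.getIndex();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"52hi", "-1.5xyz", "hi"}) {
            int end = endOfNumber(s);
            System.out.println(s + " -> " + (end < 0 ? "no number" : s.substring(0, end)));
        }
    }
}
```

On failure the `ParsePosition` also exposes `getErrorIndex()`, which is the "spot where it failed" mentioned above.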

Meanwhile in C, this would simply be sscanf("%f"), and in C++ you can use a string stream to do basically the same thing. Is there really no equivalent in Java?

Mark VY
  • 1
    You can copy&paste code from JDK which performs actual parsing and replace code that throws NumberFormatException with your code. Code for latest Java is [here](https://github.com/openjdk/jdk/blob/99bf89c581a4fa57e0cdfeeb2c09c9c7f9349e4a/src/java.base/share/classes/jdk/internal/math/FloatingDecimal.java#L1830). – vbezhenar Feb 27 '20 at 21:05
  • That would require understanding the code very well, to make sure I don't accidentally consume any extra text, or too little. (For instance, 12e-9 is a number, but 12e-q should be "12" and then "e-q"; this is where sscanf doesn't do so well.) My complaint is that I want to outsource the entire parsing job to Java's standard library, and I'm annoyed that I have to do part of it myself. This way means I have to do ALL of it myself. The program I'm writing is pretty small. The code for Java's FloatingDecimal is more lines than the entire rest of my program. – Mark VY Feb 27 '20 at 21:18
  • 1
    @MarkVY: In C you should use `strtod`, not `sscanf`. `sscanf` has to mimic the behaviour of `fscanf`, and `fscanf` is defined in a way which works with non-rewindable streams (so it can only peek at the next character). `strtod` stores the end of the parsed number into an output parameter, and is defined as parsing the longest prefix which can be interpreted as a number, so it will get `12e-q` right (as well as many other corner cases). I don't know anything about Java, though :-( – rici Feb 28 '20 at 14:42
  • Thanks, I never knew that! – Mark VY Feb 29 '20 at 16:50

1 Answer

The documentation for Double.valueOf(String) actually includes a regex that you can use to check whether a string is a double.

Here it is, without the comments:

final String Digits     = "(\\p{Digit}+)";
final String HexDigits  = "(\\p{XDigit}+)";
final String Exp        = "[eE][+-]?"+Digits;
final String fpRegex    =
        ("[\\x00-\\x20]*"+
                "[+-]?(" +
                "NaN|"+
                "Infinity|" +
                "((("+Digits+"(\\.)?("+Digits+"?)("+Exp+")?)|"+
                "(\\.("+Digits+")("+Exp+")?)|"+
                "((" +
                "(0[xX]" + HexDigits + "(\\.)?)|" +
                "(0[xX]" + HexDigits + "?(\\.)" + HexDigits + ")" +
                ")[pP][+-]?" + Digits + "))" +
                "[fFdD]?))" +
                "[\\x00-\\x20]*");

You can use this like this:

Matcher m = Pattern.compile(fpRegex).matcher(input);
if (m.find()) {
    String doublePartOnly = m.group();
}

Through some basic testing, I found that the regex is greedy, so it will match 1.2e1 in 1.2e1hello, as opposed to just 1.2.
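Since the question is about a number at the start of the string, `Matcher.lookingAt()` may be a better fit than `find()`: it only matches at the beginning, and `end()` then tells you where the remaining text starts. Below is a minimal sketch. The class and method names are hypothetical, and the pattern is a simplified, decimal-only subset written for this illustration (no NaN, Infinity, hex floats, type suffixes or surrounding whitespace), not the full regex from the docs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberPrefix {
    // Assumption: a decimal-only simplification of the documented grammar.
    private static final Pattern DECIMAL =
            Pattern.compile("[+-]?(?:\\d+\\.?\\d*|\\.\\d+)(?:[eE][+-]?\\d+)?");

    // Length of the numeric prefix of s, or 0 if s does not start with a number.
    static int prefixLength(String s) {
        Matcher m = DECIMAL.matcher(s);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1.2e1hello", "-1.2e1e9", "12e-q", "cos(1.2)"}) {
            int end = prefixLength(s);
            if (end == 0) {
                System.out.println(s + " -> no leading number");
            } else {
                double value = Double.parseDouble(s.substring(0, end));
                System.out.println(s + " -> " + value + ", rest=\"" + s.substring(end) + "\"");
            }
        }
    }
}
```

With `find()`, "cos(1.2)" would match the "1.2" in the middle of the string. With `lookingAt()` it reports no leading number, which is what an incremental parser positioned at "cos" would want.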

Sweeper
  • Indeed it does have it in the docs, but if I put it in my code, I've failed to fully outsource numeric parsing to Java. Just to give one silly example: imagine I'd done this back in the Java 6 days. When Java got support for hex floats, my code still couldn't parse them, despite parseDouble handling them just fine. Maybe that's okay; hex floats are weird and I don't really care about them. But this example highlights the point that my code would be doing half the job (namely: where does the number end?) and Java would be doing the rest (namely: which double should we use for this String?) – Mark VY Feb 27 '20 at 20:59
  • 1
    The sub-problem that we need to solve here is "how do I know if a string can be parsed to a `Double`?". Once we know that, we can just keep substring-ing and see which substring can be parsed. According to [here](https://stackoverflow.com/questions/3133770/how-to-find-out-if-the-value-contained-in-a-string-is-double-or-not), the only way to solve the sub-problem while also outsourcing the parsing to Java seems to be to `try...catch` an exception. @MarkVY – Sweeper Feb 27 '20 at 21:13
  • That's actually pretty clever, and I'm annoyed that I didn't think of that :) But yikes! That seems terribly inefficient. If I have a String that's 100 characters long, and it starts with a number that's 10 characters long, then ideally I'd like to process just the first 11 characters or so to extract the number. This way involves (1) passing the whole string to parseDouble and catching the exception, followed by (2) passing just the first 99 chars, and then (3) the first 98 chars, and so on. Eventually we'll have done O(n^2) work just creating all these substrings! Still clever :) – Mark VY Feb 27 '20 at 21:25
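For completeness, the substring idea from the comments could be sketched roughly as follows. The class and method names are hypothetical. As noted above, this does quadratic work in the worst case. It also inherits everything `Double.parseDouble` accepts, so a prefix like "1d" (type suffix) or "5 " (trailing whitespace) counts as a number here.

```java
class LongestPrefix {
    // Hypothetical helper: length of the longest prefix of s that
    // Double.parseDouble accepts, or 0 if no prefix parses.
    static int longestParsablePrefix(String s) {
        for (int end = s.length(); end > 0; end--) {
            try {
                Double.parseDouble(s.substring(0, end));
                return end; // longest prefix first, so the first success wins
            } catch (NumberFormatException e) {
                // Not a number yet; drop one more character and retry.
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"52hi", "-1.2e1e9", "12e-q", "hi"}) {
            int end = longestParsablePrefix(s);
            System.out.println(s + " -> \"" + s.substring(0, end)
                    + "\" + \"" + s.substring(end) + "\"");
        }
    }
}
```

Capping the starting value of `end` at some maximum plausible number length would bound the cost, at the price of one more hand-picked constant.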