1

Is there a Pythonic way to validate whether a string represents a floating-point number (any input that would be recognizable by float(), e.g. -1.6e3), without converting it (and, ideally, without resorting to throwing and catching exceptions)?

Previous questions have been submitted about how to check if a string represents an integer or a float. Answers suggest using try...except clauses together with the int() and float() built-ins, in a user-defined function.
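For reference, the idiom those answers recommend usually looks something like the sketch below. The name `is_float_eafp` is made up for illustration; only `float()` and `ValueError` come from Python itself.

```python
def is_float_eafp(text):
    """Return True if float() accepts the string, False otherwise."""
    try:
        float(text)  # the conversion result is discarded
    except ValueError:
        # float() raises ValueError for strings it cannot parse
        return False
    return True


print(is_float_eafp("-1.6e3"))
print(is_float_eafp("abc"))
```

Note that the conversion still happens on every call; the result is simply thrown away.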

However, these haven't properly addressed the issue of speed. While using the try...except idiom ties the conversion process to the validation process (to some extent rightfully), applications that go over a large amount of text for validation purposes (schema validators, parsers) will suffer from the overhead of performing the actual conversion. Besides the slowdown due to the conversion itself, there is also the slowdown caused by throwing and catching exceptions. This GitHub gist demonstrates that, compared to user-defined validation only, built-in conversion code is twice as costly (compare True cases), and exception handling time alone (False time minus True time for the try...except version) costs as much as 7 validations. This answers my question for the case of integer numbers.

Valid answers will be: functions that solve the problem in a more efficient way than the try..except method, a reference to documentation for a built-in feature that will allow this in the future, a reference to a Python package that allows this now (and is more efficient than the try..except method), or an explanation pointing to documentation of why such a solution is not Pythonic, or will otherwise never be implemented. Specifically, to prevent clutter, please avoid answers such as 'No' without pointing to official documentation or mailing-list debate, and avoid reiterating the try..except method.

Yuval
  • Have you seen this yet? http://stackoverflow.com/questions/736043/checking-if-a-string-can-be-converted-to-float-in-python – John Apr 04 '16 at 15:13
  • I'm sorry, but I have to ask - have you measured that this check is a bottleneck in your application? – Łukasz Rogalski Apr 04 '16 at 15:13
  • I have, though admittedly I didn't read it to the end. However, the `partition()` approach doesn't work with exponents (though I might be able to make it work), and the accepted answer there is the `try..except` code. – Yuval Apr 04 '16 at 15:15
  • @Rogalski - nope. And it probably isn't. But that's not the question. – Yuval Apr 04 '16 at 15:16
  • you *could* probably use a regular expression... or whatever parsing functionality Python uses to determine if a static number is a floating point or not. Don't know if that's more efficient than try/except, though – Wayne Werner Apr 04 '16 at 15:20
  • @Yuval But it IS the question: your question supposes the try-except approach causes too much overhead. Without actually measuring that supposed overhead, the entire question makes no sense – Mr. E Apr 04 '16 at 15:21
  • @Mr.E - see the case for `int()`. – Yuval Apr 04 '16 at 15:22

4 Answers

4

As @John mentioned in a comment, this appears as an answer in another question, though it is not the accepted answer in that case. Regular expressions and the fastnumbers module are two solutions to this problem.

However, it should be noted (as @en_Knight did) that performance depends largely on the inputs. If expecting mostly valid inputs, then the EAFP approach is faster, and arguably more elegant. If you don't know what input to expect, then LBYL might be more appropriate. Validation, in essence, should expect mostly valid inputs, so try..except is the more appropriate fit for it.

The fact is, for my use case (and as the writer of the question it bears relevance) of identifying types of data in a tabular data file, the try..except method was more appropriate: a column is either all float, or, if it has a non-float value, from that row on it's considered textual, so most of the inputs actually tested for float are valid in either case. I guess all those other answers were on to something.
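A minimal sketch of that column-typing logic, assuming each column arrives as a list of strings. `infer_column_type` is a hypothetical name and the two labels are placeholders, not the original script.

```python
def infer_column_type(values):
    """Label a column "float" if every value parses as a float, else "text"."""
    for value in values:
        try:
            float(value)
        except ValueError:
            # The first non-float value settles the question, so no
            # further values in this column are tested.
            return "text"
    return "float"


print(infer_column_type(["1", "2.5", "-1.6e3"]))
print(infer_column_type(["1", "abc", "3"]))
```

At most one exception is raised per column, which is why the cost of exception handling matters little in this use case. An empty column comes out as "float" here; that is an arbitrary choice of the sketch.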

Back to the answer: fastnumbers and regular expressions are still appealing solutions for the general case. Specifically, the fastnumbers package seems to work well for all values except special ones, such as Infinity, Inf and NaN, as demonstrated in this GitHub gist. The same goes for the simple regular expression from the aforementioned answer (modified slightly: the trailing \b was removed, as it would cause some inputs to fail):

^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$

A bulkier version, that does recognize the special values, was used in the gist, and has equal performance:

^[-+]?(?:[Nn][Aa][Nn]|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$
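In Python, the bulkier pattern can be compiled once and reused, roughly as sketched below. `FLOAT_RE` and `looks_like_float` are names invented for this example; the pattern itself is the one above, split over two lines only for readability.

```python
import re

# The pattern from the answer, compiled once at import time.
FLOAT_RE = re.compile(
    r"^[-+]?(?:[Nn][Aa][Nn]|[Ii][Nn][Ff](?:[Ii][Nn][Ii][Tt][Yy])?"
    r"|(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?)$"
)


def looks_like_float(text):
    """Return True if the string matches the float pattern."""
    return FLOAT_RE.match(text) is not None


for sample in ("-1.6e3", ".5", "NaN", "-inf", "1e", "abc", ""):
    print(repr(sample), looks_like_float(sample))
```

The pattern is somewhat stricter than `float()`: for example, `float(" 1.5 ")` succeeds because surrounding whitespace is stripped, while the pattern rejects that string.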

The regular expression implementation is ~2.8 times slower than try..except on valid inputs, but ~2.2 times faster on invalid inputs. Invalid inputs run ~5 times slower than valid ones using try..except, whereas with regular expressions they run ~1.3 times faster than valid ones. Given these results, regular expressions are favorable when 40% or more of the expected inputs are invalid.
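The 40% figure can be reproduced from the ratios above. The sketch below normalizes the try..except time on a valid input to 1 (an arbitrary unit chosen for illustration) and solves for the fraction of invalid inputs at which both methods cost the same.

```python
# Relative costs, with try..except on a valid input as the unit.
try_valid = 1.0
try_invalid = 5.0 * try_valid   # invalid is ~5x slower with try..except
re_valid = 2.8 * try_valid      # regex is ~2.8x slower on valid inputs
re_invalid = try_invalid / 2.2  # regex is ~2.2x faster on invalid inputs

# Solve (1 - p) * try_valid + p * try_invalid
#    == (1 - p) * re_valid  + p * re_invalid   for p.
extra_cost_on_valid = re_valid - try_valid
saving_on_invalid = try_invalid - re_invalid
break_even = extra_cost_on_valid / (extra_cost_on_valid + saving_on_invalid)

print(f"break-even fraction of invalid inputs: {break_even:.3f}")
```

With these rounded ratios the break-even point lands just under 0.40, in line with the 40% quoted above.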

fastnumbers is merely ~1.2 times faster on valid inputs, but ~6.3 times faster on invalid inputs.

Results are described in the plot below. I ran with 10^6 repeats, with 170 valid inputs and 350 invalid inputs (weighted accordingly, so the average time is per a single input). Colors don't show because boxes are too narrow, but the ones on the left of each column describe timings for valid inputs, while invalid inputs are to the right.

Timings of methods to validate whether a string holds a valid float value, according to whether inputs are valid or invalid
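For anyone wanting to reproduce this kind of comparison, here is a rough `timeit` harness. The input lists are small illustrative samples, not the gist's 170 valid and 350 invalid inputs, and the function names are hypothetical; absolute numbers will vary by machine and Python version.

```python
import re
import timeit

# The simple pattern from the answer (no NaN/Inf support).
SIMPLE_RE = re.compile(
    r"^[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+\b)?$"
)


def by_exception(text):
    try:
        float(text)
    except ValueError:
        return False
    return True


def by_regex(text):
    return SIMPLE_RE.match(text) is not None


VALID = ["1", "-1.6e3", ".5", "3.14", "+2E-7"]
INVALID = ["", "abc", "1e", "1.2.3", "--1"]


def time_per_input(func, inputs, repeats=2000):
    """Average seconds per single input, including loop overhead."""
    total = timeit.timeit(lambda: [func(s) for s in inputs], number=repeats)
    return total / (repeats * len(inputs))


for name, func in (("try..except", by_exception), ("regex", by_regex)):
    valid_t = time_per_input(func, VALID)
    invalid_t = time_per_input(func, INVALID)
    print(f"{name:12s} valid: {valid_t:.2e} s  invalid: {invalid_t:.2e} s")
```

The list comprehension adds the same overhead to every measurement, so only the relative differences between rows are meaningful.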

NOTE: The answer was edited multiple times to reflect comments on the question, on this answer and on other answers. For clarity, edits have been merged. Some of the comments refer to previous versions.

Yuval
  • If regular expressions are a solution, does that mean you're looking for a restricted subset of floats? I find it hard to believe there's a regular expression that is powerful enough to capture floating point conversion in general (True and "1.4" but not "True") – en_Knight Apr 04 '16 at 15:40
  • @en_Knight: regular expressions are plenty powerful enough in this situation. (Again, I have no idea what you're doing with `True`; it's not relevant.) See [here](https://github.com/python/cpython/blob/2.7/Lib/decimal.py#L5901-L5919) for what Python's `decimal` module considers acceptable, which is very close in syntax to what `float` accepts. – Mark Dickinson Apr 04 '16 at 16:47
  • @MarkDickinson the question asks "any input that would be recognizable by float()" . Since float(True) returns 1.0, your implementation should do the same to satisfy this requirement, right? The float method also converts hex, which the above regex appears not to do properly, if I'm using it correctly – en_Knight Apr 04 '16 at 17:09
  • @Yuval what do you mean by X times faster - is this on cases when it passes or when it fails? Try/catch are extremely efficient when no exception is thrown; what distribution of cases did you use? Are you checking the worst case of each (worst case for a regex seems tough to find)? – en_Knight Apr 04 '16 at 17:31
  • @en_Knight - you can look at the gist (linked from the answer) for the list of inputs. It's a mix of valid and invalid inputs. 'X times faster' means that if the `try..except` method takes 4.5 seconds, the regex method takes 3 seconds (for the same inputs), and the fastnumbers method takes about 1 second. Re hex input, I'm unaware of any special handling in `float()`. Are you referring to `float.fromhex()`? – Yuval Apr 04 '16 at 22:06
  • @Yuval 1. okay sounds fair; note it might be a pretty biased unit test. When I test ["1"]*10000 as an input, the try/catch is 1.5x faster than the regex approach. It depends what your input is; if you expect a lot of misses, then I believe you that the regex is faster - if you expect only a few misses, then it seems like a misleading test. 2. float(0x0) returns 0.0. If you're trying to get identical behaviour to the try/catch, the regex isn't handling that – en_Knight Apr 04 '16 at 22:23
  • @en_Knight `>>> float('0x0')` --> `ValueError: invalid literal for float(): 0x0`. 0x0 without quotes is a number literal, which isn't relevant to this question as an input. – Yuval Apr 05 '16 at 06:50
0

If being Pythonic is the justification, then you should just stick to The Zen of Python. Specifically, these lines:

Explicit is better than implicit.

Simple is better than complex.

Readability counts.

There should be one-- and preferably only one --obvious way to do it.

If the implementation is hard to explain, it's a bad idea.

All of those favour the try-except approach. The conversion is explicit, simple, readable, obvious and easy to explain.

Also, the only way to know if something is a float number is to test whether it's a float number. This may sound redundant, but it's not.

Now, if the main problem is speed when testing a large number of supposed float numbers, you could use a C extension with Cython to test all of them at once. But I don't really think it will give you much of a speed improvement unless the number of strings to test is really big.

Edit:

Python developers tend to prefer the EAFP approach (Easier to Ask for Forgiveness than Permission), making the try-except approach more Pythonic (I can't find the PEP).

And here (Cost of exception handlers in Python) is a comparison of the try-except approach against if-then. It turns out that in Python exception handling is not as expensive as it is in other languages, and it's only more expensive in the case that an exception must be handled. And in general use cases you won't be validating strings with a high probability of not actually being float numbers (unless your specific scenario has this case).

Again, as I said in a comment: the entire question doesn't make much sense without a specific use case, data to test and a measure of time. Speaking of the most generic use case, try-except is the way to go; if you have some actual need that can't be satisfied fast enough with it, then you should add it to the question.

Mr. E
  • I agree that there is benefit to making conversion and validation one, and I mentioned that. But you're confusing *testing* whether it's a floating point number and *converting it* to one, and those are distinct, and, in some cases, the user may benefit from validation without conversion, as I've pointed out. Your answer doesn't explain why validation wouldn't be Pythonic. In fact, a try-except approach is more complex, in the case of validation, than an `isfloat()` built-in. Such an implementation is easy to explain, more readable, more obvious and more explicit. – Yuval Apr 04 '16 at 16:02
  • While I disagree that specific use-cases are relevant, as this is such a prevalent problem, I've mentioned validating large text files against a schema. Specifically, I'm creating a script to deduce column data types from a tabular data file, so that they may be imported correctly and efficiently into a database. – Yuval Apr 04 '16 at 16:31
  • @Yuval Just updated my answer. And while conversion and validation are not the same, and `isFloat` as a built-in may be simpler, its implementation is neither easy nor readable. If you want that implementation, it's probably inside the `float()` function – Mr. E Apr 04 '16 at 16:32
  • @Yuval In the [Python github repo, under cpython/Objects/floatobject.c](https://github.com/python/cpython/blob/c797daf69edc52385ba78447441e1a65c7cf5730/Objects/floatobject.c#L128) you have how conversion to float is implemented. You can take from there how to do it, I haven't found any module that does it the way you want. So you have those 2 options, implement it yourself using or not cython (For performance reasons) or just stick to the try-except – Mr. E Apr 04 '16 at 16:51
-1

To prove a point: there's not that many conditions that a string has to abide by in order to be float-able. However, checking all those conditions in Python is going to be rather slow.

ALLOWED = "0123456789+-eE."
def is_float(string):
    minuses = string.count("-")
    if minuses == 1 and string[0] != "-":
        return False
    if minuses > 1:
        return False

    pluses = string.count("+")
    if pluses == 1 and string[0] != "+":
        return False
    if pluses > 1:
        return False

    points = string.count(".")
    if points > 1:
        return False

    small_es = string.count("e") 
    large_es = string.count("E")
    es = small_es + large_es
    if es > 1:
        return False
    if (es == 1) and (points == 1):
        if small_es == 1:
            if string.index(".") > string.index("e"):
                return False
        else:
            if string.index(".") > string.index("E"):
                return False

    return all(char in ALLOWED for char in string)

I didn't actually test this, but I'm willing to bet that this is a lot slower than try: float(string); return True; except Exception: return False
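To give a feel for how many conditions a hand-rolled check has to cover, here are a few strings together with what `float()` itself does with them. The helper name `float_accepts` and the choice of samples are illustrative only. For instance, a purely character-counting check like the one above would reject "1e-5" (the minus sign is not the first character) and accept "." (every character is allowed).

```python
def float_accepts(text):
    """Reference verdict: does float() parse this string?"""
    try:
        float(text)
    except ValueError:
        return False
    return True


# Sample strings mapped to float()'s verdict on them.
EDGE_CASES = {
    "1e-5": True,       # sign inside the exponent
    "1.": True,         # trailing decimal point
    ".5": True,         # leading decimal point
    " 1.5 ": True,      # surrounding whitespace is stripped
    "1_000.5": True,    # underscores between digits (Python 3.6+)
    "nan": True,
    "-Infinity": True,
    "": False,
    ".": False,
    "e": False,
    "+": False,
    "1e": False,        # exponent marker without digits
    "1e5.0": False,     # fractional exponent
}

for text, expected in EDGE_CASES.items():
    print(repr(text), float_accepts(text), expected)
```

Running any candidate validator over a table like this is a cheap way to see where it diverges from `float()`.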

acdr
-1

Speedy Solution If You're Sure You Want it

Taking a look at this reference implementation - the conversion to float in python happens in C code and is executed very efficiently. If you really were worried about overhead, you could copy that code verbatim into a custom C extension, but instead of raising the error flag, return a boolean indicating success.

In particular, look at the complicated logic implemented to coerce hex into float. This is done in the C level, with a lot of error cases; it seems highly unlikely there's a shortcut here (note the 40 lines of comments arguing for one particular guarding case), or that any hand-rolled implementation will be faster while preserving these cases.

But... Necessary?

As a hypothetical, this question is interesting, but in the general case one should profile their code to confirm that the try/except method is actually adding overhead. Try/except is often idiomatic and, moreover, can be faster depending on your usage. For example, for-loops in Python use try/except by design.
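To illustrate the for-loop remark: the iterator protocol signals exhaustion by raising `StopIteration`, so a `for` loop behaves roughly like the hand-written loop below. This is only a sketch of the semantics (CPython does the equivalent work in C), and `manual_for` is a made-up name.

```python
def manual_for(iterable):
    """Collect items the way `for item in iterable` would visit them."""
    collected = []
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            # Exhaustion is reported through an exception, and the
            # loop treats it as the normal way to finish.
            break
        collected.append(item)
    return collected


print(manual_for("abc"))
print(manual_for(range(3)))
```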

Alternatives and Why I Don't Like Them

To clarify, the question asks about

any input that would be recognizable by float()

Alternative #1 -- How about a regex

I find it hard to believe that you will get a regex to solve this problem in general. While a regex will be good at capturing float literals, there are a lot of corner cases. Look at all the cases on this answer - does your regex handle NaN? Exponentials? Bools (but not bool strings)?

Alternative #2: Manually Unrolled Python Check:

To summarize the tough cases that need to be captured (which Python natively does)

I also would point you to the case below floating points in the language specification; imaginary numbers. The floating method handles these elegantly by recognizing what they are, but throwing a type error on the conversion. Will your custom method emulate that behaviour?
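The distinction between string and non-string arguments can be checked directly. The sketch below records what `float()` does with a few values and with their string counterparts; `outcome` is a hypothetical helper that returns either the converted value or the name of the exception raised.

```python
def outcome(value):
    """Return float(value), or the name of the exception it raises."""
    try:
        return float(value)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__


SAMPLES = [True, "True", 0x10, "0x10", 1j, "1j"]

for sample in SAMPLES:
    print(f"float({sample!r}) -> {outcome(sample)}")
```

The bool and the integer literal convert, the complex number raises `TypeError`, and all three string forms raise `ValueError`. So for string inputs specifically, none of these cases requires special handling beyond rejecting the string.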

en_Knight
  • Most of this answer should have been posted as a comment. The reference implementation reference is useful, thanks. – Yuval Apr 04 '16 at 15:34
  • Hmm which part? Not using regex seems directly relevant to my answer, since I've written parsers before and "use a regex" is sometimes a useful answer, but really doesn't apply here. There's a lot of discussion on SO whether "profile; it isn't important" is a valid part of an answer, but I feel my answer would be incomplete for future viewers if I didn't include it. For you, you may know this bottleneck applies - I'm less sure for the next person reading this answer – en_Knight Apr 04 '16 at 15:38
  • I have demonstrated the relevance of validation vs. conversion for `int()` in the body of the question. In lack of an alternative implementation (at the time of writing the question), there is no ability to compare one implementation to another. In my answer (which someone downvoted without explaining why, which I found surprising because -- while perhaps being incomplete, is far from unuseful), I added some benchmark timings. – Yuval Apr 04 '16 at 16:34
  • What does "Hex matching" mean here? Python's `float` constructor doesn't accept strings in any sort of hexadecimal notation. Same with booleans: `float("True")` isn't valid. (`float(True)` *is*, but that doesn't seem relevant when the OP is specifically asking about strings.) – Mark Dickinson Apr 04 '16 at 16:39
  • I humbly suggest, to improve your answer: I would rephrase 'please profile' to something like 'does your use-case call for such optimizations'. Specifically, I demonstrated in my question that invalid inputs are relevant and cause the `int` version to run much slower. I would remove all other parts of your answer besides that and the pointer to the reference source code, as the rest is purely speculative in the sense that 'I don't think there is a better solution', and corresponds with the answer I offered (suggesting regexes) rather than my question. – Yuval Apr 04 '16 at 17:12
  • Fair enough, I'll make some edits in a bit (probably leaving in more than you'd like, but I'll see how much I can part with :) ). @MarkDickinson The question opens with "any input that would be recognizable by float()" and that's what I'm answering; the float method *does* accept hex numbers, though not strings with hex in them. I'm answering: I want the same behaviour as try:float(thing);except:pass which is very hard to do without calling that method – en_Knight Apr 04 '16 at 17:23
  • @Yuval hope that's better - I put the "answer" part at the top, and the argument against alternatives as an afterthought. I think that saying why other approaches are not preferable is a valid part of an answer, especially when they seem like good ideas in subcases, but you're welcome to disagree and it is, after all, your question :) – en_Knight Apr 04 '16 at 17:27
  • Thanks @en_Knight. The question is about strings, as its title suggests. Noting that `float()` accepts other input types is useful - though one can easily factor this in with `isinstance()`. Speculation is good, but not in an answer - better to leave your doubts on regexes to actual suggestions for one (like the one I wrote). The section Alternative #2 is incoherent: link is to something else(?), it's unclear what 'Hex matching' is, and what special treatment imaginary number strings receive. The rest repeats Alternative #1, and though a good summary, best left as a comment for the question. – Yuval Apr 04 '16 at 21:56