5

I am looking at regexes to validate and parse well-known text, which is a format used to transfer spatial data and looks like:

POLYGON((51.124 -3.973, 51.1 -3.012, ....))

or

MULTIPOLYGON(((POLYGON((51.124 -3.973, 51.1 -3.012, ....)),POLYGON((50.14 -13.973, 51.1 -13.012, ....))

among other variations.

There is a good answer here: Parsing a WKT-file which uses the regex:

\d+(?:\.\d*)?

From other places I have also seen

\d*\.\d+|\d+

and

(\d*\.)?\d+

These all seem to do the same thing, but it got me wondering about the relative workings of these 3 regexes, and if there are any performance issues or subtleties under the hood to be aware of.

To be clear, I am aware that there are libraries for parsing WKT in various languages. My question is purely about the relative behavior of number extracting regexes.

Community
  • 1
  • 1
John Powell
  • 12,253
  • 6
  • 59
  • 67
  • Do you already know regex? i.e. you understand the different parts of each regex in your post to a basic level? – MDEV Mar 12 '14 at 15:26
  • Yes, I understand the notion of matching and non-matching groups, the meanings of the various quantifiers, etc, but feel I am missing something deeper. – John Powell Mar 12 '14 at 15:38

1 Answers1

6

It depends what number formats you need to allow, example:

 format 1:   22
 format 2:   22.2
 format 3:   .2
 format 4:   2.
  • the 1st pattern \d+(?:\.\d*)? matches 1,2,4
  • the 2nd pattern \d*\.\d+|\d+ matches 1,2,3
  • the 3rd pattern (\d*\.)?\d+ matches 1,2,3 (and have an uneeded capturing group)

Note: pattern 2 and 3 are slower to succeed than the first if the number is an integer, because they must match all digits until the dot, backtrack to the start and retry the same digits one more time. (see the schema below)

str  |  pattern       |  state
-----+----------------+-----------------------------
123  |  \d*\.\d+|\d+  |  START
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  FAIL => backtrack
123  |  \d*\.\d+|\d+  |  go to the next alternative
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK
123  |  \d*\.\d+|\d+  |  OK => SUCCESS

if you want to match the four cases, you can use:

\.\d+|\d+(?:\.\d*)?

(+) if the number doesn't begin with a dot, the first alternative fails immediatly and the second alternative will match all other cases. The backtracking is limited to the minimum.
(-) if you have few numbers that start with a dot the first alternative will be tested and will fail each times. However, the first alternative fails quickly.(in other words, for the same reason). In this case, it is better to use \d+(?:\.\d*)?|\.\d+

Obviously, if you want to support negative values you need to add -?:

-?(?:\.\d+|\d+(?:\.\d*)?)
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • Thanks, that is exactly the kind of answer I was hoping for. – John Powell Mar 12 '14 at 17:07
  • WKT data can be very large, so speed does matter. However, lat/lon data can be represented like this, and London for a start, is in a zone where the number might start with a dot, though I've only ever seen it as 0.xxx, but definitely worth considering. – John Powell Mar 12 '14 at 17:16
  • @JohnBarça: in this case `0.xxx` the first pattern `\d+(\.\d*)?` is from far the best. – Casimir et Hippolyte Mar 12 '14 at 17:34
  • Can the final pattern be simplified to `-?(?:\d*(?:\.\d*)?)` or is there a specific reason we are separating the leading digit match into a separate character class match? – JLunda Feb 25 '22 at 18:37
  • @JLunda: no, because your pattern matches the empty string or `-` or `.` or `-.`. Other patterns ensure at least one digit. – Casimir et Hippolyte Feb 26 '22 at 17:33
  • @JLunda: note that `-?\d*\.?\d*\b` ensures at least one digit. (if word character doesn't follow). – Casimir et Hippolyte Feb 28 '22 at 08:20