
I am writing a regular expression for real numbers. Why doesn't it recognize the exponent symbols e and E?

^[-+]?[0-9]*[.,]?([eE][-+][0-9])?[0-9]*$

Example:

-12.12
012,123
.123
,512
342.4E-1
--12.12
12
Rume One

1 Answer


Your E-matching fragment, ([eE][-+][0-9]), comes before the 'fraction'-matching fragment, so it is in the wrong place. That expression needs to be at the end. As written, it does match 3E+1, but it cannot match fraction digits followed by an exponent, as in 342.4E-1, and that's what you want.

You have:

^[-+]?[0-9]*[.,]?([eE][-+][0-9])?[0-9]*$

It should be more like:

^[-+]?[0-9]*[.,]?[0-9]*([eE][-+]?[0-9]+)?$

Note that I added a + so that the exponent must have at least one and may have many digits. And I added a ? so that the exponent sign is optional.
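To see the effect of moving the exponent group, here is a minimal sketch that runs both expressions over the sample values from the question. It uses Python's re module rather than a POSIX ERE engine, which is an assumption on my part; the two patterns only use syntax that both dialects share. The names ORIGINAL, CORRECTED and matching are illustrative only.

```python
import re

# The question's pattern: the exponent group sits before the fraction digits.
ORIGINAL = re.compile(r'^[-+]?[0-9]*[.,]?([eE][-+][0-9])?[0-9]*$')
# The rearranged pattern: exponent group at the end, optional sign, 1+ digits.
CORRECTED = re.compile(r'^[-+]?[0-9]*[.,]?[0-9]*([eE][-+]?[0-9]+)?$')

SAMPLES = ['-12.12', '012,123', '.123', ',512', '342.4E-1', '--12.12', '12']


def matching(pattern, lines):
    """Return the lines that the compiled pattern accepts."""
    return [line for line in lines if pattern.match(line)]


# The original pattern rejects 342.4E-1 (and, correctly, --12.12).
print(matching(ORIGINAL, SAMPLES))
# ['-12.12', '012,123', '.123', ',512', '12']

# The corrected pattern accepts everything except --12.12.
print(matching(CORRECTED, SAMPLES))
# ['-12.12', '012,123', '.123', ',512', '342.4E-1', '12']
```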

Given the data file:

-12.12
012,123
.123
,512
342.4E-1
--12.12
12

running the expression using grep -E, I get the output:

$ grep -nE -e '^[-+]?[0-9]*[.,]?[0-9]*([eE][-+]?[0-9]+)?$' data
1:-12.12
2:012,123
3:.123
4:,512
5:342.4E-1
7:12
$

Using grep -E means that the RE is being interpreted as a POSIX Extended Regular Expression (ERE). The RE above matches an empty string. This is probably undesirable — but (as noted in a comment) can be fixed using an ERE:

^[-+]?([0-9]+[,.]?[0-9]*|[,.][0-9]+)([eE][-+]?[0-9]+)?$

The segment ([0-9]+[,.]?[0-9]*|[,.][0-9]+) looks for:

  • Either a string of one or more digits, optionally followed by a decimal point (either comma or dot) and zero or more fractional digits,
  • Or a string starting with a decimal point and followed by one or more fractional digits.

This requires at least one digit — it rules out empty lines (and oddball cases like E+12 being a valid number).
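A quick way to check the 'at least one digit' behaviour is sketched below. Again this uses Python's re module as a stand-in for an ERE engine (an assumption; the pattern itself is unchanged), and the names STRICT and is_number are made up for the illustration.

```python
import re

# Same expression as above: digits are mandatory either before or after
# the decimal point, and the exponent part stays optional.
STRICT = re.compile(r'^[-+]?([0-9]+[,.]?[0-9]*|[,.][0-9]+)([eE][-+]?[0-9]+)?$')


def is_number(text):
    """True if text is accepted by the stricter expression."""
    return STRICT.match(text) is not None


accepted = ['-12.12', '012,123', '.123', ',512', '342.4E-1', '12', '12.', '3E+1']
rejected = ['', '.', '+', 'E+12', '--12.12', '1.2.3']

for text in accepted:
    assert is_number(text), text
for text in rejected:
    assert not is_number(text), text
print('all checks passed')
```

Note that a trailing decimal point such as 12. is still accepted, because the fractional digits after a leading digit string are optional.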

When your RE is matching the dash character in a character class, you need to be careful. A simple rule is to always place the dash immediately after the start of the character class, that is, right after [ or [^. The other safe position is just before the ] at the end of the character class. Life gets trickier if you also need to match ] (and [) as part of the character class: you end up with things like:

[^][-]

which is a negated character class: it matches any single character except ], [ and -. §9.3.5 ¶7 of the POSIX specification covers some of these points. Yes, considerable care is required.
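The dash placement rules can be illustrated with a small sketch. It uses Python's re module, which is an assumption (the question does not name a dialect), but Python treats a leading or trailing dash and a leading ] in a character class the same way for these cases. The pattern names are illustrative, and the range example relies on ASCII code point order.

```python
import re

# Dash first or last in the class: a literal dash, no range is formed.
SIGN_FIRST = re.compile(r'[-+]')
SIGN_LAST = re.compile(r'[+-]')

# Dash between two characters forms a range. Here it runs from '+' (0x2B)
# to '.' (0x2E), which silently includes ',' (0x2C) and '-' (0x2D).
RANGE = re.compile(r'[+-.]')

# Negated class: any character except ']', '[' and '-'.
NOT_BRACKET_OR_DASH = re.compile(r'[^][-]')

assert SIGN_FIRST.fullmatch('-') and SIGN_FIRST.fullmatch('+')
assert SIGN_LAST.fullmatch('-') and SIGN_LAST.fullmatch('+')
assert not SIGN_FIRST.fullmatch(',')   # comma is not in [-+]
assert RANGE.fullmatch(',')            # but the range [+-.] does include it

print(NOT_BRACKET_OR_DASH.findall('a[b]-c'))
# ['a', 'b', 'c']
```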

Jonathan Leffler
  • Just a remark: since a dash is used in a character class to denote a range, it's safer to put it at the end, or to backslash it. And if everything in a regex is optional, then it will also match empty lines. That can be solved with a lookahead, for example `^(?=.*[0-9])[+-]*[0-9]*(?:[.,][0-9]+)?(?:[eE][+-]?[0-9]+)?$` – LukStorms Mar 18 '22 at 16:55
  • Agree that caution is needed when matching a dash, but putting it immediately after the `[` or `[^` is safe. Your 'avoid empty lines' regex uses PCRE notation — it is not clear from the question which dialect of RE is in use, but it fits in with a POSIX ERE (extended regular expression). It should be possible to stay with an ERE and use a variant of `([0-9]+[,.]?[0-9]*|[,.][0-9]+)`, I think, to demand at least one digit — but I haven't verified that regex against the possible inputs. – Jonathan Leffler Mar 18 '22 at 17:00