How can I write a regular expression that spots incorrect usage of a comma in a string? That is: 1. for non-numbers, there must be no space before the comma and exactly 1 space after it; 2. for numbers, a comma is allowed if it is preceded by 1-3 digits and followed by 3 digits.
Some test cases:
- hello, world
- hello,world => incorrect
- hello ,world => incorrect
- 1,234 worlds
- 1,23 worlds => incorrect
- 1,2345 worlds => incorrect
- hello,123 worlds => incorrect
- hello, 1234,567 worlds => incorrect
- hello, 12,34,567 worlds => incorrect
- (new test case) hello 1, 2, and 3 worlds
- (new test case) hello $1,234 worlds
- (new test case) hello $1,2345 worlds => incorrect
- (new test case) hello "1,234" worlds
- (new test case) hello "1,23" worlds => incorrect
So I thought I'd have one regex to capture words with bad syntax via (?![\S\D],[\S\D])
(capture where a non-space/non-digit is followed by a comma and then by another non-space/non-digit), and join that with another regex to capture numbers with bad syntax via (?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))
. Putting those together gets me
preg_match_all("/(?![\S\D],[\S\D])|(?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))/",$str,$syntax_result);
... but obviously it doesn't work. How should it be done?
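A quick check, offered as a sketch rather than a full diagnosis, points at one likely reason the first half cannot work: inside a character class, `\S` and `\D` are combined with OR, so `[\S\D]` accepts every character (a space is a non-digit, and a digit is a non-space). The helper name `matchesClass` is made up for this illustration.

```php
<?php
// Hypothetical helper: does a single character match the class [\S\D]?
function matchesClass(string $char): bool
{
    return preg_match('/^[\S\D]$/', $char) === 1;
}

// A space, a digit, a letter and a comma are all accepted by the class,
// so it cannot tell "non-space" or "non-digit" neighbours apart.
foreach ([' ', '5', 'a', ','] as $char) {
    var_dump(matchesClass($char));
}
```

Separately, `(?! ... )` is a negative lookahead: it only asserts and never consumes text, so on its own it cannot return the offending comma as a match.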
================EDIT================
Thanks to Casimir et Hippolyte's answer below, I got it to work! I've extended the pattern from that answer to take care of more corner cases. I don't know if the syntax I added is the most efficient, but it works for now. I'll update this as more corner cases come up!
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
[\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+)) # comma between words or line break
|
(?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
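For completeness, here is a minimal sketch of how the pattern could be run against some of the test cases. The pattern is copied unchanged from above; the helper `badCommaOffsets` and the `$tests` array are names I made up for this example, and an empty result is taken to mean that every comma in the string is allowed.

```php
<?php
// Pattern copied verbatim from the edit above.
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
[\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+)) # comma between words or line break
|
(?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

// Hypothetical helper: byte offsets of every comma the pattern flags.
function badCommaOffsets(string $pattern, string $str): array
{
    if (preg_match_all($pattern, $str, $m, PREG_OFFSET_CAPTURE) === false) {
        return []; // treat a regex error as "nothing flagged" in this sketch
    }
    // $m[0] holds [matched text, offset] pairs for the full matches.
    return array_column($m[0], 1);
}

$tests = [
    'hello, world',
    'hello,world',
    '1,234 worlds',
    '1,23 worlds',
    'hello, 12,34,567 worlds',
    'hello "1,234" worlds',
];

foreach ($tests as $test) {
    $offsets = badCommaOffsets($pattern, $test);
    echo $test, ' => ', $offsets ? 'incorrect' : 'ok', PHP_EOL;
}
```

Returning offsets rather than a plain yes/no makes it easy to point at the exact comma that is wrong, which helps when a string contains both valid and invalid commas.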