
How can I write a regular expression that spots incorrect usage of a comma in a string? The rules are: 1. for non-numbers, no space before the comma and exactly one space after it; 2. for numbers, a comma is allowed if it is preceded by 1-3 digits and followed by 3 digits.

Some test cases:

  • hello, world
  • hello,world => incorrect
  • hello ,world => incorrect
  • 1,234 worlds
  • 1,23 worlds => incorrect
  • 1,2345 worlds => incorrect
  • hello,123 worlds => incorrect
  • hello, 1234,567 worlds => incorrect
  • hello, 12,34,567 worlds => incorrect
  • (new test case) hello 1, 2, and 3 worlds
  • (new test case) hello $1,234 worlds
  • (new test case) hello $1,2345 worlds => incorrect
  • (new test case) hello "1,234" worlds
  • (new test case) hello "1,23" worlds => incorrect

So I thought I'd use one regex to capture words with bad syntax, (?![\S\D],[\S\D]) (capture where a non-space/non-digit is followed by a comma and then by another non-space/non-digit), and join it with another regex to capture numbers with bad syntax, (?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$)). Putting that together gets me

preg_match_all("/(?![\S\D],[\S\D])|(?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))/",$str,$syntax_result);

... but obviously it doesn't work. How should it be done?

================EDIT================

Thanks to Casimir et Hippolyte's answer below, I got it to work! I've extended his pattern to take care of more corner cases. I don't know if the syntax I added is the most efficient, but it works for now. I'll update this as more corner cases come up!

$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    [\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+))  # comma between words or line break
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
Alex
  • May I ask why you are trying to do this? – Sverri M. Olsen Dec 08 '13 at 00:36
  • A possible practical application: A paragraph of text is extracted by an OCR software and I want to make sure the syntax is okay before storing the text, using this application as one of the grammar checks. (In my case the "OCR software" is an inexpensive contractor from a non-English speaking country who's typing up text from a copied pdf file) – Alex Dec 08 '13 at 00:45

1 Answer


It isn't bulletproof, but this should give you an idea of how to proceed:

$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    \w+,(?=[ ]\w+)  # comma between words
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

print_r($matches[0]);

The idea is to exclude the allowed commas from the match result so that only the incorrect commas remain. The first non-capturing group holds a list of the correct situations: whatever it matches is skipped and discarded. (You can easily add other cases.)
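As a rough sketch of how this plays out, the snippet below wraps the same pattern in a small helper and runs it on a few of the test strings from the question. The name find_bad_commas is made up for illustration and is not part of any library; it simply returns the byte offsets of the commas that were not skipped.

```php
<?php
// Hypothetical helper (name invented for this example): returns the
// byte offsets of every comma that the pattern reports as incorrect.
function find_bad_commas(string $text): array
{
    // Same pattern as in the answer, unchanged.
    $pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    \w+,(?=[ ]\w+)  # comma between words
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    // Each entry of $matches[0] is [matched text, offset]; keep the offsets.
    return array_column($matches[0], 1);
}

$samples = [
    'hello, world',
    'hello,world',
    '1,234 worlds',
    '1,23 worlds',
    'hello $1,234 worlds',
];

foreach ($samples as $s) {
    $bad = find_bad_commas($s);
    echo $s, ' => ', $bad ? 'incorrect at offset ' . implode(', ', $bad) : 'ok', "\n";
}
```

With these inputs, only 'hello,world' (offset 5) and '1,23 worlds' (offset 1) should be reported. The allowed commas are consumed by the first group and thrown away by (*SKIP)(*FAIL), so the final `,` alternative never gets to see them.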

[^\PP,] means "all punctuation characters except ,". You can replace this character class with a more explicit list of allowed characters, for example: [("']
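To see what that class accepts, here is a minimal check against a handful of single characters. The helper name in_punct_class is hypothetical and exists only for this illustration.

```php
<?php
// Hypothetical helper: true if the single character $ch is matched by [^\PP,],
// i.e. it is a punctuation character other than the comma.
function in_punct_class(string $ch): bool
{
    // D makes $ match only at the very end of the subject.
    return preg_match('/^[^\PP,]$/D', $ch) === 1;
}

foreach (['"', '(', '.', ',', '$', 'a', ' '] as $ch) {
    echo var_export($ch, true), ' => ', in_punct_class($ch) ? 'yes' : 'no', "\n";
}
```

The quote, the parenthesis and the full stop are accepted; the comma, the letter and the space are not. The dollar sign is not accepted either, because Unicode classifies it as a currency symbol rather than punctuation, which is why currency signs are listed separately in [£$\s].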

You can find more information about (*SKIP) and (*FAIL) here and here.

Casimir et Hippolyte
  • Thanks! This worked brilliantly. And more importantly, thanks for the lesson (WOW there's a lot left for me to learn about regex..) – Alex Dec 08 '13 at 01:20
  • Sorry - I realized that I need to add cases for the "blacklist" to allow strings like "I have 1, 2, or 3 apples" but I don't know how to correctly add cases.. – Alex Dec 08 '13 at 06:30
  • @AlexChang: this case is already supported with `\w+,(?=[ ]\w+)` since the `\w` class contains digits too. – Casimir et Hippolyte Dec 08 '13 at 10:58
  • Sorry for following up with so many questions. Now I see that `\w` takes care of the case I wrote in the last comment, but another corner case is to take care of dollar amounts (ie. $12,345 is okay) and quotation marks before the numeric value. Also `\h` isn't taught on the regex cheatsheet (http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/ - super useful for regex beginners!) so I don't know what it means, and `\h` is also hard to google/bing.. :P – Alex Dec 08 '13 at 23:48
  • @Alex: `\h` is a character class that contains any horizontal white spaces (i.e. spaces and tabs). About the `$` you can easily add it as an alternative inside the lookbehind (don't forget to escape it or add it inside a class `[$£]`). About quotation marks, they are inside the `[^\PP,]` class. – Casimir et Hippolyte Dec 09 '13 at 00:05
  • Thanks! I think I finally got what's going on. – Alex Dec 09 '13 at 05:12