
I am writing a regular expression to match the strings MySQL accepts when assigning to a DATETIME column.

The regular expression needs to match these strings:

2016-07-12 06:32:54.0001
2016-07-12 06:32:54.
2016-07-12 06:32:54
2016-07-12 06:32:
2016-07-12 06:32
2016-07-12 06:
2016-07-12 06
2016-07-12 
2016-07-12

But should not match these strings:

2016-07-12 06:32:.0001
2016-07-12 06::54.0001
2016-07-12 :32:54.0001
2016-07-12 ::.

That is, every part after and including the middle space is optional, but each optional part depends on the previous part (the regex can only skip remaining parts to go straight to the end).

Currently I have:

/^
(\d+) # year
[[:punct:]]
(\d+) # month
[[:punct:]]
(\d+) # day
(?:
    (?:T|\s+|[[:punct:]]) # separator between date and time
    (?:
        (\d+) # hour
        (?:
            [[:punct:]]
            (?:
                (\d+) # minute
                (?:
                    [[:punct:]]
                    (?:
                        (\d+) # second
                        (?:
                            \.
                            (\d+)? # microsecond
                        )?
                    )?
                )?
            )?
        )?
    )?
)?
$/xDs
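
To make the requirement easy to check, here is a rough sketch of the same pattern run against the sample strings above. It uses Python's `re` module purely as a stand-in, which is an assumption on my part since the pattern itself is PCRE. Python has no POSIX classes, so `PUNCT` is a hypothetical ASCII substitute for `[[:punct:]]`, and `\Z` plays the role of `$` with the `D` flag.

```python
import re

# ASCII punctuation; an assumed stand-in for PCRE's [[:punct:]].
PUNCT = r"[!-/:-@\[-`{-~]"

DATETIME_RE = re.compile(
    rf"""
    ^
    (\d+) {PUNCT} (\d+) {PUNCT} (\d+)      # year, month, day
    (?:
        (?:T|\s+|{PUNCT})                  # separator between date and time
        (?:
            (\d+)                          # hour
            (?:
                {PUNCT}
                (?:
                    (\d+)                  # minute
                    (?:
                        {PUNCT}
                        (?:
                            (\d+)          # second
                            (?: \. (\d+)? )?   # microsecond
                        )?
                    )?
                )?
            )?
        )?
    )?
    \Z
    """,
    re.VERBOSE,
)

SHOULD_MATCH = [
    "2016-07-12 06:32:54.0001",
    "2016-07-12 06:32:54.",
    "2016-07-12 06:32:54",
    "2016-07-12 06:32:",
    "2016-07-12 06:32",
    "2016-07-12 06:",
    "2016-07-12 06",
    "2016-07-12 ",
    "2016-07-12",
]

SHOULD_NOT_MATCH = [
    "2016-07-12 06:32:.0001",
    "2016-07-12 06::54.0001",
    "2016-07-12 :32:54.0001",
    "2016-07-12 ::.",
]

for s in SHOULD_MATCH:
    print(repr(s), bool(DATETIME_RE.match(s)))
for s in SHOULD_NOT_MATCH:
    print(repr(s), bool(DATETIME_RE.match(s)))
```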

Is there a way to avoid the deeply nested groups?

Thanks

Jesse
  • If you need to use the captured values, there is no way to simplify this. If it is a PCRE regex, you can shorten it a tiny bit by replacing `[[:punct:]]` with `\p{P}`. – Wiktor Stribiżew Jul 12 '16 at 08:59
  • Nested groups seem the most natural solution to your problem. Named capturing groups might make the regex clearer. Depending on your language, you might prefer to write more than one regex and consume the tokens as you match them (for example, by replacing them with an empty string). – Taemyr Jul 12 '16 at 09:07
  • This will allow you to separate into groups without nesting: `([^\-\s\.\:]{2,4})[\-\s\.\:]`. The downside is that it can't identify the wrong cases; you might need a separate regex to check that there are no repeating delimiters. – Yaron Jul 12 '16 at 09:27
  • What is the use case, BTW? I do not see the point in shortening this regex. It is a well-known fact that good, efficient regexps are complex and long. – Wiktor Stribiżew Jul 12 '16 at 09:37
  • Why worry about "deep nesting"? The *logic* is nested, so it's reasonable that the *pattern* be nested too. It works. Leave it. – Bohemian Jul 12 '16 at 12:15
  • If I had `if {` blocks nested that deep in my code I'd refactor it. I'm just asking if I can do the same for the regex. – Jesse Jul 12 '16 at 12:54
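
One way to read Taemyr's named-groups suggestion alongside the refactoring analogy is to leave the compiled pattern nested but generate it from a flat list. This is only a sketch: it uses Python's `re` as a stand-in for PCRE, and `PUNCT`, `OPTIONAL_PARTS` and `build_tail` are hypothetical names introduced here, not anything from the thread.

```python
import re

# ASCII punctuation; an assumed stand-in for PCRE's [[:punct:]].
PUNCT = r"[!-/:-@\[-`{-~]"

# (separator pattern, group name) for each optional part, in order.
OPTIONAL_PARTS = [
    (rf"(?:T|\s+|{PUNCT})", "hour"),
    (PUNCT, "minute"),
    (PUNCT, "second"),
    (r"\.", "microsecond"),
]


def build_tail(parts):
    """Fold the flat list into nested optional groups, innermost part first."""
    tail = ""
    for sep, name in reversed(parts):
        # Separator is optional; the value after it is optional too,
        # and everything that follows hangs off the value.
        tail = rf"(?:{sep}(?:(?P<{name}>\d+){tail})?)?"
    return tail


FLAT_RE = re.compile(
    rf"^(?P<year>\d+){PUNCT}(?P<month>\d+){PUNCT}(?P<day>\d+)"
    rf"{build_tail(OPTIONAL_PARTS)}\Z"
)

m = FLAT_RE.match("2016-07-12 06:32")
print(m.groupdict())
```

The nesting does not go away, in line with the comments above; it just moves out of the hand-written source, and the named groups keep every captured value accessible.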

1 Answer


The answer to your question, if you want to preserve the captured values in the results, is no.

Why? PCRE does not keep the captures from each iteration of a repeated group. If you use any kind of (?:(PATTERN_BLOCK)PATTERN_BLOCK2){n}, you will only get the PATTERN_BLOCK value from the final iteration. If you thought about (?(DEFINE)...), the capturing groups inside that block are also reset afterwards, so you have no access to those values.
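
The repeated-group limitation can be seen in a small sketch. It uses Python's `re` module as a stand-in (an assumption, since the question is about PCRE); the pattern and input are illustrative only.

```python
import re

# One capturing group, repeated three times over "2016-07-12".
m = re.match(r"(?:(\d+)-?){3}$", "2016-07-12")

# Only the capture from the last iteration survives.
print(m.groups())  # ('12',)
```

The year and month are matched but their captures are overwritten, which is why the repeated form cannot replace the separate groups.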

These optional groups with "nested" capturing groups are exactly what you need in this case.

Wiktor Stribiżew