Regular expression: Sequence of optional but dependent matches without nesting groups

Question

I am writing a regular expression to match the strings MySQL accepts when assigning to a DATETIME column.

The regular expression needs to match these strings:

2016-07-12 06:32:54.0001
2016-07-12 06:32:54.
2016-07-12 06:32:54
2016-07-12 06:32:
2016-07-12 06:32
2016-07-12 06:
2016-07-12 06
2016-07-12 
2016-07-12

But should not match these strings:

2016-07-12 06:32:.0001
2016-07-12 06::54.0001
2016-07-12 :32:54.0001
2016-07-12 ::.

That is, every part after and including the middle space is optional, but each optional part depends on the previous part (the regex can only skip remaining parts to go straight to the end).

Currently I have:

/^
(\d+) # year
[[:punct:]]
(\d+) # month
[[:punct:]]
(\d+) # day
(?:
    (?:T|\s+|[[:punct:]]) # separator between date and time
    (?:
        (\d+) # hour
        (?:
            [[:punct:]]
            (?:
                (\d+) # minute
                (?:
                    [[:punct:]]
                    (?:
                        (\d+) # second
                        (?:
                            \.
                            (\d+)? # microsecond
                        )?
                    )?
                )?
            )?
        )?
    )?
)?
$/xDs

Is there a way to avoid the deeply nested groups?

Thanks

If you need to use the captured values, there is no way to simplify this. If it is a PCRE regex, you can shorten it a tiny bit by replacing `[[:punct:]]` with `\p{P}`. — Wiktor Stribiżew, Jul 12 '16 at 08:59
Nested groups seem the most natural solution to your problem. Named capturing groups might make the regex clearer. Depending on your language, you might prefer to write more than one regexp and consume the tokens (for example, by replacing them with an empty string) as you match them. — Taemyr, Jul 12 '16 at 09:07
This will allow you to separate into groups without nesting: `([^\-\s\.\:]{2,4})[\-\s\.\:]`. The downside is that you can't identify the wrong cases; you could use a separate regex to check that there are no repeated delimiters. — Yaron, Jul 12 '16 at 09:27
What is the use case, BTW? I do not see the point in shortening this regex. It is a well-known fact that good, efficient regexps are complex and long. — Wiktor Stribiżew, Jul 12 '16 at 09:37
Why worry about "deep nesting"? The *logic* is nested, so it's reasonable that the *pattern* be nested too. It works. Leave it. — Bohemian, Jul 12 '16 at 12:15
If I had `if {` blocks nested that deep in my code I'd refactor it. I'm just asking if I can do the same for the regex. — Jesse, Jul 12 '16 at 12:54

score 1 · Accepted Answer · edited May 23 '17 at 12:14

If you want to preserve the captured values in the results, the answer to your question is no.

Why? PCRE does not keep every value captured by a repeated group. If you use any kind of (?:(PATTERN_BLOCK)PATTERN_BLOCK2){n}, you will get only the value from the final iteration of PATTERN_BLOCK. If you thought about (?(DEFINE)....), the capturing groups inside that block are also reset later, so you have no access to those values.
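
To illustrate the point about repeated groups, here is a minimal sketch. It uses Python's `re` module rather than PCRE, purely as an assumption to get a runnable example; in this respect the two engines behave the same way. The separator class `[-:. T]` is a simplified stand-in for `[[:punct:]]`, which Python's `re` does not support, and the name `flat` is only illustrative.

```python
import re

# A "flat" pattern: one capturing group repeated instead of nested groups.
# Assumption: [-:. T] is a simplified stand-in for the separators in the question.
flat = re.compile(r'^(?:(\d+)[-:. T]?)+$')

m = flat.match('2016-07-12 06:32:54')

# The string matches, but the single capturing group was overwritten on
# every repetition, so only the last value survives.
print(m.groups())  # ('54',)
```

The year, month, day, hour and minute are matched but no longer available, which is why the flat form cannot replace the nested one when the values are needed.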

These optional groups with "nested" capturing groups are exactly what you need in this case.
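
As suggested in the comments, named groups can at least make the nested pattern easier to read. The following is a sketch in Python under a few assumptions: `[^\w\s]` is used as a rough stand-in for `[[:punct:]]` (Python's `re` has no POSIX classes, and unlike `[[:punct:]]` this class excludes the underscore), `\Z` plays the role of `$` with the `D` modifier, and the names `DATETIME_RE` and `parse` are hypothetical.

```python
import re

# Same nesting as the question's pattern, with named groups.
# Assumption: [^\w\s] approximates [[:punct:]].
DATETIME_RE = re.compile(r"""
    ^
    (?P<year>\d+)  [^\w\s]
    (?P<month>\d+) [^\w\s]
    (?P<day>\d+)
    (?:
        (?:T|\s+|[^\w\s])                # separator between date and time
        (?:
            (?P<hour>\d+)
            (?:
                [^\w\s]
                (?:
                    (?P<minute>\d+)
                    (?:
                        [^\w\s]
                        (?:
                            (?P<second>\d+)
                            (?: \. (?P<microsecond>\d+)? )?
                        )?
                    )?
                )?
            )?
        )?
    )?
    \Z
""", re.VERBOSE)


def parse(text):
    """Return a dict of the named parts, or None if the text does not match."""
    m = DATETIME_RE.match(text)
    return m.groupdict() if m else None


print(parse('2016-07-12 06:32'))        # second and microsecond are None
print(parse('2016-07-12 06::54.0001'))  # None
```

The nesting is still there, but each level is labeled, and parts that were skipped come back as `None` rather than shifting the group numbers.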
