Better regex syntax ideas

Question

I need some help to complete my idea about regexes.

Introduction

There was a question about better syntax for regexes on SE, but I don't think I'd use the fluent syntax. It's surely nice for newbies, but in case of a complicated regex, you replace a line of gibberish by a whole page of slightly better gibberish. I like the approach by Martin Fowler, where a regex gets composed of smaller pieces. His solution is readable, but hand-made; he proposes a smart way to build a complicated regex instead of a class supporting it.

I'm trying to make it to a class using something like (see his example first)

final MyPattern pattern = MyPattern.builder()
.caseInsensitive()
.define("numberOfPoints", "\\d+")
.define("numberOfNights", "\\d+")
.define("hotelName", ".*")
.define(' ', "\\s+")
.build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`");

MyMatcher m = pattern.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
System.out.println(m.group("numberOfPoints")); // prints 400

where fluent syntax is used for combining regexes extended as follows:

define named patterns and use them by enclosing in backticks
- `name` creates a named group
  - mnemonics: shell captures the result of the command enclosed in backticks
- `:name` creates a non-capturing group
  - mnemonics: similar to (?:...)
- `-name` creates a backreference
  - mnemonics: the dash connects it to the previous occurrence
redefine individual characters and use it everywhere unless quoted
- here only some characters (e.g., ~ @#%") are allowed
  - redefining + or ( would be extremely confusing, so it's not allowed
  - redefining space to mean any spacing is very natural in the example above
  - redefining a character could make the pattern more compact, which is good unless overused
  - e.g., using something like define('#', "\\\\") for matching backslashes could make the pattern much readable
redefine some quoted sequences like \s or \w
- the standard definitions are not Unicode conform
- sometimes you might have you own idea what a word or space is

The named patterns serves as a sort of local variables helping to decompose a complicated expression into small and easy to understand pieces. A proper naming pattern makes often a comment unnecessary.

Questions

The above shouldn't be hard to implement (I did already most of it) and could be really useful, I hope. Do you think so?

However, I'm not sure how it should behave inside of brackets, sometimes it's meaningful to use the definitions and sometimes not, e.g. in

.define(' ', "\\s")            // a blank character
.define('~', "/\**[^*]+\*/")   // an inline comment (simplified)
.define("something", "[ ~\\d]")

expanding the space to \s makes sense, but expanding the tilde doesn't. Maybe there should be a separate syntax to define own character classes somehow?

Can you think of some examples where the named pattern are very useful or not useful at all? I'd need some border cases and some ideas for improvement.

Reaction to tchrist's answer

Comments to his objections

Lack of multiline pattern strings.
- There are no multiline strings in Java, which I'd like to change, but can not.
Freedom from insanely onerous and error-prone double-backslashing...
- This is again something I can't do, I can only offer a workaround, s. below.
Lack of compile-time exceptions on invalid regex literals, and lack of compile-time caching of correctly compiled regex literals.
- As regexes are just a part of the standard library and not of the language itself, there's nothing what can done here.
No debugging or profiling facilities.
- I can do nothing here.
Lack of compliance with UTS#18.
- This can be easily solved by redefining the corresponding patterns as I proposed. It's not perfect, since in debugger you'll see the blowed up replacements.

I looks like you don't like Java. I'd be happy to see some syntax improvements there, but there's nothing I can do about it. I'm looking for something working with current Java.

RFC 5322

Your example can be easily written using my syntax:

final MyPattern pattern = MyPattern.builder()
.define(" ", "") // ignore spaces
.useForBackslash('#') // (1): see (2)
.define("address",         "`mailbox` | `group`")
.define("WSP",             "[\u0020\u0009]")
.define("DQUOTE",          "\"")
.define("CRLF",            "\r\n")
.define("DIGIT",           "[0-9]")
.define("ALPHA",           "[A-Za-z]")
.define("NO_WS_CTL",       "[\u0001-\u0008\u000b\u000c\u000e-\u001f\u007f]") // No whitespace control
...
.define("domain_literal",  "`CFWS`? #[ (?: `FWS`? `dcontent`)* `FWS`? #] `CFWS1?") // (2): see (1)
...
.define("group",           "`display_name` : (?:`mailbox_list` | `CFWS`)? ; `CFWS`?")
.define("angle_addr",      "`CFWS`? < `addr_spec` `CFWS`?")
.define("name_addr",       "`display_name`? `angle_addr`")
.define("mailbox",         "`name_addr` | `addr_spec`")
.define("address",         "`mailbox` | `group`")
.build("`address`");

Disadvantages

While rewriting your example I encountered the following issues:

As there are no \xdd escape sequences \udddd must be used
Using another character instead of backslash is a bit strange
As I prefer to write it bottom-up, I had to take your lines reverted
Without much idea what it does, I except myself having done some errors

On the bright side: - Ignoring spaces is no problem - Comments are no problem - The readability is good

And most important: It's plain Java and uses the existing regex-engine as is.

Java 1.7 has named captures. `(? ...)` defines one and `\k` backreferences it, just as in Perl. — tchrist, Feb 06 '11 at 18:38
I know... but they're usable usable for capturing only, not for defining subexpressions to be used later. — maaartinus, Feb 06 '11 at 18:41
maaartinus: You can’t use them later because Java doesn’t have a `(?(DEFINE)…)` block yet. That’s really all you need to make that work; then you could call them with `(?&name)`. I have longer examples of these [here in the second, longer example](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326). They’re incredibly useful, but you should see that already in the BNF example I already gave. — tchrist, Feb 06 '11 at 18:46
Java doesn't have the (?(DEFINE)…) block, and I'm afraid it'll stay so for a very long time. However, it can be replaced by my syntax, as I show you. Expect my long answer in about two ours. — maaartinus, Feb 06 '11 at 18:51
That’s a pretty fine effort, I have to admit. Still, I see you didn’t come close to implementing the whole thing. **Do you understand why Java makes doing so impossible?** It’s because of the `(?&comment)` production. Also, UTS#18 compliance requires far more than just RL1.2a, you know. Java also fails to comply with RL1.1, RL1.2, and RL1.4, plus both the two strong recommendations. As for “not liking Java”, I try to be practical: I prefer things that actually work, and which aren’t a royal PITA. At the end of the day I’m [on Rob’s side](http://www.youtube.com/watch?v=5kj5ApnhPAE) on all this. — tchrist, Feb 06 '11 at 21:42
You mean the loop between `comment` and `ccontent`, right? Either it can be converted in java-regex by manually breaking the loop or not, I can't tell now, can you? I know nearly nothing about Unicode compliance, are you saying that redefining things like `\s`, `\w` and `\p{...}` is not enough? — maaartinus, Feb 06 '11 at 22:01
@maaartinus: Yes, that right: it’s the mutual recursion between `comment` and `ccontent`, defined by the RFC. Java disallows the pattern recursion which PCRE and Perl rely upon here. Regarding UTS#18, the biggest problem (apart from `\s`, `\w`, `\b`, and `\X`) is that Java only recognizes 3 of the 11 mandatory properties in `\p{…}`, let alone the other 101 missing ones. It’s really tough without `\p{Alphabetic}` or `\p{White_Space}` or `\p{Lowercase}`. Plus Java uses standards-mandated property names to mean something they **are not supposed to mean!** It’s a real mess. — tchrist, Feb 06 '11 at 22:24
Now, I think the mutual recursion can be eliminated. However, it's not my goal to write a regex for something as overcomplicated as this crazy RFC. It's no problem to redefine the `\p{`...`}` things and to define some more as long as the result can be expressed by a java-regex (which for character classes always holds). This could fix it for current java without waiting for Oracle to wake up. I only need to tune the syntax and use your expression -- this is allowed, isn't it? — maaartinus, Feb 06 '11 at 22:34
Course you may use [my patterns](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) for charclasses, but properties are much harder. You cannot define `\p{Alphabetic}` without `Other_Alphabetic`, which is a derived property,etc. You must use the other `*.txt` files from the UCD for this. And *mirabile visu* [Oracle has actually woken up](http://mail.openjdk.java.net/pipermail/i18n-dev/2011-January/thread.html) about this issue — after quite a bit of patient prodding. They’ve fixed RL1.1, but there is plenty left to be done. — tchrist, Feb 06 '11 at 22:38
You have a refreshing can-do attitude; nice to see. But tell me, how do you think you can eliminate the recursion? The RFC allows for nested comments! That requires a recursive pattern. For example, in Perl `s/\\((?:[^()]*+|(?0))*\\)//g` strips out nested parens and their contents from the target string. That’s the sort of pattern you need, and Java forbids it. This is just one example of why interpolation isn’t ever going to provide anything like the sort of power and flexibility which delayed evaluation of previously defined groups gives you. — tchrist, Feb 06 '11 at 22:44
Nested comments (and the whole RFC) are really funny, what did they smoked? So I was obviously wrong, but in this case the pattern in http://stackoverflow.com/questions/1579023/design-of-an-alternative-fluent-interface-for-regular-expressions must be wrong too, right? — maaartinus, Feb 06 '11 at 23:12
That’s Jeffrey Friedl’s RFC822 pattern. It is not an RFC5322 pattern, and the poster obviously doesn’t know the difference. — tchrist, Feb 06 '11 at 23:58
My fault, it's stated there. Let's forget Unicode for a while, and let's think about the other unclear points like if it's useful to distinguish syntactically between defining character classes and general pattern. — maaartinus, Feb 07 '11 at 00:21
The point isn't to define a character class. It's to define a pattern that can be nested within another one and called by name. — tchrist, Feb 07 '11 at 01:43
Right, but there are different kinds of patterns. Some of them like "[\\s\\p{Dash}]" are character classes and can be used to build other character classes, while others can not. A possibility to define the former is (sort of) needed for fixing the Unicode compatibility and useful anyway. Something like `defineCharclass("mysep", "[\\s\\p{Dash}]")` and `"(\\w+)\p{mysep}+(\\w+)"` may be handy, may not it? — maaartinus, Feb 07 '11 at 15:35

score 3 · Accepted Answer · answered Feb 06 '11 at 17:38

Named Capture Examples

Can you think of some examples where the named pattern are very useful or not useful at all?

In answer to your question, here is an example where named patterns are especially useful. It’s a Perl or PCRE pattern for parsing an RFC 5322 mail address. First, it’s in /x mode by virtue of (?x). Second, it separates out the definitions from the invocation; the named group address is the thing that does the full recursive-descent parse. Its definition follows it in the non-executing (?DEFINE)…) block.

   (?x)              # allow whitespace and comments

   (?&address)       # this is the capture we call as a "regex subroutine"

   # the rest is all definitions, in a nicely BNF-style
   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

I strongly suggest not reïnventing a perfectly good wheel. Start with becoming PCRE-compatible. If you wish to go beyond basic Perl5 patterns like the RFC5322-parser above, there’s always Perl6 patterns to draw upon.

It really, really pays to do research into existing practice and literature before haring off on an open-ended R&D mission. These problems have all long ago been solved, sometimes quite elegantly.

Improving Java Regex Syntax

If you truly want better regex syntax ideas for Java, you must first address these particular flaws in Java’s regexes:

Lack of multiline pattern strings, as demonstrated above.
Freedom from insanely onerous and error-prone double-backslashing, as also demonstrated above.
Lack of compile-time exceptions on invalid regex literals, and lack of compile-time caching of correctly compiled regex literals.
Impossible to change something like "foo".matches(pattern) to use a better pattern library, partly but not solely because of final classes that are not overridable.
No debugging or profiling facilities.
Lack of compliance with UTS#18: Basic Regular Expression support, the very most elementary steps necessary to make Java regexes useful for Unicode. They currently are not. They don’t even support Unicode 3.1 properties from a decade ago, which means you cannot use Java patterns for Unicode in any reasonable fashion; the basic building blocks are absent.

Of these, the first 3 have been addressed in several JVM languages, including both Groovy and Scala; even Clojure goes part-way there.

The second set of 3 steps will be tougher, but are absolutely mandatory. The last one, the absence of even the most basic Unicode support in regexes, simply kills Java for Unicode work. This is complety inexcusable this late in the game. I can provide plenty of examples if need be, but you should trust me, because I really do know what I’m talking about here.

Only once you have accomplished all these should you be worried about fixing up Java’s regexes so they can catch up with the current state of the art in pattern matching. Until and unless you take care of these past oversights, you can’t begin to look to the present, let alone to the future.

"*I strongly suggest not reïnventing a perfectly good wheel.*" - Unfortunately, your wheel -- unlike mine -- requires to redefine Java syntax and to re-write the whole regex-engine. But thx for your answer, I'll comment it more in detail later. — maaartinus, Feb 06 '11 at 18:03
@maaartinus: My third point about caching you can somewhat live without, but the first two are really important; the other languages that compile into JVM do manage all three. For the last point about Unicode, you *could* use [a rewrite library like this](http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions/4307261#4307261) to help make Java’s patterns more compliant with [UTS#18: Unicode Regular Expression’s requirement #RL1.2a](http://unicode.org/reports/tr18/#Compatibility_Properties), but they still lack the rest of RL1.2, and much more. — tchrist, Feb 06 '11 at 19:13
Why should I need caching? With `private final Pattern ...` it gets compiled once and may be reused. I could also add a cache for dynamically created regexes, sure it would be quite verbose, but that's the Java way. My single biggest trouble with regexes is their unreadability -- that's what I'm trying to solve and consider it more important than any of your points. Your mileage may vary, but many programmers do not need the advanced concepts or strict Unicode conformance (don't even know they exists). — maaartinus, Feb 07 '11 at 00:38

score 1 · Answer 2 · answered Feb 07 '11 at 02:30

I think that perhaps a Regular Expression isn't really what is desired after-all, but rather something such as a Parser-Combinator library (that can work on characters and/or include regular-expressions within it's constructs).

That is, step beyond the realm of regular expressions (as irregularly as they may be implemented -- tchrist definitely enjoys the Perl implementation ;-) and into context-free grammars -- or at least those that can represented in LL(n), preferably with minimal backtracking.

Scala: The Magic Begind Parse-Combinators Note how it looks quite similar to BCNF. Has a nice introduction.

Haskel: Parsec Ditto.

Some examples in Java are JParsec and JPC.

Java, as a language, however, is not as conducive to such seamless DSL extensions as some competitors ;-)

I don't like this DSL much. It's very verbose, especially in Java. However, thx for the pointers. — maaartinus, Mar 30 '11 at 10:29