16

I've seen this question, and I know from experience that every language seems to support a different dialect of regex. I figure the problem has been around for a long time, so somebody must have wanted to do something about it.

I have a pretty big project that involves JavaScript, Ruby, and Java, and all of them have to touch the same regular expressions. We picked Java as our "official" RE interpreter, which means that any time the other two languages need to evaluate an RE, they have to somehow pass it to a Java program, and that's starting to add up to a lot of overhead.

If I could pick any RE dialect and invoke that at least semi-natively from all the languages, it'd be a huge step forward for us. Is this possible? Is it being done already? We looked at PCRE, and it's technically possible to invoke it via native bindings from Java and Ruby (though it leaves JS out in the cold), but I haven't found anybody actually doing it. Are we alone?

ETA: a wrinkle I did not mention is that this system applies user-supplied regexes. (Yes, I understand that this is a security issue, etc., but it's for in-house use by trusted, attributed users.) I can certainly suggest putting up a list of "don't do this" power features to avoid, but I kind of hope that's not the best solution.

Coderer
  • Ah, user-supplied regular expressions. I knew we were missing something ;) – BoltClock Dec 22 '11 at 00:20
  • (I just wanted to post the stopgap that we wound up using, in case anybody cares. I'd still love to hear something better, though.) We picked Java regex. We can run these natively from Ruby code, provided the Ruby is run in JRuby. For our purposes, that's good enough. We also wrote a Java servlet that basically runs a regex against test data, as a RESTful service. This takes care of the JavaScript end, though of course it's not pretty :-/ – Coderer Mar 07 '13 at 12:01

3 Answers

11

The dialects that you implicitly mentioned in your post aren't THAT different. Each supports a few things the others don't, but that normally won't cause any problems unless you write regular expressions that specifically target one of the dialects in question.

You can see the differences between the dialects in the table available in the following link:


The major differences between them are in the more "advanced" features of regular expressions. If you stay away from these, you'll be in the safe zone.


Since both Python and Java have modules available for executing native JavaScript, you could require that all expressions be written for JavaScript and have future developers use the module available to them, so that a given regexp always behaves exactly the same way.

Though I'd just document in your application that any regular expression used needs to be supported by all three languages, and then point developers to a table (such as the one previously linked) where they can look up what's available to use.

...or you could compile a list/table of your own.
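If you do compile your own list, it can double as an automated check on user-supplied patterns. Here is a minimal sketch in Python, assuming a small, non-exhaustive deny-list. The choice of constructs, the labels, and the names `NON_PORTABLE`, `mask_escapes`, and `find_non_portable` are all hypothetical. Whether something like lookbehind belongs on the list depends on the engine versions you target. The scan is purely textual and does not parse character classes, so expect occasional false positives.

```python
import re

# Illustrative, non-exhaustive deny-list (an assumption, not a reference):
# each entry is (label, detector). Verify it against your own engines.
NON_PORTABLE = [
    ("lookbehind", re.compile(r"\(\?<[=!]")),
    ("atomic group", re.compile(r"\(\?>")),
    ("inline flags", re.compile(r"\(\?[a-zA-Z-]+[:)]")),
    ("possessive quantifier", re.compile(r"[*+?}]\+")),
    ("string anchor", re.compile(r"\\[AZz]")),
    ("POSIX bracket class", re.compile(r"\[:\w+:\]")),
]

def mask_escapes(pattern):
    """Replace escaped pairs such as \\( or \\\\ with '_' so that escaped
    metacharacters are not mistaken for syntax; keep \\A, \\Z, \\z visible."""
    def repl(m):
        return m.group(0) if m.group(1) in "AZz" else "_"
    return re.sub(r"\\(.)", repl, pattern, flags=re.S)

def find_non_portable(pattern):
    """Return the labels of deny-listed constructs found in the pattern."""
    masked = mask_escapes(pattern)
    return [label for label, detector in NON_PORTABLE if detector.search(masked)]
```

Calling `find_non_portable(user_pattern)` before accepting a pattern lets you reject it with a message that names the offending feature, instead of relying on users to read the table.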

Filip Roséen - refp
  • Super awesome link, but I notice there are actually some *not* advanced features that will probably matter. The first thing that jumps out at me is `Hyphen in [\d-z] is a literal` -- that's not an uncommon syntax and I don't think you can write a character class, when you're talking about hyphens, that would work identically under both Java and Ruby. – Coderer Dec 22 '11 at 00:16
  • Can't see why anybody sane would write such a statement, though. If you want a hyphen as a literal character and not a range operator inside `[]`, put it at the end; that's more standard. Regarding its use in ranges, being verbose is often better for maintenance; I do not recommend that people use `[a-\d]`, for example. – Filip Roséen - refp Dec 22 '11 at 00:37
1

The dialects are all slightly different, but they overlap in almost all major points. (The main differences are not in the regexes themselves, but in how you call them (one language's find is another's matches, and so on) and in support for regex literals (one language's // is another's raw string is another's string of backslashes).)

Rather than somehow getting JavaScript to support Java peculiarities and vice versa, I think it's probably better to restrict yourselves to the huge subset of regexes that are common between all three of your languages, and to use unit-tests to ensure that your regexes behave the same in all three.
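One way to set up such tests is to keep the cases in a single data file that every language's test suite loads. Below is a minimal sketch of the idea in Python, with Python's `re` standing in for one engine. The JSON layout, the field names, and `run_cases` are hypothetical; the Java, Ruby, and JavaScript suites would each have an equivalent loop over the same cases.

```python
import json
import re

# Hypothetical shared fixture. In practice this would live in one JSON file
# that the Java, Ruby, and JavaScript test suites all read.
CASES_JSON = """
[
  {"pattern": "^[a-z]+[0-9]{2}$", "input": "abc42", "matches": true},
  {"pattern": "^[a-z]+[0-9]{2}$", "input": "abc4",  "matches": false},
  {"pattern": "[0-9a-z-]",        "input": "-",     "matches": true}
]
"""

def run_cases(cases_json):
    """Return the cases whose actual result differs from the expected one."""
    failures = []
    for case in json.loads(cases_json):
        found = re.search(case["pattern"], case["input"]) is not None
        if found != case["matches"]:
            failures.append(case)
    return failures
```

An empty list from `run_cases(CASES_JSON)` means this engine agrees with the fixture. If one language's suite reports failures that the others don't, that pattern has left the common subset.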

ruakh
0

One (heavyweight) option would be to build a "regexp cross-compiler" that could accept as input a regular expression written in some canonical form (say, as a Perl regular expression), then would scan and parse it into a syntax tree and output equivalent regular expressions for other languages (say, Python or Java). This would let you write the regular expression once and have it work everywhere, since the compiler will do all of the work converting between formats.
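To give a feel for the shape of such a tool, here is a deliberately tiny sketch in Python. It assumes a Java-style canonical pattern and works on a flat token list rather than a real syntax tree. The `REWRITES` table and `translate` are hypothetical names, and the only rule shown maps `\A` and `\z` to `^` and `$` for JavaScript, which is only valid when the JavaScript regex is used without the `m` flag. A real cross-compiler would need a full parser (character classes, groups, quantifiers) and far more rules.

```python
import re

# A token is one escape sequence (backslash plus one character)
# or any other single character.
TOKEN = re.compile(r"\\.|.", re.S)

# Hypothetical per-target rewrite rules for a canonical Java-style pattern.
# JavaScript has no \A or \z; without the "m" flag its ^ and $ anchor to
# the whole input, so they are used as stand-ins here.
REWRITES = {
    "java": {},
    "ruby": {},
    "javascript": {"\\A": "^", "\\z": "$"},
}

def translate(pattern, target):
    """Rewrite the canonical pattern token by token for the given target."""
    rules = REWRITES[target]
    tokens = TOKEN.findall(pattern)
    return "".join(rules.get(tok, tok) for tok in tokens)
```

For example, `translate(r"\Afoo\.bar\z", "javascript")` rewrites the two anchors and leaves the escaped dot alone, while the `"ruby"` and `"java"` targets pass the pattern through unchanged.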

Hope this helps!

templatetypedef
  • Do I want to *write* this? Hell no. No way. But if you ever find this floating around somewhere, feel free to update your answer and I'll accept it! :D – Coderer Jul 20 '12 at 14:28