9

Are there any real differences between Ruby regex and Python regex?

I've been unable to find any differences in the two, but may have missed something.

Tim O
  • hmm? what are you trying to "find"? regex itself is a language, so the library might have a bit different flags but overall the syntax is the same between everything that supports it. – OneOfOne Apr 15 '11 at 02:07
  • 1
    Ruby1.8 or Ruby1.9? There is a huge difference there. – sawa Apr 15 '11 at 02:08
  • 3
    See - http://www.regular-expressions.info/refflavors.html – YOU Apr 15 '11 at 02:08
  • Rather, Ruby1.9 and PHP5 should be the same because they adopt the same oniguruma engine. – sawa Apr 15 '11 at 02:16
  • I'm pretty sure that neither has good regex debugging support. – tchrist Apr 15 '11 at 03:44
  • A special case of this question: http://stackoverflow.com/questions/4644847/list-of-all-regex-implementations – sawa Apr 15 '11 at 04:05
  • Ruby has regex baked right into the language, much like Perl, while Python doesn't take that route (it requires a library). While I like both ways, the former wins from a usability point of view. A set of syntactic differences is shown [here](http://langref.org/ruby+python/pattern-matching). One difference is shown [in this question](http://stackoverflow.com/questions/13577372/do-python-regular-expressions-have-an-equivalent-to-rubys-atomic-grouping) – nawfal Jul 21 '14 at 18:23

5 Answers

8

The last time I checked, they differed substantially in their Unicode support. Ruby, as of 1.9 at least, has some very limited Unicode support. I believe one or two Unicode properties might be supported by now; the general categories and maybe the scripts are the two I'm thinking of.

Python has less and more Unicode support at the same time. Python does seem to make it possible to meet the requirements of RL1.2a "Compatibility Properties" from UTS#18 on Unicode Regular Expressions.

That said, there is a really rather nice Python library out there by Matthew Barnett (mrab) that finally adds a couple of Unicode properties to Python regexes. He supports the two most important ones: the general categories, and the script properties. It has some other intriguing features as well. It deserves some good publicity.

I don't think either of Ruby or Python support Unicode all that terribly well, although more and more gets done every day. In particular, however, neither meets even the barebones Level 1 requirement for Unicode Regular Expressions cited above. For example, RL1.2 requires that at least 11 properties be supported: General_Category, Script, Alphabetic, Uppercase, Lowercase, White_Space, Noncharacter_Code_Point, Default_Ignorable_Code_Point, ANY, ASCII, and ASSIGNED.

I think Python only lets you get to some of those, and only in a roundabout way. Of course, there are many, many other properties beyond these 11.
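
To illustrate what "roundabout" can look like, here is a minimal sketch that uses only the standard `re` and `unicodedata` modules to approximate a General_Category class such as `\p{Ll}`. The helper names `chars_in_category` and `category_class`, and the `limit` cutoff, are my own assumptions for illustration and are not part of any library.

```python
import re
import unicodedata


def chars_in_category(prefix, limit=0x3000):
    """Collect characters below `limit` whose General_Category starts with `prefix`.

    `limit` is an arbitrary cutoff chosen to keep this sketch fast;
    raise it to cover more of the code space.
    """
    return [
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith(prefix)
    ]


def category_class(prefix, limit=0x3000):
    """Build a bracketed character class that stands in for \\p{<prefix>}."""
    members = "".join(re.escape(c) for c in chars_in_category(prefix, limit))
    return "[" + members + "]"


# Stand-in for \p{Ll}+ (runs of lowercase letters).
LOWER = re.compile(category_class("Ll") + "+")

print(LOWER.findall("Stra\u00dfe \u00c9COLE \u00e9lan"))
```

This only covers General_Category, which is the property `unicodedata.category()` exposes; the stdlib module offers no comparable lookup for Script, so that one would need other data.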

When you’re looking for Unicode support, there's more than just UTS#18 on Regular Expressions of course, although that is the one that matters most to this question, and neither Ruby nor Python is Level 1 compliant. Other very important aspects of Unicode include UAX#15, UAX#14, UTS#10, UAX#11, UAX#29, and of course the crucial UAX#44. Python has libraries for at least a couple of those, I know. I don't know that they're standard.
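
As one concrete data point, CPython's standard `unicodedata` module does cover at least two of those reports: normalization (UAX#15) and the East Asian Width property (UAX#11). The sketch below is only a small demonstration; the sample string is my own choice.

```python
import unicodedata

# FULLWIDTH LATIN SMALL LETTER A followed by COMBINING CIRCUMFLEX ACCENT.
s = "\uff41\u0302"

# UAX#15: compatibility decomposition folds the fullwidth letter to plain "a".
decomposed = unicodedata.normalize("NFKD", s)

# UAX#15: canonical composition then merges "a" + circumflex into U+00E2.
composed = unicodedata.normalize("NFC", decomposed)

# UAX#11: East Asian Width of the fullwidth letter.
width = unicodedata.east_asian_width("\uff41")

# Character names come from the Unicode Character Database (UAX#44).
name = unicodedata.name("\u0302")

print([hex(ord(c)) for c in decomposed], hex(ord(composed)), width, name)
```

Line breaking (UAX#14), collation (UTS#10), and text segmentation (UAX#29) are not handled by this module.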

But when it comes to regular expression support, um, there are richer alternatives than just those two, you know. :)

dawg
tchrist
  • I think ruby regex support has become much more powerful since you last checked: https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc – steenslag Apr 15 '11 at 13:40
  • @steenslag No, Ruby regexes still suck at Unicode. Charclass abbreviations are still pitifully out of step with RL1.2a, stuck in the ASCII sands of yesteryear. Same with the POSIX props. And things like `\p{lower}` are in radical conflict with the Unicode Standard, which says it must be all lowercase, not just letters. Beyond that, only two properties are supported: General_Category and Script properties. There’s no support for grapheme clusters via `\X` or equiv. There’s no `\N{NAME}` support. It’s missing the rest of the stuff for Level 1, the lowest acceptable level of Unicode support. – tchrist Apr 15 '11 at 20:57
  • 1
    @steenslag: Consider this totally reasonable, and indeed very commonly needed, pattern for matching a grapheme cluster (a user-perceived character) that has "a" and a circumflex, but where you do not know the normalization form first, where you want fullwidth "a"’s and such to match, and where other marks can fall between them: `NFKD($s) =~ / (?= a \p{Grapheme_Extend}* \N{COMBINING CIRCUMFLEX ACCENT} ) \X /ix`. How am I to do that in Ruby? Neither Ruby nor Python can even come close to meeting the **MINIMAL** requirements of [UTS#18 on Unicode Regexes](http://unicode.org/reports/tr18/). *See now?* – tchrist Apr 15 '11 at 21:03
  • I'm not a good discussion partner in this case; I had to look up most of your keywords on Wikipedia. But what would you advise the OP: Ruby or Python? – steenslag Apr 15 '11 at 22:58
  • 1
    @steenslag: Ruby or Python for *what*? Regular expressions? Both require what are to me unacceptable compromises. I have to be able to work with Unicode. – tchrist Apr 15 '11 at 23:34
5

I like the /pattern/ syntax in Ruby, inspired by Perl, for regular expressions. Python's re.compile("pattern") is not really elegant to me. The syntactic sugar in Ruby, and the fact that regular expressions live in a separate re module in Python, make me lean towards Ruby when it comes to regular expressions.

Apart from this, I don't see much of a difference from a normal regular expression programming perspective. Both languages have pretty comprehensive and mostly similar RE support. There might be performance differences (Python has traditionally had better performance), and Python also has greater Unicode regular expression support.

manojlds
  • How many of [the standard Unicode properties](http://unicode.org/reports/tr44/#Property_Index) does Python support? Also, how is Python’s support for [proper grapheme clusters](http://unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table) coming along, like via `\X` or perhaps through `\p{Grapheme_Base}\p{Grapheme_Extend}*`? Does it do full 1:many Unicode case folding for case insensitive matches? Can you reliably use any possible Unicode code point, or are you still hamstrung by that BMP restriction (which Unicode forbids, *ahem*)? BTW, I’m just ribbing you, don’t take it too seriously. – tchrist Apr 15 '11 at 03:35
  • 4
    I strongly agree with you that having regexes tightly coupled to the core language instead of nailed on the side with a library makes a really big difference in usability. – tchrist Apr 15 '11 at 03:41
3

If the question is only about regexes: neither. Use Perl.

You should choose between those languages based on the other, non-regex problems that you are trying to solve and on how much community support each language has in your field of endeavor.

If you are truly only picking a language based on regex support -- choose Perl...

dawg
2

Ruby's Regexp#match method is equivalent to Python's re.search(), not re.match(). re.search() and Regexp#match look for the first match anywhere in a string. re.match() looks for a match only at the beginning of a string.

To perform the equivalent of re.match(), a Ruby regular expression will need to start with \A, which anchors the match at the beginning of the string. (In Ruby, ^ matches at the beginning of any line, not only at the beginning of the string.)

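To perform the equivalent of Regexp#match in Python, use re.search(). Starting the pattern with .* and calling re.match() is only an approximation: the returned match then includes the leading text, . does not match newlines unless re.DOTALL is set, and the greedy .* finds the last occurrence rather than the first.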
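
The difference between re.match() and re.search() can be checked with a short Python sketch. The sample string and variable names are my own, chosen only for illustration.

```python
import re

text = "say hello\nhello again"

# re.match() only tries the pattern at the very start of the string.
anchored = re.match(r"hello", text)

# re.search() scans forward and reports the first match anywhere.
first = re.search(r"hello", text)

# With re.match(), a leading ".*" lets the pattern reach the word,
# but the match now includes the text in front of it.
prefixed = re.match(r".*hello", text)

print(anchored)          # no match at position 0
print(first.start())     # offset of the first "hello"
print(prefixed.group())  # matched text, prefix included
```

Here `anchored` is `None`, `first` starts at offset 4, and `prefixed` covers `"say hello"` rather than just the word itself.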

1

The regular expression libraries for Ruby and Python are developed by two completely independent teams. Even if they are identical now (and I wouldn't be certain they are), there's no guarantee that they won't diverge sometime in the future.

The safest position is to assume they're different now, and assume they will continue to be different in the future.

Greg Hewgill