225

When exploring regular expressions (otherwise known as regexes), there are many individuals who seem to see regular expressions as the Holy Grail. Something that looks so complicated must surely be the answer to any question. They tend to think that every problem is solvable using regular expressions.

On the other hand, there are also many people who try to avoid regular expressions at all costs. They try to find a way around regular expressions and accept additional coding just for the sake of it, even if a regular expression would be a more compact solution.

Why are regular expressions considered so controversial? Are there widespread misunderstandings about how they work? Or could it be a broad belief that regular expressions are generally slow?

Student
Gumbo
  • 10
    if this is a discussion, then shouldn't it be closed? but i see a real question in there so maybe the discussion tag doesn't belong? – RCIX Jun 26 '09 at 22:24
  • 6
    No kidding. You bring it up and people start getting all crazy around here. – Ryan Florence Jul 11 '09 at 04:59
  • 1
    Nice observation and wording in the question! – imz -- Ivan Zakharyaschev Jan 28 '11 at 19:49
  • Also see http://programmers.stackexchange.com/questions/223634/what-is-meant-by-now-you-have-two-problems – Pacerier Mar 11 '15 at 09:56
  • The question is opinion-based, so the rule should apply here as well (or the question should be edited to target a precise answer). That said, I presume that the regex controversy comes from the imprecision of the tutorials and manuals about it. Most of the time, if not all the time, information is mixed up, and additionally we are not given all the characteristics. Add to that language misuse, and you end up learning something only to notice down the road that it may mean something else. And finally, special regex characters are not limited to one meaning, which adds more confusion. – intika Jan 21 '20 at 12:35
  • Perhaps see also [Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms](https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la) which specifically discusses why to avoid regular expressions for structured formats like XML and HTML (and by extension JSON, YAML, source code in most languages, etc). – tripleee Jul 19 '21 at 07:26

22 Answers

144

I don't think people object to regular expressions because they're slow, but rather because they're hard to read and write, as well as tricky to get right. While there are some situations where regular expressions provide an effective, compact solution to the problem, they are sometimes shoehorned into situations where it's better to use an easy-to-read, maintainable section of code instead.

Kyle Cronin
  • 2
    And yes, regexes can be extremely extremely slow compared to using simple functions. And not just slow, but the performance of the regex engine can be **totally unpredictable** when faced with arbitrary (user-supplied) inputs. – Pacerier Dec 01 '15 at 21:53
  • 1
    If you know how regex works, it's not a problem at all. – Shiplu Mokaddim Nov 12 '17 at 13:25
  • 13
    @pacerier, it's not _slow patterns_, it's _slow engines_. Most (modern) regular expression _engines_ are unsuitable for complex patterns (e.g. many `|` or `.*`), because they use a stack machine and backtracking. That's why you have to carefully tune your regular expressions in Perl, Java, Python, Ruby… Old-style regular expression engines (in `grep`, for example) first compile the pattern to a DFA. Afterwards, the complexity of the pattern is largely irrelevant. I just used Java and grep for the same text and pattern: 22min vs 2s. Here's the science: http://swtch.com/~rsc/regexp/regexp1.html – hagello Dec 01 '17 at 21:08
  • It should be noted that in many use cases, especially in „one-time-uses“, it is probably better to use a readily available regex-based *tool* (e.g. one of the well-known UNIX tools), rather than write an actual „program“ which requires the installation of a proper development environment, knowledge about the programming language used, etc. – Arno Unkrig Jul 29 '22 at 06:45
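
The backtracking behaviour raised in the comments above can be illustrated with a small Python sketch. Both patterns are hypothetical examples, not taken from the thread: `NESTED` uses nested quantifiers, which a backtracking engine may explore in exponentially many ways on a near-miss input, while `FLAT` describes the same set of strings with a single quantifier. The helper name `agree` is made up for illustration.

```python
import re

# Nested quantifiers: a run of 'a' can be split between the inner and
# outer '+' in many ways, which a backtracking engine may try one by one.
NESTED = re.compile(r"^(a+)+$")

# Same language (one or more 'a'), but only one way to match it.
FLAT = re.compile(r"^a+$")


def agree(text):
    """Return True when both patterns give the same yes/no answer."""
    return bool(NESTED.match(text)) == bool(FLAT.match(text))


if __name__ == "__main__":
    # Kept deliberately short: with a backtracking engine, each extra 'a'
    # before the 'b' can roughly double the work for NESTED.
    print(agree("a" * 12 + "b"))
```

The two patterns accept exactly the same inputs, so the slowdown on failing input is a property of how the pattern is written and how the engine searches, not of what is being matched.
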
135

Making Regexes Maintainable

A major advance toward demystifying the patterns previously referred to as “regular expressions” is Perl’s /x regex flag (sometimes written (?x) when embedded), which allows whitespace (line breaking, indenting) and comments. This seriously improves readability and therefore maintainability. The white space allows for cognitive chunking, so you can see what groups with what.

Modern patterns also support both relatively numbered and named backreferences. That means you no longer need to count capture groups to figure out that you need $4 or \7. This helps when creating patterns that can be included in further patterns.

Here is an example of a relatively numbered capture group:

$dupword = qr{ \b (?: ( \w+ ) (?: \s+ \g{-1} )+ ) \b }xi;
$quoted  = qr{ ( ["'] ) $dupword  \1 }x;

And here is an example of the superior approach of named captures:

$dupword = qr{ \b (?: (?<word> \w+ ) (?: \s+ \k<word> )+ ) \b }xi;
$quoted  = qr{ (?<quote> ["'] ) $dupword  \g{quote} }x;
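
For readers who do not use Perl, a rough Python counterpart of the named-capture duplicate-word pattern might look like the sketch below. It assumes Python's standard `re` module, where `re.VERBOSE` plays the role of `/x`, `(?P<word>...)` names a group and `(?P=word)` refers back to it. The function name `find_repeated_words` is made up for illustration.

```python
import re

# Verbose mode lets the pattern be laid out and commented like code.
DUPWORD = re.compile(
    r"""
    \b
    (?P<word> \w+ )          # a word, captured under the name "word"
    (?: \s+ (?P=word) )+     # followed by one or more repeats of that word
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_repeated_words(text):
    """Return each word that appears two or more times in a row."""
    return [m.group("word") for m in DUPWORD.finditer(text)]


if __name__ == "__main__":
    print(find_repeated_words("This is is a test of the the the system"))
```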

Grammatical Regexes

Best of all, these named captures can be placed within a (?(DEFINE)...) block, so that you can separate out the declaration from the execution of individual named elements of your patterns. This makes them act rather like subroutines within the pattern.
A good example of this sort of “grammatical regex” can be found in this answer and this one. These look much more like a grammatical declaration.

As the latter reminds you:

… make sure never to write line‐noise patterns. You don’t have to, and you shouldn’t. No programming language can be maintainable that forbids white space, comments, subroutines, or alphanumeric identifiers. So use all those things in your patterns.

This cannot be over-emphasized. Of course if you don’t use those things in your patterns, you will often create a nightmare. But if you do use them, you need not.

Here’s another example of a modern grammatical pattern, this one for parsing RFC 5322:

use 5.10.0;

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

Isn't that remarkable — and splendid? You can take a BNF-style grammar and translate it directly into code without losing its fundamental structure!
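
Python's standard `re` module does not offer `(?(DEFINE)...)` or `(?&name)` calls, so a loose approximation of the same "name the pieces, then compose them" idea is to build the pattern from named string fragments. The sketch below is heavily simplified and is not the RFC 5322 grammar: the fragment names mirror the ones above, but `split_addr` and the reduced rules are assumptions made for illustration only.

```python
import re

# Each fragment is an ordinary string, so it can be named, documented
# and reused, much like a rule in a grammar.
# The hyphen is escaped so that it is a literal character, not a range.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
DOT_ATOM = rf"{ATEXT}+(?:\.{ATEXT}+)*"
ADDR_SPEC = rf"(?P<local>{DOT_ATOM})@(?P<domain>{DOT_ATOM})"

ADDR_SPEC_RE = re.compile(ADDR_SPEC)


def split_addr(text):
    """Return (local, domain) for a simple address, or None if it does not fit."""
    m = ADDR_SPEC_RE.fullmatch(text)
    return (m.group("local"), m.group("domain")) if m else None


if __name__ == "__main__":
    print(split_addr("john.doe@example.com"))
```

Unlike the `(?(DEFINE)...)` version, string composition cannot express recursive rules such as nested comments, so this only covers the non-recursive part of the idea.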

If modern grammatical patterns still aren’t enough for you, then Damian Conway’s brilliant Regexp::Grammars module offers an even cleaner syntax, with superior debugging, too. Here’s the same code for parsing RFC 5322 recast into a pattern from that module:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
                             \] <CFWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+\-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]

    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]

    }x;

};


while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

There’s a lot of good stuff in the perlre manpage, but these dramatic improvements in fundamental regex design features are by no means limited to Perl alone. Indeed the pcrepattern manpage may be an easier read, and covers the same territory.

Modern patterns have almost nothing in common with the primitive things you were taught in your finite automata class.

Community
Joel Berger
  • 9
    YES! YES! Finally, someone shows a great example of just how readable regexes can be with the x modifier. I can't believe how few people know that it exists, let alone actually use it. – Shabbyrobe Nov 24 '10 at 11:14
  • 1
    @Shabbyrobe: It's not just `/x`. It’s using the regexes grammatically, with `(?&name)` internal regex subroutines, that really makes this shine. – tchrist Nov 24 '10 at 20:04
  • +1 You always learn something new. I didn't know that PCRE had a "false" condition for defines. – NikiC Feb 13 '11 at 13:48
  • 5
    Python similarly has an `re.VERBOSE` flag. – Mechanical snail May 27 '13 at 06:53
  • +10!!! Damn **this** changed my mind about regex (and made me think I'm stupid because I never look at them seriously). **Superb answer.** I just can say "thank you" and I know it's not enough. – Adriano Repetti Sep 03 '14 at 09:19
  • 7
    Just gunna go ahead and say that I am still amazed at the lengths that people will go to in order to make regex usable. – Slater Victoroff Jan 24 '15 at 19:39
  • Might it be best to use `` on this (i.e., turn off code highlighting, since it can't cope with the complexity of the syntax)? – TRiG Jun 13 '17 at 14:47
  • Calling the above “readable code” baffles me. – Adam B May 13 '22 at 04:23
70

Regexes are a great tool, but people think "Hey, what a great tool, I will use it to do X!" where X is something that a different tool is better for (usually a parser). It is the standard "using a hammer where you need a screwdriver" problem.

Chas. Owens
  • 6
    Just remember that most parsers -lexical analyzers- still use regular expressions to parse their stuff :-) – Jasper Bekkers Apr 19 '09 at 03:57
  • 64
    Saying that parsers use regular expressions is like saying parsers use assignment statements. It means nothing until you look to see how they are being used. – Chas. Owens Apr 19 '09 at 16:32
  • 26
    Using a RegEx when a parser is better is annoying. Using a RegEx when the language's standard string find or replace functions will work (and in linear time usually) is just unforgivable. – jmucchiello Feb 02 '10 at 21:51
  • 1
    Agreed, because a RegEx has to be a jack of all trades its processing overhead is huge. Just because using a RegEx engine seems easy doesn't mean it's a better solution over an iterative parser (developer dependent threshold). One of my favourite examples is PHP's `split($pattern,$string)` vs `explode($delimiter,$string)` - thankfully the former is getting deprecated, but lots of code used the former when they only needed the power of the latter. Agreed, RegEx's provide an easy tool to do some things but unless you need the full power of regular expressions they – Rudu Sep 01 '10 at 16:03
  • 5
    [Lexical analysers](http://en.wikipedia.org/wiki/Lexical_analysis) may indeed use regexes. They are also known as tokenizers, but they are not [syntactic analysers](http://en.wikipedia.org/wiki/Parsing) (or parsers). To read a complicated enough string, a tokenizer should be used to read the string as tokens (perhaps with regexes, perhaps not, depending on the tokenizer). These tokens should then be passed to the parser, which will process them with grammar rules, which are definitely not regexes. – Axel Apr 13 '11 at 13:27
  • But if you think much enough, you can create a own parser using it. Just like me. –  Aug 31 '17 at 14:58
  • Regexes do not necessarily „have a huge processing overhead“. *Compiling* a regex is expensive, but as soon as that is done, it may execute as fast as „String.indexOf()“. Or even faster, if it uses algorithms like Boyer-Moore-Horspool. Take a look at github.com/aunkrig/lfr . – Arno Unkrig Jul 29 '22 at 06:34
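
The tokenizer/parser split described in the comments above can be sketched in Python. This is a minimal illustration under assumed rules, not anyone's production design: a regular expression only recognises individual tokens (numbers and operators), while small recursive functions handle the nesting of parentheses. All names (`tokenize`, `parse_expr`, `evaluate`, and so on) are made up for the example.

```python
import re

# The regex only describes single tokens; it knows nothing about nesting.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<op>[-+*/()]))")


def tokenize(text):
    """Split text into (kind, value) pairs, e.g. ("num", "42") or ("op", "+")."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_expr(tokens, i=0):
    """expr := term (('+' | '-') term)*"""
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in (("op", "+"), ("op", "-")):
        rhs, j = parse_term(tokens, i + 1)
        value = value + rhs if tokens[i][1] == "+" else value - rhs
        i = j
    return value, i


def parse_term(tokens, i):
    """term := factor (('*' | '/') factor)*"""
    value, i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] in (("op", "*"), ("op", "/")):
        rhs, j = parse_factor(tokens, i + 1)
        value = value * rhs if tokens[i][1] == "*" else value / rhs
        i = j
    return value, i


def parse_factor(tokens, i):
    """factor := NUMBER | '(' expr ')'"""
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, text = tokens[i]
    if kind == "num":
        return int(text), i + 1
    if text == "(":
        value, i = parse_expr(tokens, i + 1)   # recursion handles nesting
        if i >= len(tokens) or tokens[i] != ("op", ")"):
            raise ValueError("expected ')'")
        return value, i + 1
    raise ValueError(f"unexpected token {text!r}")


def evaluate(text):
    """Tokenize with a regex, then parse and evaluate with the grammar rules."""
    tokens = tokenize(text)
    value, i = parse_expr(tokens)
    if i != len(tokens):
        raise ValueError("trailing tokens")
    return value
```

Calling `evaluate("2 * (3 + 4) - 1")` walks the nested parentheses through recursion in `parse_factor`, which is the part a single classical regular expression cannot express.
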
54

Almost everyone I know who uses regular expressions regularly (pun intended) comes from a Unix-ish background where they use tools that treat REs as first-class programming constructs, such as grep, sed, awk, and Perl. Since there's almost no syntactic overhead to use a regular expression, their productivity goes way up when they do.

In contrast, programmers who use languages in which REs are an external library tend not to consider what regular expressions can bring to the table. The programmer "time-cost" is so high that either a) REs never appeared as part of their training, or b) they don't "think" in terms of REs and prefer to fall back on more familiar patterns.

Barry Brown
  • 12
    Yeah, I never forgave Python for making the regex syntax verbose by using a library. I think it's purity over sanity. – slikts Sep 01 '10 at 19:53
  • 8
    I come from a unix background, used sed, awk & perl loads, and of course did plenty of grepping, but know that when I use a regex, it's a write-only hack that I'll hate maintaining. It's good for shell scripts/one-timers, but for real work, for anything that's not just grab-some-data-to-save-now, I now use a proper tokenizer/lexer/parser with clear syntax. My favourite does all/any, cleanly + can self-optimise. I've learnt the hard way, and over many years, that a bit of self-discipline at the start means less effort later. A regex is a moment on the keyboard, and a lifetime on the frown. – AndrewC Sep 17 '12 at 23:51
44

Regular expressions allow you to write a custom finite-state machine (FSM) in a compact way, to process a string of input. There are at least two reasons why using regular expressions is hard:

  • Old-school software development involves a lot of planning, paper models, and careful thought. Regular expressions fit into this model very well, because to write an effective expression properly involves a lot of staring at it, visualizing the paths of the FSM.

    Modern software developers would much rather hammer out code, and use a debugger to step through execution, to see if the code is correct. Regular expressions do not support this working style very well. One "run" of a regular expression is effectively an atomic operation. It's hard to observe stepwise execution in a debugger.

  • It's too easy to write a regular expression that accidentally accepts more input than you intend. The value of a regular expression isn't really to match valid input, it's to fail to match invalid input. Techniques to do "negative tests" for regular expressions are not very advanced, or at least not widely used.

    This goes to the point of regular expressions being hard to read. Just by looking at a regular expression, it takes a lot of concentration to visualize all possible inputs that should be rejected, but are mistakenly accepted. Ever try to debug someone else's regular expression code?

If there's a resistance to using regular expressions among software developers today, I think it's chiefly due to these two factors.
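
The "negative tests" mentioned above can be as simple as keeping a list of inputs that must be rejected next to the list of inputs that must match. The sketch below is a hypothetical example, assuming a date format of `YYYY-MM-DD` (format only, no calendar checks); the names `DATE`, `SHOULD_MATCH`, `SHOULD_REJECT` and `wrongly_accepted` are made up for illustration.

```python
import re

# Format check only: four digits, two digits, two digits.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

SHOULD_MATCH = ["2009-06-26", "1999-12-31"]
SHOULD_REJECT = ["12009-06-261", "2009-6-26", "2009/06/26", "date: 2009-06-26"]


def wrongly_accepted(matcher):
    """Return the inputs from SHOULD_REJECT that the matcher accepts anyway."""
    return [s for s in SHOULD_REJECT if matcher(s)]


if __name__ == "__main__":
    # search() finds the pattern anywhere in the string, so it accepts
    # more than intended; fullmatch() requires the whole string to fit.
    print("search accepts:", wrongly_accepted(DATE.search))
    print("fullmatch accepts:", wrongly_accepted(DATE.fullmatch))
```

The same pattern text passes every positive test in both modes; only the reject list reveals that the unanchored use accepts too much.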

Community
Bill Karwin
39

People tend to think regular expressions are hard, but that's because they're using them wrong: writing complex one-liners without any comments, indenting or named captures. (You don't cram your complex SQL expression in one line, without comments, indenting or aliases, do you?) So yes, for a lot of people, they don't make sense.

However, if your job has anything to do with parsing text (roughly any web-application out there...) and you don't know regular expressions, you suck at your job and you're wasting your own time and that of your employer. There are excellent resources out there to teach you everything about them that you'll ever need to know, and more.

Jasper Bekkers
  • 2
    Well .. the difference is that multiple spaces have meaning in regex, where in other languages they don't and that's why they are usually one liners (that sometimes wrap to multiple lines :) – Rado Aug 06 '09 at 19:58
  • @Rado: In that case it's usually easier to make them explicit as [ ] or \s – Jasper Bekkers Aug 22 '09 at 19:27
  • 14
    @Rado: Perl, for instance, has the `x` modifier for regexes that causes whitespace to be ignored. This allows you to put the regex on a few lines and add comments. – Nathan Fellman Sep 03 '09 at 20:30
  • 9
    Likewise Python has `re.X` a.k.a. `re.VERBOSE`. – Craig McQueen Jan 12 '10 at 07:44
  • 2
    Likewise the `x` modifier in tcl. I believe it's quite standard since tcl, unlike other languages, does not use PCRE. – slebetman Oct 29 '10 at 15:55
  • If you think web data processing = regex you should read (the great) Jeff's [blog entry](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html). I'd agree that if you don't know regex, you're missing an important tool, but conversely if you overuse or abuse regex or use ugly one-liners, where you could write clean, easy-to-maintain xml-library calls, you should be kept away from a compiler. The software maintenance team want to change the locks on your office door. – AndrewC Sep 18 '12 at 00:08
  • 2
    @AndrewC That is one of the grossest misinterpretations this post could've gotten. – Jasper Bekkers Feb 14 '13 at 18:52
  • You didn't explicitly say we should use regex for xml etc, but you do imply that regex is the best for processing most webapp data and indeed most text data. When I was young I wrote loads of regexes, then I learned less brittle, more maintainable ways. It's _not_ saving time for your employer if you do a quick and dirty job now but you waste twice that saved time in the maintenance team. You can "suck at your job" for many reasons, one of which is being a one-trick pony, another is write-only code. You _can_ write well laid out & documented regex; well done if you do. Better if you're robust. – AndrewC Feb 24 '13 at 22:52
  • 2
    @AndrewC The last part of you comment is pretty much repeating what the first paragraph of my posts says and the rest is just rehashing default arguments that everybody heard a million times before. And I didn't imply anything, knowing how, what & when to use regexes is a fundamental if your job requires any kind of text processing. If you use them or not, fine. But you can't do that task well without knowing them. Just FYI, the stuff I didn't explicitly say is left out on purpose. – Jasper Bekkers Feb 25 '13 at 00:11
  • I have to say that knowing _which_ text processing tool is best for the problem at hand, and being able to use _each_ tool well is fundamental. _Overuse_ of regex is harmful. Your answer and "you suck at your job" and all this provides no balance at all; of course it implies you think regex is best. Explicit /= implicit. I first commented because I felt a newbie could interpret your answer as advocating regex for everything. – AndrewC Feb 25 '13 at 00:42
  • @AndrewC The balance is provided in the other answers, though they might not have been there when I posted it, the "watch out for the newbies" posts are in every single discussion about regexes ever and I was counting on them to also appear in this one. – Jasper Bekkers Feb 25 '13 at 00:47
  • I think what I said in the last part had a different meaning. You can be a one-trick pony by just knowing regex. "You _can_ write well laid out..." should have started with the word "Whilst", and finished with ", but this is atypical; well done if you do.". "Better if you're robust." means that many of the alternatives come with better support for fault tolerance. – AndrewC Feb 25 '13 at 00:51
  • Maybe there's a reason that these default arguments have been heard a million times before. Maybe it's good advice born from experience which is worth repeating. – AndrewC Feb 25 '13 at 00:53
  • 1
    @AndrewC If you want to change the grammar, edit the post. The problem I have with the default arguments is that they are not necessarily true. "Regex are unreadable" -> sure if you cram them in one line without comments & subroutines everything is unreadable. "Don't parse xml with regex" -> right, except 1. its fine as long as you don't expect nesting to work for quick tasks (DOM is preferred) and 2. the lexer that'll normally parse the xml is based on regex. "Parsers are better" -> except regexes are the corner stone of most parsers. There is always more nuance than just "x is evil". – Jasper Bekkers Feb 25 '13 at 06:35
  • 1
    Lexers don't parse, and you mean they are potentially misleading rather than "not necessarily true". Other than that, this is my favourite bit of what you've said so far. If you'd started your post with "There is always more nuance than just "x is evil"." and skipped the "suck at your job" bit, I might have upvoted instead of commenting. (I haven't downvoted btw.) Anyway we're having a lengthy discussion and I'd like to call a truce rather than carry on to the point where the engine starts to nag us to go to chat. Happy coding! – AndrewC Feb 28 '13 at 00:55
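
The point in the comments about whitespace being significant can be shown with a short Python sketch, assuming the standard `re` module and its `re.VERBOSE` flag. The three pattern names are made up for illustration.

```python
import re

# Without the verbose flag, the space is a literal character to match.
PLAIN = re.compile(r"hello world")

# With re.VERBOSE, unescaped whitespace in the pattern is ignored,
# so this actually matches "helloworld".
VERBOSE_NAIVE = re.compile(r"hello world", re.VERBOSE)

# Writing the space as a character class keeps it significant,
# while the spaces around it are still just layout.
VERBOSE_FIXED = re.compile(r"hello [ ] world", re.VERBOSE)

if __name__ == "__main__":
    for name, pattern in [("plain", PLAIN),
                          ("verbose, naive", VERBOSE_NAIVE),
                          ("verbose, fixed", VERBOSE_FIXED)]:
        print(name, bool(pattern.fullmatch("hello world")))
```
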
30

Because they lack the most popular learning tool in the commonly accepted IDEs: There's no Regex Wizard. Not even Autocompletion. You have to code the whole thing all by yourself.

dkretz
  • 3
    Then you're using the wrong IDE... Even my text editor provides regex hints. – CurtainDog Apr 19 '09 at 01:25
  • The point is that some can't manage very well without it. But what editor are you referring to? And how does it relate to IDE features? – dkretz Apr 19 '09 at 02:42
  • 1
    On a side note, Expresso and The Regex Coach are very useful tools for constructing regular expressions. – Mun Apr 19 '09 at 03:07
  • 27
    How in the world would you autocomplete a regular expression? – AmbroseChapel Apr 19 '09 at 09:55
  • Autocompletes could bring up character sets, greedy vs possessive vs non-greedy matches, look ahead and look behind, also bracket matching, etc. Regexes are succinct but there is still some room for help from the editor. – CurtainDog Apr 19 '09 at 11:59
  • 3
    EditPad Pro has syntax highlighting for regexes in the search box, but I find it more annoying than helpful, and keep it turned off. But I do appreciate it letting me know when I have unmatched brackets; parentheses in particular can be a bear to keep track of. – Alan Moore Apr 20 '09 at 13:58
  • Use Expresso! Regexes don't need a wizard; they're easy to write. – wonea Jul 07 '10 at 14:24
  • 2
    @AmbroseChapel - I'm a couple years late to this discussion. But I created an autocompletion mechanism at http://regexhero.net/tester/ It's initiated by the common constructs inside round `()`, square `[]`, or curly `{}` brackets. It'll also work off of the backslash. – Steve Wortham Mar 09 '11 at 19:24
17

"Regular Expressions: Now You Have Two Problems" is a great article from Jeff Atwood on the matter. Basically, regular expressions are "hard"! They can create new problems. They are effective, however.

Peter Mortensen
Anthony
16

I don't think they're that controversial.

I also think you've sort of answered your own question, because you point out how silly it would be to use them everywhere (not everything is a regular language) or to avoid using them at all. You, the programmer, have to make an intelligent decision about when regular expressions will help the code or hurt it. When faced with such a decision, two important things to keep in mind are maintainability (which implies readability) and extensibility.

For those that are particularly averse to them, my guess is that they've never learned to use them properly. I think most people who spend just a few hours with a decent tutorial will figure them out and become fluent very quickly. Here's my suggestion for where to get started:

http://docs.python.org/howto/regex

Although that page talks about regular expressions in the context of Python, I've found the information is very applicable elsewhere. There are a few things that are Python-specific, but I believe they are clearly noted, and easy to remember.

allyourcode
11

Regular expressions are to strings what arithmetic operators are to numbers, and I wouldn't consider them controversial. I think that even a fairly militant OO activist like myself (who would tend to choose other objects over strings) would be hard pressed to reject them.

Peter Mortensen
CurtainDog
7

The problem is that regexes are potentially so powerful that you can do things with them that you should use something different for.

A good programmer should know where to use them, and where not. The typical example is parsing non-regular languages (see Deciding whether a language is regular).

I think that you can't go wrong if you at first restrict yourself to real regular expressions (no extensions). Some extensions can make your life a bit easier, but if you find something hard to express as a real regex, this may well be an indication that a regex is not the right tool.

Svante
5

You almost may as well be asking about why goto's are controversial.

Basically, when you get so much "obvious" power, people are apt to abuse them for situations they aren't the best option for. The number of people asking to parse CSVs or XML or HTML in regexes, for example, astounds me. It's the wrong tool for the job. But some users insist on using regexes anyway.

Personally, I try to find that happy medium - use regexes for what they're good for, and avoid them when they're less than optimal.

Note that regexes can still be used to parse CSVs, XML, HTML, etc. But usually not in a single regex.
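
As one illustration of reaching for a format-aware tool instead, here is a Python sketch that reads CSV with the standard `csv` module. The sample data and the helper names `parse_csv` and `naive_split` are made up for the example; the naive version is included only to show where a simple comma split goes wrong.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a list of rows, honouring quotes and escaped quotes."""
    return list(csv.reader(io.StringIO(text)))


def naive_split(line):
    """Split on every comma, ignoring quoting (shown for contrast only)."""
    return line.split(",")


if __name__ == "__main__":
    sample = 'name,quote\n"Doe, Jane","She said ""hi"""\n'
    print(parse_csv(sample))
    print(naive_split('"Doe, Jane","She said ""hi"""'))
```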

Tanktalus
  • Sure you can parse any of these formats in a single regex, that's the power of regexes, baby! Whether or not you want to do that, is a different matter entirely. – Jasper Aug 23 '10 at 12:34
5

I don't think "controversial" is the right word.

But I've seen tons of examples where people say "what's the regular expression I need to do such-and-such a string manipulation?" which are X-Y problems.

In other words, they've started from the assumption that a regex is what they need, but they'd be better off with a split(), a translation like perl's tr/// where characters are substituted one for the other, or just an index().
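
For comparison, the three alternatives named above have close counterparts among Python's built-in string methods. The sketch below is only an illustration; the function names and sample data are made up, and `str.translate` is used here as a rough analogue of Perl's `tr///`.

```python
# Character-for-character substitution table, similar in spirit to tr/abc/xyz/.
SWAP = str.maketrans("abc", "xyz")


def fields(line):
    """Split a colon-separated record: no pattern needed for a fixed delimiter."""
    return line.split(":")


def substitute(text):
    """Replace a->x, b->y, c->z one character at a time."""
    return text.translate(SWAP)


def position(haystack, needle):
    """Return the index of the first occurrence of needle, or -1 if absent."""
    return haystack.find(needle)


if __name__ == "__main__":
    print(fields("root:x:0:0"))
    print(substitute("aabbcc"))
    print(position("hello world", "world"))
```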

AmbroseChapel
4

This is an interesting subject.
Many regexp aficionados seem to confuse the conciseness of the formula with efficiency.
On top of that, a regexp that requires a lot of thought gives its author such satisfaction that it seems legitimate straight away.

But... regexps are so convenient when performance is not an issue and you need to deal quickly with a text output, in Perl for instance. Also, when performance is an issue, one may prefer not to try to beat the regexp library with a homemade algorithm that may be buggy or less efficient.

Besides, there are a number of reasons for which regexps are unfairly criticized, for instance:

  • the regexp is not efficient, because building the best one is not obvious
  • some programmers "forget" to compile a regexp only once when it is used many times (like a static Pattern in Java)
  • some programmers go for the trial-and-error strategy, which works even less well with regexps!
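
The "compile only once" point above has a direct Python counterpart to the static `Pattern` in Java: compile the pattern at module level and reuse the compiled object. This is a minimal sketch; `WORD_RE` and `count_words` are names made up for illustration.

```python
import re

# Compiled once when the module is imported, then reused on every call.
WORD_RE = re.compile(r"\b\w+\b")


def count_words(lines):
    """Count words across an iterable of lines using the precompiled pattern."""
    return sum(len(WORD_RE.findall(line)) for line in lines)


if __name__ == "__main__":
    print(count_words(["hello world", "regexes are fun"]))
```

Python's `re` module also keeps an internal cache of recently compiled patterns, so the cost of repeated `re.findall(pattern, ...)` calls is often smaller than in Java, but an explicit module-level pattern makes the intent clear and does not depend on that cache.
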
Déjà vu
3

I think that learning and maintaining regexes is what makes them unpopular. Most developers are lazy, or rely on external libraries to do the parsing for them; they rely on Google for the answer and even ask in forums for the complete code for their problem. But when it comes to implementing or modifying/maintaining a regex, they simply fail.

There is a popular saying: "Friends don't let friends use regex for parsing HTML."

But as far as I am concerned, I have made complete HTML parsers using regex, and I find that regexes are better at parsing HTML strings, both speed-wise and memory-wise (if you have an idea of what you want to achieve :) ).

Rajeev
  • 2
    I think it's disingenuous to write off most developers... as lazy. I would say that the syntax is very cryptic, unintuitive, and full of gotchas to the uninitiated, which leads to a high barrier to entry. For the same reason Perl has a "bad" reputation to many, but is also a very powerful language. It's like trying to read mathematical expressions before you know the symbols. It's daunting, and developers have to be judicious with their time to know they'll get benefits from learning that syntax. – Katastic Voyage Apr 19 '18 at 02:20
  • You **will** miss edge cases in HTML because HTML is not a regular language. You are safe if your intention is to parse a known subset of HTML – Boyang Oct 17 '18 at 06:37
3

Regular expressions are a serious mystery to a lot of people, including myself. They work great, but it's like looking at a math equation. I'm happy to report though that somebody has finally created a consolidated location of various regular expression functions at http://regexlib.com/. Now if Microsoft would only create a regular expression class that would automatically do much of the common stuff like eliminating letters, or filtering dates.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Al Katawazi
  • 7,192
  • 6
  • 26
  • 39
  • 2
    You're missing the point. The idea of regexes is that you invest some time in learning them and when you are done, you no longer need some magical "read a date" class. Instead, it takes very little effort to write a regex for them. Moreover, it will take just as little effort to write one for "yyyy/mm/dd" as it takes to write one for "mm-dd-yyyy", or even one for "mm-yyyy/dd" (which won't happen too often, but it's an example of how you can do things that a magical class never can). – Jasper Aug 23 '10 at 12:31
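
To make the comment's claim concrete, here is a small Python sketch with one pattern per date layout it mentions. The constant names are invented for the example, and the patterns only check the shape of the string (digit counts and separators), not whether the date exists on a calendar.

```python
import re

# One pattern per layout named in the comment; only the separators
# and the group order change.
YMD = re.compile(r"(\d{4})/(\d{2})/(\d{2})")  # yyyy/mm/dd
MDY = re.compile(r"(\d{2})-(\d{2})-(\d{4})")  # mm-dd-yyyy
MYD = re.compile(r"(\d{2})-(\d{4})/(\d{2})")  # mm-yyyy/dd

ymd_parts = YMD.fullmatch("2010/08/23").groups()
mdy_parts = MDY.fullmatch("08-23-2010").groups()
myd_parts = MYD.fullmatch("08-2010/23").groups()
```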
2

I find regular expressions invaluable at times: when I need to do some "fuzzy" searches, and maybe replaces, or when data may vary and have a certain randomness. However, when I need to do a simple search and replace, or check for a string, I do not use regular expressions. I know many people who do, though; they use them for everything. That is the controversy.

If you want to put a tack in the wall, don't use a hammer. Yes, it will work, but by the time you get the hammer, I could put 20 tacks in the wall.

Regular expressions should be used for what they were designed for, and nothing less.

Brent Baisley
  • 962
  • 1
  • 6
  • 4
1

I think it is a lesser-known technique among programmers, so it is not widely accepted. And if you have a non-technical manager reviewing your code or your work, then a regular expression is very bad: you will spend hours writing a perfect regular expression, and you will get few marks for the module because the manager thinks you have written so few lines of code. Also, as said elsewhere, reading regular expressions is a very difficult task.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Satya Prakash
  • 3,372
  • 3
  • 31
  • 47
  • 1
    Reading regular expressions is a difficult task only when the programmer who crafted them failed to use whitespace, comments, alphanumeric identifiers, and perhaps also embedded subroutines via delayed execution. In short, all the software engineering techniques applicable to general programming should also be followed in regular expressions. If these principles are ignored, then the writer is not producing professional code. – tchrist Nov 07 '10 at 16:20
  • I think your manager doesn't know that "The real hero of programming is the one who writes negative code." – Rajeev Mar 31 '11 at 10:22
  • If your manager is going to ding you for accomplishing the job with 3 lines of code (including regexps), while praising some doofus coworker who did it in 900 lines of Assembler... I suggest finding a new job. – Phil Perry Aug 02 '13 at 22:57
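
The first comment above mentions whitespace, comments and alphanumeric identifiers inside a pattern. A minimal sketch of what that can look like in Python, using `re.VERBOSE` and named groups, is below. The "key = value" format and the name `ASSIGNMENT` are hypothetical, and Python's `re` has no counterpart to the embedded subroutines the comment also mentions.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments,
# so the pattern can be laid out and documented like ordinary code.
ASSIGNMENT = re.compile(r"""
    ^\s*
    (?P<key>   [A-Za-z_]\w* )   # identifier on the left-hand side
    \s* = \s*                   # equals sign, optionally padded
    (?P<value> .*? )            # value, without trailing whitespace
    \s*$
""", re.VERBOSE)

match = ASSIGNMENT.match("  timeout = 30  ")
key = match.group("key")
value = match.group("value")
```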
0

Decent regular expression systems, such as those used in lex and yacc for compiler definition, are good, very useful and clean. In these systems, expression types are defined in terms of others. It's the hideous malformed unreadable line-noise giant one-liner regular expressions commonly found in Perl and sed code (etc.) that are 'controversial' (garbage).

Sam Watkins
  • 7,819
  • 3
  • 38
  • 38
0

While I think regexes are an essential tool, the most annoying thing about them is that there are different implementations. Slight differences in syntax, modifiers, and -especially- "greed" can make things really chaotic, requiring trial-and-error and sometimes generating puzzling bugs.
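
For readers unfamiliar with "greed", here is a small sketch of greedy versus non-greedy quantifiers. It shows Python's `re` module only, with a made-up sample string; it does not demonstrate the cross-implementation differences the answer describes, and other engines may spell or handle this differently.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, so the match runs to the last </b>.
greedy = re.search(r"<b>.*</b>", text).group(0)

# Non-greedy: .*? takes as little as it can, so the match stops at the first </b>.
lazy = re.search(r"<b>.*?</b>", text).group(0)
```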

ndr
  • 129
  • 2
  • how do regex implementations differ in their approach to maximal matching, the thing which I think you are calling “greed”? Do you mean the difference between **leftmost-longest** versus **longest-leftmost** semantics? That’s the only difference I’m aware of; i.e., whether greed trumps eagerness or *vice versa*. – tchrist Nov 07 '10 at 15:15
-1

In some cases I think you HAVE to use them. For instance to build a lexer.

In my opinion, this is a matter of viewpoint between people who can write regexps and people who can't (or hardly can). I personally think they are a good thing, for example to validate the input of a form, be it in JavaScript to warn the user, or in a server-side language.

Aif
  • 11,015
  • 1
  • 30
  • 44
-4

The best valid and normal usage for regex is for email address format validation.

That's a good application of it.

I have used regular expressions countless times as one-offs in TextPad to massage flat files, create csv files, create SQL insert statements and that sort of thing.

Well-written regular expressions shouldn't be too slow. Usually the alternatives, like tons of calls to Replace, are far slower options. Might as well do it in one pass.
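
A sketch of the "one pass" idea in Python, set against a chain of plain replace calls. The replacement table and function names are made up for the example, and nothing here was timed: which version is faster depends on the implementation and the input.

```python
import re

# Hypothetical replacement table, used by both versions.
REPLACEMENTS = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_chained(text):
    """One str.replace() call per entry, so the text is scanned several times."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


# A single alternation built from the escaped keys, compiled once.
PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))


def escape_single_pass(text):
    """One scan of the text; each match is looked up in the table."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)
```

The chained version also depends on the order of the replacements ("&" has to go first here), which the single-pass version avoids.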

Many situations call for exactly regular expressions and nothing else.

Replacing special non-printing characters with innocuous characters is another good usage.

I can of course imagine that there are some codebases that overuse regular expressions to the detriment of maintainability. I have never seen that myself. I have actually been criticized by code reviewers for not using regular expressions enough.

Chris Morley
  • 2,426
  • 2
  • 19
  • 20
  • 10
    Experience shows that regexes are actually a pretty poor tool for email address format validation. A truly complete format validator implemented as a regex is a multi-hundred-character monstrosity, while most of the shorter "good enough" validators that most people take 5 minutes to create will reject large categories of valid, deliverable addresses. – Dave Sherohman Apr 19 '09 at 12:36
  • I hear ya dude. I was talking about the "good enough" and while the large swaths may be large in theory, consider the percentage of coverage you get in such a short expression. I too have seen the monstrosity, but what is your elegant alternative? – Chris Morley Apr 19 '09 at 15:16
  • 2
    I've used something like \w@\w+.\w+ to find email addresses quickly in a huge directory of files where speed was important and a few false positives or false negatives weren't important. But the best way to validate an email address seems to be to send email to it. – RossFabricant Sep 03 '09 at 20:39
  • Yeah, the email address spec is a nasty mess http://stackoverflow.com/questions/611775/regular-expression-for-valid-email-address-closed – Nick Van Brunt Sep 03 '09 at 21:31
  • @Nick, @Dave: [Mail address validation](http://stackoverflow.com/questions/764247/why-are-regular-expressions-so-controversial/4053506#4053506) need not be a nasty mess. – tchrist Dec 01 '10 at 00:24