48

At the risk of opening a can of worms and getting negative votes, I find myself needing to ask:

When should I use Regular Expressions and when is it more appropriate to use String Parsing?

And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.

I found another question here, but only one of its answers even bothered to give an example. I need more than that to understand this.

I'm currently playing around in C++, but regular expressions are in almost every higher-level language. I'd also like to know how different languages use and handle regular expressions, but that's more of an afterthought.

Thanks for the help in understanding it!

Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)

Community
Dan
  • possible duplicate of [When is it best to use Regular Expressions over basic string spliting / substring'ing?](http://stackoverflow.com/questions/357814/when-is-it-best-to-use-regular-expressions-over-basic-string-spliting-substrin) – nawfal Jun 05 '13 at 10:41

2 Answers

42

It depends on how complex the language you're dealing with is.

Splitting

This is great when it works, but only works when there are no escaping conventions. It does not work for CSV for example because commas inside quoted strings are not proper split points.

foo,bar,baz

will be split correctly, but

foo,"bar,baz"

will not be.
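A minimal sketch of the difference, in Python purely for brevity. The standard `csv` module stands in here for any quote-aware reader; the variable names are only illustrative.

```python
import csv
import io

line = 'foo,"bar,baz"'

# Naive splitting treats the comma inside the quotes as a separator.
naive_fields = line.split(",")

# A quote-aware reader keeps the quoted field together and drops the quotes.
csv_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)  # ['foo', '"bar', 'baz"']
print(csv_fields)    # ['foo', 'bar,baz']
```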

Regular

Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references but the general rule of thumb is this:

If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
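To illustrate why nesting needs more than a classic regular expression, here is a rough Python sketch of a bracket checker. It has to remember an unbounded number of open brackets, which is exactly the state a regular grammar lacks. The function name `is_balanced` is made up for this example.

```python
def is_balanced(text):
    """Return True if every ( and [ in text is closed in the right order."""
    pairs = {")": "(", "]": "["}
    stack = []  # open brackets seen so far and not yet closed
    for ch in text:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack  # leftover openers mean the text is unbalanced
```

For instance, `is_balanced("(a[b](c))")` is true while `is_balanced("(a[b)]")` is false.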

You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong tool for parsing complicated arithmetic expressions though.
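As a small sketch of the "known number of chunks" case, assuming dates written as month/day/year with a four-digit year (the pattern and the helper name `parse_date` are assumptions for illustration, not a general date parser):

```python
import re

# Three capture groups: month, day, year.
DATE_RE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")

def parse_date(text):
    """Return (month, day, year) as ints, or None if the shape does not match."""
    m = DATE_RE.fullmatch(text)
    if m is None:
        return None
    month, day, year = (int(g) for g in m.groups())
    return month, day, year
```

This only checks the shape of the input. Whether 2/31/2012 is a real calendar date is a separate question that a regular expression is poorly suited to answer.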

Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.

Context free

Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.

Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.

The rule of thumb for CF grammars is

If regular expressions are insufficient but all words in the language have the same meaning regardless of prior declarations then CF works.
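The tokenize-then-parse split described above might look roughly like this in Python. It is a sketch under assumptions: the toy grammar only has names, `+`, `*`, and parentheses, and the node labels `Plus`, `Times`, and `Var` are chosen just for this example.

```python
import re

# Regular expression for the lexer: whitespace, identifiers, punctuation.
TOKEN_RE = re.compile(r"\s+|[A-Za-z_]\w*|[()+*]")

def tokenize(text):
    """Break text into tokens, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        if not m.group().isspace():
            tokens.append(m.group())
        pos = m.end()
    return tokens

def parse(tokens):
    """Recursive-descent parser: '*' binds tighter than '+'."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr := term ('+' term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("Plus", node, term())
        return node

    def term():  # term := atom ('*' atom)*
        node = atom()
        while peek() == "*":
            advance()
            node = ("Times", node, atom())
        return node

    def atom():  # atom := NAME | '(' expr ')'
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise ValueError(f"unexpected token: {tok!r}")
        return ("Var", tok)

    tree = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return tree
```

With these definitions, `parse(tokenize("(a + b)*c"))` yields `("Times", ("Plus", ("Var", "a"), ("Var", "b")), ("Var", "c"))`. The regular expression never has to know about nesting or precedence; the grammar functions handle both.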

Non context free

If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.

For example, in C,

#ifdef X
  typedef int foo;
#endif

foo * bar;

If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.
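A toy Python sketch of that dependence on context: the same text is classified differently depending on a set of names already declared as types. The function `classify` and the `known_types` set are hypothetical, and a real C front end tracks far more than this.

```python
def classify(statement, known_types):
    """Classify 'a * b;' given the names currently declared as types."""
    left, _, right = (part.strip() for part in statement.rstrip(";").partition("*"))
    if left in known_types:
        # The parser needs the symbol table to make this call.
        return f"declaration: {right} is a pointer to {left}"
    return f"expression: {left} multiplied by {right}"

print(classify("foo * bar;", {"foo"}))  # declaration: bar is a pointer to foo
print(classify("foo * bar;", set()))    # expression: foo multiplied by bar
```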

Mike Samuel
  • Funny you should mention CSV files. They're one of the things that made me want to ask this question. Am I interpreting your example right when I say that you should use a string parser over Regular Expressions when dealing with CSV files? – Dan Aug 10 '12 at 18:08
  • 4
    @Dan, regular expressions can deal with CSV files just fine -- there's no arbitrarily deep nesting, just a two-level structure. For IE style , you can find the lines using something like `/([^\r\n"]|"(?:[^"]|"")*")*/g`, which allows newlines inside quoted strings that use pairs of double quotes to escape double quotes. Then you can find fields in a line using something like `/([^,"]|"(?:[^"]|"")*")*/g`. Then you just need to find quoted sections using `/"(?:[^"]|"")*"/`, strip off the outer quotes, and replace all occurrences of `""` with `"`. – Mike Samuel Aug 10 '12 at 18:18
  • This is a years-old question, but I want to comment that CSV in general should be handled by a parser, not a regex. In fact, there are so many unexpected traps and pitfalls in CSV parsing that you should be using a mature library for the task instead of a home-cooked solution (unless you control the CSV end-to-end in your custom application). – Mizstik May 15 '16 at 09:50
  • 1
    @MikeSamuel - "Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions." Preposterous. Regular expressions are indeed a language all their own and need to be well understood, but that doesn't mean we should write tons of procedural code to parse strings because we just don't understand. Knowing what you're doing with regex makes a world of difference in the maintainability and readability of any code. Regex is complex but extremely standardized. Procedural parsing code is error prone and laborious. – Joey Carson Sep 23 '16 at 00:25
  • @JoeyCarson, You seem to want to rebut the bit you quoted. I assert that (1) there is no dichotomy between regular expressions and procedural code, (2) I never claimed that one need not know regular expressions and the quoted text does not imply that, and (3) knowing regular expression syntax well does not help craft a simple regex solution to email handling -- http://emailregex.com/ is neither small nor simple nor readable. If you believe regexs are a better tool for email handling than CF grammars & code, please provide evidence. Pointers to regexs in webmail systems would be nice. – Mike Samuel Sep 23 '16 at 18:35
  • In some languages you can use recursive regexes, which can handle the bracket case. I used to build a regex from strings so that I could split it up and explain each part with variable names. – inf3rno Jan 08 '19 at 13:40
  • I agree with @JoeyCarson: you don't give enough credit to regex. Another thought: a common mistake is using a single complex "god pattern" instead of multiple simple patterns. There is a similar trend among functional programmers too; they try to solve everything in a single line and then wonder why the outcome is not readable. – inf3rno Jan 08 '19 at 13:59
  • @inf3rno, Re "credit", which of my claims about regular expressions do you disagree with? Re JoeyCarson, I asserted that email addresses are at the limit of what you can deal with with regular expressions. JoeyCarson said that was "preposterous." Let's keep things simple and define email address in terms of an [RFC 5322 addr-spec](https://tools.ietf.org/html/rfc5322#section-3.4.1). How small do you think you can get a regular expression that matches all and only addr-specs? – Mike Samuel Jan 08 '19 at 16:41
11

It should be Regular Expression AND String Parsing.

You can use both of them to your advantage! Programmers often try to write a SINGLE regular expression to parse a text and then find it very difficult to maintain. You should use both as and when required.

The regex engine is FAST. A simple match takes less than a microsecond. But it's not recommended for parsing HTML.
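One way to combine the two, sketched in Python under an assumed input format of `name=number` pairs separated by semicolons (the format, `PAIR_RE`, and `parse_settings` are all invented for this example): plain string splitting does the coarse work, and a small regular expression validates each piece.

```python
import re

# Small, readable pattern for one piece instead of one pattern for everything.
PAIR_RE = re.compile(r"\s*(\w+)\s*=\s*(\d+)\s*")

def parse_settings(text):
    """Parse 'width=80; height = 24;' into {'width': 80, 'height': 24}."""
    settings = {}
    for chunk in text.split(";"):      # string parsing: coarse split
        if not chunk.strip():
            continue                   # skip the empty piece after a trailing ';'
        m = PAIR_RE.fullmatch(chunk)   # regex: validate and extract one pair
        if m is None:
            raise ValueError(f"malformed setting: {chunk!r}")
        settings[m.group(1)] = int(m.group(2))
    return settings
```

Splitting first only works here because this made-up format has no quoting or escaping, so a semicolon is always a separator.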

everton
Anirudha
  • 3
    `You should use both as and when required.` When? I need an example. I mean what you're saying makes sense but I need an explanation as to what exactly you mean. – Dan Aug 10 '12 at 18:02
  • 2
    @Dan, see my answer for a common case. When parsing a CF language, you often use a regular expression to split it into tokens and then handle that stream of tokens using a full parser. For example, you might break `"(a + b)*c"` into `["(", "a", " ", "+", " ", "b", ")", "*", "c"]` and then throw out the spaces and give the result to a parser to handle the parentheses and operator precedence to produce a tree like `(Times (Plus (Var "a") (Var "b")) (Var "c"))`. – Mike Samuel Aug 10 '12 at 18:22