6

After some research I concluded that recursive structures (such as HTML or XML) cannot be parsed with regular expressions. Is it possible to comprehensively list the day-to-day coding scenarios where I should avoid regular expressions because the task is simply impossible to do with them? Assume the regex engine in question is not PCRE.

Narendra Yadala
  • I think your question is too broad. It is not far enough from "when to use a tool". You cannot really expect a definitive answer for all possible cases, can you? When to use a tool: when you understand it, when it simplifies your work, when it makes the code clearer instead of more complicated... When to use regex? When you need to match patterns against strings. Can't do much better than that. – Kobi Sep 26 '11 at 10:48
  • I agree that 'when to use regex' is a broad question. But I think it is useful to be aware of common scenarios where you cannot use regex to accomplish a particular task. This will save the developer a lot of time. – Narendra Yadala Sep 26 '11 at 10:56
  • See also this question, with an [example of "parsing with regex"](http://stackoverflow.com/a/15589159/287948). – Peter Krauss Mar 24 '13 at 11:03

3 Answers

29

Don't use regular expressions when:

  • the language you are trying to parse is not a regular language, or
  • there are readily available parsers specifically made for the data you are trying to parse.

Parsing HTML and XML with regular expressions is usually a bad idea, both because they are not regular languages and because libraries already exist that can parse them for you.
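To illustrate what "not a regular language" means in practice, here is a minimal Python sketch using balanced parentheses as a stand-in for nested markup. The names `is_balanced` and `TWO_LEVELS` are made up for this example. A counter handles any nesting depth, while a fixed pattern without recursion or balancing extensions only covers the depth that was written into it.

```python
import re


def is_balanced(text: str) -> bool:
    """Return True if every '(' in text has a matching ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


# Hypothetical pattern for illustration: it accepts at most two levels
# of nesting, because each extra level has to be spelled out by hand.
TWO_LEVELS = re.compile(r"\((?:[^()]|\([^()]*\))*\)")

print(is_balanced("(a(b(c)))"))                 # True
print(bool(TWO_LEVELS.fullmatch("(a(b))")))     # True
print(bool(TWO_LEVELS.fullmatch("(a(b(c)))")))  # False
```

The pattern can be extended to three or four levels, but never to "any depth", which is the practical reason nested formats call for a real parser.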

As another example, if you need to check if an integer is in the range 0-255, it's easier to understand if you use your language's library functions to parse it to an integer and then check its numeric value instead of trying to write the regular expression that matches this range.
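A rough Python sketch of the comparison, with `in_byte_range` and `BYTE_RE` as names invented for this example:

```python
import re


def in_byte_range(text: str) -> bool:
    """Parse text as an integer, then check 0 <= value <= 255."""
    try:
        value = int(text)
    except ValueError:  # not an integer at all
        return False
    return 0 <= value <= 255


# The regex counterpart has to enumerate the digit shapes one by one.
BYTE_RE = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]")

print(in_byte_range("255"))                 # True
print(in_byte_range("256"))                 # False
print(BYTE_RE.fullmatch("256") is not None)  # False
```

The two are not exactly equivalent at the edges (for example, `int` accepts surrounding whitespace and a leading sign), but the intent of the numeric check is visible at a glance, and changing the range to 0-1023 means editing one number rather than rewriting the pattern.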

Mark Byers
  • I understand that but I just want to know some day to day coding scenarios where I should just stay away from regexes. Such as parsing HTML or XML. – Narendra Yadala Sep 26 '11 at 10:29
7

I'll plagiarize myself from my blog post, When to use and when not to use regular expressions...

Public websites should not allow users to enter regular expressions for searching. Giving the full power of regex to the general public for a website's search engine could have a devastating effect. There is such a thing as a regular expression denial of service (ReDoS) attack that should be avoided at all costs.
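One common way to follow this advice is to treat the user's query as literal text. The sketch below is in Python rather than .NET and is not from the blog post; `find_literal` is a hypothetical helper, and `re.escape` is the standard-library call that neutralises metacharacters.

```python
import re


def find_literal(user_query: str, documents: list[str]) -> list[str]:
    """Return the documents that contain user_query as plain text."""
    # re.escape turns input such as "(a+)+$" into a pattern that matches
    # those exact characters, so the user never controls the regex itself.
    pattern = re.compile(re.escape(user_query), re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


docs = [
    "Price: $5 (approx.)",
    "aaaaaaaaaaaaaaaaaaaaaaaa!",
    "The pattern (a+)+$ is risky",
]
print(find_literal("(a+)+$", docs))  # ['The pattern (a+)+$ is risky']
print(find_literal("$5 (", docs))    # ['Price: $5 (approx.)']
```

Run as a real pattern against the second document, `(a+)+$` is the kind of nested quantifier that can backtrack exponentially in a backtracking engine; escaped, it is only a seven-character string to look for.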

HTML/XML parsing should not be done with regular expressions. First of all, regular expressions are designed to match regular languages, the simplest class in the Chomsky hierarchy. Now, with the advent of balancing group definitions in the .NET flavor of regular expressions, you can venture into slightly more complex territory and do a few things with XML or HTML in controlled situations. However, there's not much point. There are parsers available for both XML and HTML which will do the job more easily, more efficiently, and more reliably. In .NET, XML can be handled the old XmlDocument way or even more easily with LINQ to XML. For HTML there's the HTML Agility Pack.
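The same point holds outside .NET. As a sketch in Python (a swapped-in language, using the standard library's `xml.etree.ElementTree` instead of XmlDocument or LINQ to XML), a parser hands you the nesting for free. The sample document and the helper `item_depths` are invented for this example.

```python
import xml.etree.ElementTree as ET

XML = """<menu>
  <item name="File">
    <item name="Open"/>
    <item name="Export">
      <item name="PDF"/>
    </item>
  </item>
</menu>"""


def item_depths(xml_text: str) -> list[tuple[str, int]]:
    """List each <item> name with its nesting depth, in document order."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(element, depth):
        # findall("item") returns only the direct <item> children.
        for child in element.findall("item"):
            result.append((child.get("name"), depth))
            walk(child, depth + 1)

    walk(root, 1)
    return result


print(item_depths(XML))
# [('File', 1), ('Open', 2), ('Export', 2), ('PDF', 3)]
```

Getting the same depth information from a pattern would mean tracking open and close tags by hand, which is exactly the work the parser already does.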

Conclusion

Regular expressions have their uses. I still contend that in many cases they can save the programmer a lot of time and effort. Of course, given infinite time & resources, one could almost always build a procedural solution that's more efficient than an equivalent regular expression.

Your decision to abandon regex should be based on 3 things:

1.) Is the regular expression so slow in your scenario that it has become a bottleneck?

2.) Is your procedural solution actually quicker & easier to write than the regular expression?

3.) Is there a specialized parser that will do the job better?

carla
Steve Wortham
4

My rule of thumb is: use regular expressions only when no other solution exists. If there's already a parser (for example, for XML or HTML) or you're just looking for strings rather than patterns, there's no need to use regular expressions.

Always ask yourself "can I solve this without using regular expressions?". The answer to that question will tell you whether you should use regular expressions.
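For the "strings rather than patterns" case, a small Python sketch (the log line is made up for the example) shows how far ordinary string methods go:

```python
log_line = "2011-09-26 ERROR disk full on /dev/sda1"

# Fixed substrings need no pattern matching at all.
has_error = "ERROR" in log_line             # substring test
is_dated = log_line.startswith("2011-")     # prefix test
position = log_line.find("disk")            # index of the first occurrence

# A fixed delimiter is a job for split, not for a capture group.
date, level, message = log_line.split(" ", 2)

print(has_error, is_dated, position)  # True True 17
print(level, "|", message)            # ERROR | disk full on /dev/sda1
```

Once the thing being searched for varies (any date, any device name), that is the point where a pattern starts to earn its keep.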

Bryan Oakley