4

My manager recently advised me not to depend too heavily on regex because it has a lot of disadvantages. When I tried to learn more, I read that regex can cause memory leaks because some objects keep holding string references even after use.

.NET RegEx "Memory Leak" investigation

So is it right to say that regex causes memory overhead and should not be used if you have other options? Are there any other disadvantages to regex (apart from it being tough to learn :) )?

P.S. I am developing an application (C#/.NET), similar to a web crawler, that extracts all hrefs and some other information such as the title and meta tags. I have the option of using HTML Agility Pack instead of regex.

Ananth
  • 1. No, the primary reason for not using regex for everything is not a possible memory leak. 2. [You can't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). 3. Use HTML Agility Pack. – dtb Jun 28 '11 at 11:05
  • Using a regex to extract hrefs? what... to parse html? [oh dear](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Marc Gravell Jun 28 '11 at 11:06
  • As an aside, on *several* occasions when we've pegged a server's CPU, the culprit has been a regex hitting a corner-case.... be very very careful with them ;p – Marc Gravell Jun 28 '11 at 11:07
  • @Marc Thanks, I'll be :). As per MSDN, pattern = "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))" will do the trick for href. But I went for HTML Agility Pack, as I needed some other data that can be taken easily and efficiently with the help of XPath and the DOM – Ananth Jun 28 '11 at 11:31
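
The CPU spikes mentioned in the comments above are typically what catastrophic backtracking looks like. Below is a minimal sketch of guarding against it with a match timeout. The pattern `^(a+)+$`, the 500 ms limit, and the `TryIsMatch` helper are illustrative assumptions, not anything from the thread, and the timeout constructor requires .NET 4.5 or later.

```csharp
using System;
using System.Text.RegularExpressions;

static class BacktrackingDemo
{
    // Hypothetical helper: true/false for a completed match, null if it timed out.
    public static bool? TryIsMatch(string pattern, string input, TimeSpan timeout)
    {
        try
        {
            // The (pattern, options, matchTimeout) constructor needs .NET 4.5+.
            return new Regex(pattern, RegexOptions.None, timeout).IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            return null;
        }
    }

    static void Main()
    {
        // Nested quantifiers: a textbook shape for catastrophic backtracking.
        const string risky = @"^(a+)+$";
        TimeSpan limit = TimeSpan.FromMilliseconds(500);

        // A run of 'a' characters matches on the first attempt.
        Console.WriteLine(TryIsMatch(risky, new string('a', 40), limit));

        // The trailing '!' makes the match fail, so a backtracking engine
        // retries many ways of splitting the run of 'a' characters first.
        bool? slow = TryIsMatch(risky, new string('a', 40) + "!", limit);
        Console.WriteLine(slow.HasValue ? slow.Value.ToString() : "timed out");
    }
}
```

Without a timeout, a pattern like this on hostile or merely unlucky input can keep a thread busy for a very long time, which is consistent with the corner cases described above.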

3 Answers

10

Regex makes code difficult to read. Most of the time, even at the expense of more verbose code, you are better off not using regular expressions. The performance cost and the loss of readability mean that a regex is the wrong choice in most cases, especially the simplest ones and the most complex ones.

And for the purpose you mention (parsing HTML etc.), regular expressions simply cannot get the job done, because HTML is not a regular language. It is like having only a hammer, so that everything looks like a nail.
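
As a rough illustration of that limitation, the sketch below runs the href pattern quoted in the question's comments over a small, made-up HTML snippet. The snippet and the `ExtractHrefs` helper are assumptions for demonstration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefRegexDemo
{
    // The pattern quoted in the comments on the question (attributed there to MSDN).
    const string Pattern = "href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))";

    // Hypothetical helper: collects whatever the pattern captures in group 1.
    public static List<string> ExtractHrefs(string html)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern, RegexOptions.IgnoreCase))
        {
            found.Add(m.Groups[1].Value);
        }
        return found;
    }

    static void Main()
    {
        // Made-up input: one live link, one commented-out link,
        // and one href-like string inside a script block.
        string html =
            "<a href=\"/real\">real link</a>\n" +
            "<!-- <a href=\"/commented-out\">not a live link</a> -->\n" +
            "<script>var s = 'href=\"/inside-script\"';</script>";

        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

The pattern has no notion of comments or script content, so it reports all three values even though only `/real` is a link a browser would render. Teaching the regex about each such case is where the complexity starts to pile up.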

manojlds
  • Your last sentence is strange. Hammer and nails are a perfect fit. – Daniel Hilgarth Jun 28 '11 at 11:09
  • @Daniel Hilgarth I think I have used it correctly - http://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail – manojlds Jun 28 '11 at 11:10
  • @Daniel: Funnily enough I used the same expression in mine but worded (I think) a little better. The idea is that when you have a hammer everything looks like a nail, e.g. that screw? Bang it in with the hammer! – Chris Jun 28 '11 at 11:11
  • @manojlds: ah, that's what you meant... You should word it like in the link, though – Daniel Hilgarth Jun 28 '11 at 11:12
  • @manojlds: For fear of deviating off topic into English usage I think the difference is subtle. "all you see is nails" implies that they actually are nails and there is nothing else. "Everything looks like a nail" implies that things only appear to be nails whereas in fact they may not be. – Chris Jun 28 '11 at 11:12
  • @Chris - Thanks, maybe not worded properly, but I think people who have heard that phrase would understand it :) – manojlds Jun 28 '11 at 11:13
  • @manojlds: Yeah, I would have understood what you meant. I was just commenting so that you can use it better next time and then all the people who haven't heard it before will hopefully understand too. :) – Chris Jun 28 '11 at 11:16
1

Regular expressions can obfuscate the logic you are using; it may be less complex to do it in code sometimes. In code you can break the different logical tests up and comment each one so that people can see why you are doing what you are doing.
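
A small sketch of that trade-off, using a validation rule invented purely for illustration (3 to 16 characters, starting with an ASCII letter, followed by ASCII letters, digits, or underscores). Both methods are meant to accept the same strings; all names here are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class UserNameRules
{
    // Regex version: compact, but every rule is packed into one pattern.
    public static bool IsValidRegex(string name)
    {
        return Regex.IsMatch(name, @"\A[A-Za-z][A-Za-z0-9_]{2,15}\z");
    }

    // Explicit version: longer, but each rule is a separate, commented test.
    public static bool IsValidExplicit(string name)
    {
        // Rule 1: length must be between 3 and 16 characters.
        if (name.Length < 3 || name.Length > 16) return false;

        // Rule 2: the first character must be an ASCII letter.
        if (!IsAsciiLetter(name[0])) return false;

        // Rule 3: the rest must be ASCII letters, digits, or underscores.
        for (int i = 1; i < name.Length; i++)
        {
            char c = name[i];
            bool isDigit = c >= '0' && c <= '9';
            if (!IsAsciiLetter(c) && !isDigit && c != '_') return false;
        }
        return true;
    }

    static bool IsAsciiLetter(char c)
    {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static void Main()
    {
        foreach (string candidate in new[] { "alice_01", "1alice", "ab" })
        {
            Console.WriteLine("{0}: regex={1}, explicit={2}",
                candidate, IsValidRegex(candidate), IsValidExplicit(candidate));
        }
    }
}
```

Even the short regex hides a subtlety: it uses `\A` and `\z` rather than `^` and `$`, because in .NET `$` also matches just before a trailing newline.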

Paul Richards
1

My view is that regexes can often do the job, but you need to be careful not to overuse them. As they say, when all you have is a hammer, every problem looks like a nail.

In this case you are trying to parse HTML to get data out. An HTML parser will be both more readable and probably more reliable. Regular expressions that parse HTML will often either fail in some circumstances (malformed HTML being the big one) or end up far more complicated than the equivalent code using an HTML parser.
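
For comparison, here is a minimal sketch of the parser route using HTML Agility Pack, the library named in the question. It assumes the `HtmlAgilityPack` NuGet package is referenced. The sample HTML and the `PageInfo` helper names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

static class PageInfo
{
    // Hypothetical helper: href values of all anchor elements that have one.
    public static List<string> ExtractHrefs(HtmlDocument doc)
    {
        var hrefs = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                hrefs.Add(a.GetAttributeValue("href", ""));
            }
        }
        return hrefs;
    }

    // Hypothetical helper: text of the first <title> element, or null if absent.
    public static string ExtractTitle(HtmlDocument doc)
    {
        HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
        return title != null ? title.InnerText : null;
    }

    static void Main()
    {
        // Made-up page: one live link plus a commented-out one.
        string html =
            "<html><head><title>Demo page</title>" +
            "<meta name=\"description\" content=\"A test page\"></head>" +
            "<body><a href=\"/real\">real</a>" +
            "<!-- <a href=\"/commented-out\">hidden</a> --></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        Console.WriteLine(ExtractTitle(doc));
        foreach (string href in ExtractHrefs(doc))
        {
            Console.WriteLine(href);
        }

        // Meta tags are reachable with the same XPath approach.
        HtmlNode meta = doc.DocumentNode.SelectSingleNode("//meta[@name='description']");
        if (meta != null)
        {
            Console.WriteLine(meta.GetAttributeValue("content", ""));
        }
    }
}
```

Because the parser builds a tree, the commented-out anchor is a comment node rather than an element, so the `//a[@href]` query does not select it. That distinction is hard to get from a text-level pattern.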

I don't know about the memory leaks and performance issues, but even ignoring those I tend to keep regex use to a minimum.

Chris