When I parse HTML, I always go with the approach I find most intuitive: running preg_match against the page source. I know there are parsers that get the job done with less code, such as PHP Simple HTML DOM Parser, but I'm not sure whether they are faster than preg_match when I only need a handful of values from the source.

So, are parsers actually faster, or do they just make the code look better? Assume the regex passed to preg_match is not an inefficient one.

Lafix
  • Do not use regex to parse HTML... read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Clay Jan 12 '16 at 07:32
  • Parsers are slower, but much more reliable. It is not really about making code look pretty; it is about not getting wrong results if the HTML file ends up violating the assumptions you carried when you built your regexp. – Amadan Jan 12 '16 at 07:34
  • Using regular expressions is very expensive in terms of performance. Not only do proper parsers compare better on performance, they also tend to make the code much easier to read and (implicitly) maintain. – Repox Jan 12 '16 at 07:37
  • I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Any of the libxml based libraries should outperform this easily. – AddWeb Solution Pvt Ltd Jan 12 '16 at 07:42
  • You can use the fantastic [Symfony DomCrawler](http://symfony.com/doc/current/components/dom_crawler.html) + [CssSelector](http://symfony.com/doc/current/components/css_selector.html) components. If you're used to jQuery selectors you'll feel at home with the CSS component. – mTorres Jan 12 '16 at 07:50

1 Answer

It's generally not a good idea to parse HTML/XML with a regexp. There are a lot of special situations that a regexp cannot handle well: a tag split across several lines, & entities, CDATA sections, and many others.
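
As a rough illustration of the first two cases, here is a minimal sketch. The HTML fragment and both patterns are made up for the example and are not taken from any real page: a pattern written for single-line markup finds nothing once the tag wraps, and a loosened pattern matches but still hands back the raw, entity-encoded attribute value.

```php
<?php
// Hypothetical fragment: the <a> tag is split across lines and contains &amp; entities.
$html = "<a\n   href=\"/item?id=1&amp;ref=home\"\n   class=\"link\">Tom &amp; Jerry</a>";

// Naive pattern that assumes the whole tag sits on one line with single spaces.
$naive = '/<a href="([^"]*)" class="link">(.*)<\/a>/';
$matched = preg_match($naive, $html, $m1);
var_dump($matched); // 0, because the line break after "<a" defeats the pattern

// Loosened pattern: \s+ tolerates the line breaks, and the s flag lets . match newlines.
$loosened = '/<a\s+href="([^"]*)"\s+class="link">(.*?)<\/a>/s';
$loose = preg_match($loosened, $html, $m2);

// The match succeeds, but the captured value is still entity-encoded
// and has to be decoded by hand.
var_dump($m2[1]);                     // "/item?id=1&amp;ref=home"
var_dump(html_entity_decode($m2[1])); // "/item?id=1&ref=home"
```

Even the loosened pattern still depends on the attribute order and quoting style staying exactly as shown.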

A real parser (DOM, or SAX if the text is actually XML) is quite fast, and its reliability is incomparably better.
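
For comparison, a minimal sketch of the DOM route using PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`), assuming that extension is enabled. The helper name `extractLinks`, the sample markup, and the `class="link"` selector are hypothetical and only serve the example.

```php
<?php
// Hypothetical helper: returns [href, text] pairs for every <a class="link"> element.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often invalid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//a[@class="link"]') as $a) {
        // Attribute values and text content come back with entities already decoded.
        $result[] = [$a->getAttribute('href'), $a->textContent];
    }
    return $result;
}

// Made-up input: the tag wraps across lines and contains &amp; entities.
$html = "<html><body><a\n   href=\"/item?id=1&amp;ref=home\"\n   class=\"link\">Tom &amp; Jerry</a></body></html>";

$links = extractLinks($html);
print_r($links);
```

The line breaks inside the tag and the entities need no special handling here, since the parser deals with both before the query runs.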

Zbynek Vyskovsky - kvr000