
I was thinking about a script that would scan 10+ websites for specific content inside a specific div. Let's say it would be moderately used, some 400 searches a day.

Which of the two approaches in the title would handle the load better, take fewer resources, and give better speed:

Creating a DOM from each of the websites and then searching it for the specific div id,

OR

Creating a string from the website with file_get_contents and then regexping out the needed string.

To be more specific about the kind of operation I would need to execute, consider the following.

Additional question: given the following string, is a regexp capable of searching it:

<div id="myId"> needed string </div>

to identify the tag with the given ID and return ONLY what is between the tags?

Please answer only yes/no; if it's possible, I'll open a separate question about the syntax so it's not all bundled here.
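
For context, the kind of extraction I have in mind looks roughly like this (just a sketch to show what "regexping the needed string" means here; the URL and pattern are made up, and the exact syntax belongs in the separate question):

    <?php
    // Rough sketch: fetch the page, then capture whatever sits between
    // <div id="myId"> and the next </div>. Breaks on nested divs.
    $html = file_get_contents('http://example.com/page.html'); // placeholder URL

    if (preg_match('~<div\s+id="myId"[^>]*>(.*?)</div>~si', $html, $m)) {
        $needed = trim($m[1]); // "needed string"
    }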

Michele
  • Basically what Artefacto said. But don't use the raw DOM methods. Try phpQuery (or maybe QueryPath), which has a nice API for extracting and scraping websites. See also http://stackoverflow.com/questions/3577641/best-methods-to-parse-html – mario Feb 13 '11 at 16:52
  • While I second the link suggested by @mario, I disagree to not using the DOM methods. DOM is great. Unless you want CSS Selectors there is no reason to use a third party lib. – Gordon Feb 13 '11 at 16:58
  • @Gordon: Well I think `echo qp($url)->find("#myId")->text();` is a good enough reason for eschewing the complex API. But QueryPath is also supposedly more resilient against HTML errors. – mario Feb 13 '11 at 17:09
  • @mario Not for me. I know a lot of people think DOM has too verbose and awkward an API, but I never understood why. It clearly communicates what it does. And since it's language agnostic, I can use the same in other languages. – Gordon Feb 13 '11 at 17:42

2 Answers


For 400 searches a day, which method you use makes little difference performance-wise.

In any case, the fastest method would be file_get_contents + strpos + substr, unless your locate-and-extract logic is sufficiently complex. Depending on the specific regular expression, it may or may not be faster than DOM, but it likely is. DOM will probably be a more reliable method than regular expressions, but that depends on how well-formed your pages are (libxml2 does not exactly mimic the browsers' parsing).
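
For illustration, a rough sketch of the file_get_contents + strpos + substr variant (assuming the id attribute appears exactly in this form, the div is not nested, and the URL is a placeholder):

    <?php
    // Sketch only: locate the opening tag, then cut out everything up to the next </div>.
    $html = file_get_contents('http://example.com/page.html'); // placeholder URL
    $open = '<div id="myId">';

    $start = strpos($html, $open);
    if ($start !== false) {
        $start += strlen($open);
        $end    = strpos($html, '</div>', $start);
        $needed = substr($html, $start, $end - $start);
    }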

Artefacto
  • Actually PCRE is typically faster than Zend bytecode and handcrafted strpos/substr code. – mario Feb 13 '11 at 16:48
  • @mario While function calls are expensive in PHP, we're talking about maybe two extra calls here vs compilation of a regular expression. In any case, I added a "save" clause. – Artefacto Feb 13 '11 at 16:53
  • ty for the specifics, great answer – Michele Feb 13 '11 at 17:07
  1. Yes

  2. Speed will depend on your server and the pages in question; either way, execution time will be negligible compared to the time spent downloading the pages to scan.

  3. If you go with DOM/XPath, the whole thing is doable in about three lines of code (sketched below).
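
Something along these lines, for example (a sketch under the assumption that the markup is well-formed enough for libxml; the URL is a placeholder):

    <?php
    // Parse the page, query the div by id via XPath, read its text content.
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents('http://example.com/page.html')); // @ silences markup warnings
    $xpath  = new DOMXPath($doc);
    $needed = $xpath->evaluate('string(//div[@id="myId"])');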

fdreger