
I was thinking about a script that would scan 10+ websites for specific content inside a specific div. Let's say it would be moderately used, some 400 searches a day.

Which of the two approaches in the title would handle the load better, take fewer resources, and give better speed:

Creating a DOM from each of the websites and then searching it for the specific div id,

OR

Creating a string from the website with file_get_contents and then regexping out the needed string.

To be more specific about the kind of operation I would need to execute, consider the following.

Additional question: given the following string, is a regexp capable of searching it:

<div id="myId"> needed string </div>

to identify the tag with the given ID and return ONLY what is between the tags?

Please answer only yes/no; if it's possible, I'll open a separate question about the syntax so it's not all bundled here.
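
For context, the kind of extraction I have in mind looks roughly like this (just a sketch to show what "regexping the needed string" means here; the URL and pattern are made up, and the exact syntax belongs in the separate question):

    <?php
    // Rough sketch: fetch the page, then capture whatever sits between
    // <div id="myId"> and the next </div>. Breaks on nested divs.
    $html = file_get_contents('http://example.com/page.html'); // placeholder URL

    if (preg_match('~<div\s+id="myId"[^>]*>(.*?)</div>~si', $html, $m)) {
        $needed = trim($m[1]); // "needed string"
    }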

Michele
  • Basically what Artefacto said. But don't use the raw DOM methods. Try phpQuery (or maybe QueryPath), which has a nice API for extracting and scraping websites. See also http://stackoverflow.com/questions/3577641/best-methods-to-parse-html – mario Feb 13 '11 at 16:52
  • While I second the link suggested by @mario, I disagree to not using the DOM methods. DOM is great. Unless you want CSS Selectors there is no reason to use a third party lib. – Gordon Feb 13 '11 at 16:58
  • @Gordon: Well I think `echo qp($url)->find("#myId")->text();` is a good enough reason for eschewing the complex API. But QueryPath is also supposedly more resilient against HTML errors. – mario Feb 13 '11 at 17:09
  • @mario Not for me. I know a lot of people think DOM has too verbose and awkward an API, but I never understood why. It clearly communicates what it does. And since it's language agnostic, I can use the same in other languages. – Gordon Feb 13 '11 at 17:42

2 Answers


For 400 searches a day, which method you use makes little difference performance-wise.

In any case, the fastest method would be file_get_contents + strpos + substr, unless your locate-and-extract logic is sufficiently complex. Depending on the specific regular expression, it may or may not be faster than DOM, but it likely is. DOM will probably be a more reliable method than regular expressions, but that depends on how well-formed your pages are (libxml2 does not exactly mimic the browsers' parsing).
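
For illustration, a rough sketch of the file_get_contents + strpos + substr variant (assuming the id attribute appears exactly in this form, the div is not nested, and the URL is a placeholder):

    <?php
    // Sketch only: locate the opening tag, then cut out everything up to the next </div>.
    $html = file_get_contents('http://example.com/page.html'); // placeholder URL
    $open = '<div id="myId">';

    $start = strpos($html, $open);
    if ($start !== false) {
        $start += strlen($open);
        $end    = strpos($html, '</div>', $start);
        $needed = substr($html, $start, $end - $start);
    }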

Artefacto
  • Actually PCRE is typically faster than Zend bytecode and handcrafted strpos/substr code. – mario Feb 13 '11 at 16:48
  • @mario While function calls are expensive in PHP, we're talking about maybe two extra calls here vs compilation of a regular expression. In any case, I added a "save" clause. – Artefacto Feb 13 '11 at 16:53
  • ty for the specifics, great answer – Michele Feb 13 '11 at 17:07
  1. Yes

  2. Speed will depend on your server and the pages in question; either way, execution time will be negligible compared to the time spent downloading the pages to scan.

  3. If you go with DOM/XPath, the whole thing is doable in about three lines of code (sketched below).
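
Something along these lines, for example (a sketch under the assumption that the markup is well-formed enough for libxml; the URL is a placeholder):

    <?php
    // Parse the page, query the div by id via XPath, read its text content.
    $doc = new DOMDocument();
    @$doc->loadHTML(file_get_contents('http://example.com/page.html')); // @ silences markup warnings
    $xpath  = new DOMXPath($doc);
    $needed = $xpath->evaluate('string(//div[@id="myId"])');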

fdreger