I'd like to fetch all the latest news from this site (at the center board): http://web.hanu.vn/en/. My latest approach was parsing the HTML with Simple HTML DOM Parser in PHP, but I find it very slow. My idea is to fetch news from almost 20 similar sites. They are all built on Moodle, so they share the same HTML format. However, a single site takes several seconds to fetch, so 20 sites take a lot of time. Is there a better approach than parsing HTML? Or should I store the result in a database and update it periodically, rather than fetching it on every user request? Is what I'm doing the so-called "crawling"?
3 Answers
Or should I store the result in a database and update it periodically, rather than fetching it on every user request?
Yes, you should. And stick to parsing HTML; do not use regular expressions for it.
And what you are trying to do is web scraping, not crawling yet (unless you actually follow links from page to page).
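For the caching part, something along these lines could work. This is only a rough sketch, assuming MySQL via PDO; the news table, its columns, and the scrapeSite() helper are placeholders for your own schema and your existing Simple HTML DOM code:

```php
<?php
// Minimal caching sketch, assuming MySQL via PDO, a table
// news(site, title, url, fetched_at) with a unique key on (site, url),
// and a scrapeSite() function that wraps the existing Simple HTML DOM code.

$pdo = new PDO('mysql:host=localhost;dbname=scraper;charset=utf8', 'user', 'pass');
$maxAgeMinutes = 30; // how long cached headlines stay "fresh"
$cutoff = date('Y-m-d H:i:s', time() - $maxAgeMinutes * 60);

$stmt = $pdo->prepare(
    'SELECT title, url FROM news WHERE site = :site AND fetched_at > :cutoff'
);
$stmt->execute([':site' => 'web.hanu.vn', ':cutoff' => $cutoff]);
$items = $stmt->fetchAll(PDO::FETCH_ASSOC);

if (!$items) {
    // Cache is empty or stale: scrape again and refresh the stored rows.
    $items = scrapeSite('http://web.hanu.vn/en/');
    $insert = $pdo->prepare(
        'REPLACE INTO news (site, title, url, fetched_at)
         VALUES (:site, :title, :url, NOW())'
    );
    foreach ($items as $item) {
        $insert->execute([
            ':site'  => 'web.hanu.vn',
            ':title' => $item['title'],
            ':url'   => $item['url'],
        ]);
    }
}
// $items now holds the headlines, from the cache or freshly scraped.
```

That way each user request hits the database, and the slow scraping only happens when the cached copy is older than the chosen interval.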
I recommend you download the page with curl and do the extraction without regex; try substr(), strpos(), strip_tags() and so on. Also store the latest news items in a database and update them with a cron job.
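For example (a sketch only; the "newsbox" marker is a guess at the real Moodle markup, so check the page source and adjust the start and end markers):

```php
<?php
// Sketch only: the curl options are standard, but the '<div class="newsbox">'
// marker is a guess at the Moodle markup - check the real page source and
// adjust the start/end markers accordingly.

function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $html = curl_exec($ch);
    curl_close($ch);

    return $html === false ? '' : $html;
}

$html  = fetchPage('http://web.hanu.vn/en/');
$start = strpos($html, '<div class="newsbox">');              // assumed opening marker
$end   = ($start !== false) ? strpos($html, '</div>', $start) : false;

if ($start !== false && $end !== false) {
    $block = substr($html, $start, $end - $start);
    echo trim(strip_tags($block)), PHP_EOL;                    // just the headline text
}
```

Then run the script from a cron job, e.g. */30 * * * * php /path/to/scrape.php, so the database is refreshed every 30 minutes instead of on every visitor's request.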

I'd recommend you use regular expressions (Wikipedia). Also, it is a very good idea to strip out parts of the HTML first using the strpos and substr functions, which are faster than regular expressions. And here is a nice regular expression tester.
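For example, a rough sketch (the pattern is only a guess at the markup, and, as the comments below explain, this kind of matching breaks easily when the markup changes):

```php
<?php
// Rough sketch: the pattern matches every <a href="...">text</a> on the page,
// so it will also catch navigation links - narrow the input down to the news
// block first with strpos/substr before applying it.

$html = file_get_contents('http://web.hanu.vn/en/');

if (preg_match_all('~<a[^>]+href="([^"]+)"[^>]*>([^<]+)</a>~i', $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        echo trim($m[2]), ' => ', $m[1], PHP_EOL;
    }
}
```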

- Very bad idea. Regular expressions are one of the worst ways to parse HTML. – Tadeck Nov 22 '12 at 21:02
- In that case, could you tell me how exactly HTML is parsed, given no regular expressions are used? – Tomáš Zato Nov 22 '12 at 21:16
- Here is _why_ you should not use regular expressions for that (this is one of numerous articles about it): http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not. To see what is used by HTML parsing libraries internally, please look into their source code. I bet _they_ are using some regular expressions, but believe me: it is not _that_ simple, and you should really not try to repeat that in your simple scraping script. – Tadeck Nov 22 '12 at 21:31
- Here is another example of text discouraging newbie developers from using regular expressions for parsing HTML: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Tadeck Nov 22 '12 at 21:33
- Well, giving me links (and I had read them even before) won't help. I didn't actually ask how HTML parsers work so that you would explain it to me, I know that well, but to make you expand on your idea. When it comes to e.g. extracting an article from HTML, what do you propose to do? Mind the fact that these sites are automatically generated before the reply is fetched. – Tomáš Zato Nov 22 '12 at 22:18
- The idea _to not parse HTML using regular expressions_ is already pretty well explained in the articles I linked; I do not know what else could be clarified here. To extract an article from HTML I propose to follow what I already proposed: using HTML parsers (along with their appropriate features, especially selectors, XPath or CSS-like ones; a short sketch of that approach follows after these comments). And the fact they were "automatically generated" does not make any difference (nowadays a lot of content on the Internet is generated or post-processed automatically), and you should not use regexps for the exact same reasons. – Tadeck Nov 23 '12 at 00:48
- You should note the fact that when you are trying to parse a specific, automatically generated site, you can afford to ignore the DOM structure and just find the string between expected HTML tags with known properties. Of course, to make sure everything works, it's good to use a slightly more complex algorithm, yet still far from a full DOM parser. This results in an application that runs much, much faster than one that parses the whole DOM just to get 10 items from the news-feed table. But I know, you all have super-fast machines now, so you just don't care about performance. – Tomáš Zato Nov 23 '12 at 14:34
- The problem still remains, even if the page is automatically created. Because even automatically created pages change (they may contain new data that gets matched incorrectly, or new parts that fool a regular-expression-based mechanism). If you control **both** the server and the crawler, you can make such assumptions; otherwise, if the page is more complex, you can be sure that at some point it will break. Unless you spend hours/days/weeks building a page-specific parser and correcting it on demand, something that would take minutes with a good DOM parser. – Tadeck Nov 23 '12 at 14:43
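For reference, here is a minimal sketch of the DOM + XPath approach proposed in the comments above, using PHP's built-in DOMDocument. The XPath query assumes a "newsbox" container, which is a guess and would need adjusting to the actual Moodle markup:

```php
<?php
// Sketch of the DOMDocument + XPath approach. The XPath query is an
// assumption about the Moodle markup ("newsbox" container) and will need
// adjusting to the real class names used on those sites.

$html = file_get_contents('http://web.hanu.vn/en/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid, so silence the warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[contains(@class, "newsbox")]//a') as $link) {
    echo trim($link->textContent), ' => ', $link->getAttribute('href'), PHP_EOL;
}
```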