Possible Duplicate:
How to parse and process HTML with PHP?
In your opinion or experience, what is the best approach when scraping web pages for specific info?
I am building a system (PHP/JS/MySQL) that should automatically scrape specific fields of info from specified web pages. The system needs this functionality: after you (semi)manually scrape the first page of a website, the scraping logic is saved to the db and is then reused to scrape any other page in the same format on that website.
I am able to quickly find and save the HTML DOM selector (tag name + class + id) and the XPath, and to add some filter rules (clean HTML, break at first tag, remove specific words, ...).
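To make the idea concrete, here is a minimal sketch of how a saved rule (an XPath plus filter rules) might be applied to a page with PHP's built-in DOMDocument and DOMXPath. The layout of the `$rule` array, the function name `scrapeField` and the sample HTML are assumptions made up for illustration, not the actual system.

```php
<?php
// Hypothetical rule, as it might be loaded from the db:
// an XPath expression plus a few clean-up filters.
$rule = [
    'xpath'        => "//div[@id='product']//span[@class='price']",
    'strip_tags'   => true,          // "clean HTML" filter
    'remove_words' => ['Price:'],    // "remove specific words" filter
];

function scrapeField(string $html, array $rule): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($rule['xpath']);
    if ($nodes === false || $nodes->length === 0) {
        return null; // the rule did not match this page
    }

    // Use only the first match; keep the markup unless asked to strip it.
    $node  = $nodes->item(0);
    $value = !empty($rule['strip_tags'])
        ? $node->textContent
        : $doc->saveHTML($node);

    foreach ($rule['remove_words'] ?? [] as $word) {
        $value = str_replace($word, '', $value);
    }
    return trim($value);
}

$html = '<html><body><div id="product">'
      . '<span class="price">Price: <b>19.99</b> EUR</span>'
      . '</div></body></html>';

echo scrapeField($html, $rule), "\n"; // prints "19.99 EUR"
```

Returning null when nothing matches lets the caller tell "this page is in a different format" apart from an empty field, which matters when the same rule is replayed over many pages.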
My question (again :)) is: what is the best method to use in this automation to be able to scrape the pages properly?
For example:
Simple HTML DOM: http://simplehtmldom.sourceforge.net/
XPath: http://docs.jquery.com/DOM/Traversing/Selectors#XPath_Selectors
Regex
Any other suggestions are welcome.
UPDATE: I have used XPath, Simple HTML DOM and regex. For automating the scraping and building an easy configurator (an interface used to configure the scraping rules for a particular website), Simple HTML DOM is the best. Unfortunately, XPath is far from useful in 90% of the cases, while Simple HTML DOM works very well in at least 50% of the cases.
I have also recently added a regex component where I write the rules manually, and they work very well (in at least 80% of the cases). It is just a lot of manual work.
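For illustration, a hand-written regex rule of this kind could look roughly like the sketch below. The pattern, the named group `value`, the `$regexRule` layout and the function name `scrapeWithRegex` are all hypothetical; the real rules depend on each website's markup.

```php
<?php
// Hypothetical hand-written rule: a PCRE pattern whose named group
// "value" captures the field to extract.
$regexRule = [
    'pattern' => '~<span[^>]*class="price"[^>]*>\s*(?<value>[\d.,]+)~i',
];

function scrapeWithRegex(string $html, array $rule): ?string
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($rule['pattern'], $html, $m) === 1 && isset($m['value'])) {
        return trim($m['value']);
    }
    return null; // the rule did not match this page
}

$page = '<div><span class="price"> 19.99 EUR</span></div>';
echo scrapeWithRegex($page, $regexRule), "\n"; // prints "19.99"
```

A pattern like this is tied to the exact attribute order and quoting used by the site, which is probably why such rules work well per site but have to be written by hand each time.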