
Possible Duplicate:
How to parse and process HTML with PHP?

In your opinion or experience, what is the best approach for scraping specific information from web pages?

I am building a system (PHP/JS/MySQL) that should automatically scrape specific fields of information from specified web pages. The system needs to work like this: after you (semi-)manually scrape the first page of a website, the extraction logic is saved to the database and then reused to scrape any other page on that website that has the same format.

I am able to quickly find and save the HTML DOM selector (tag name + class + id) and the XPath, and to add some filter rules (such as clean HTML, break at first tag, remove specific words...).

My question (again :)) is: what is the best method to use in this automation so that the pages are scraped properly?
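To make the idea of a stored rule concrete, here is a minimal sketch of how a saved selector plus filters could be applied to a page. It uses Python's standard library purely for illustration (the system described here is PHP), and the rule format, its field names, and the helper `apply_rule` are all hypothetical, not taken from any existing library.

```python
from html.parser import HTMLParser

# Hypothetical rule format: every field name here is an illustrative assumption.
RULE = {
    "tag": "div",
    "class": "price",
    "id": None,                  # None means "do not match on id"
    "break_at_first_tag": True,  # stop collecting at the first nested tag
    "remove_words": ["Price:"],
}

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class RuleExtractor(HTMLParser):
    """Collects the text of the first element that matches the rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.depth = 0      # nesting depth inside the matched element
        self.done = False
        self.parts = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if tag != self.rule["tag"]:
            return False
        classes = (attrs.get("class") or "").split()
        if self.rule["class"] and self.rule["class"] not in classes:
            return False
        if self.rule["id"] and attrs.get("id") != self.rule["id"]:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if self._matches(tag, attrs):
                self.depth = 1
        elif self.rule["break_at_first_tag"]:
            self.done = True
        elif tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def apply_rule(html, rule):
    """Extract the matched text, then apply the word filter and tidy whitespace."""
    parser = RuleExtractor(rule)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    for word in rule["remove_words"]:
        text = text.replace(word, "")
    return " ".join(text.split())
```

Because the rule is plain data, it could be serialized (for example as JSON) into a database column and loaded again for every other page that shares the same layout.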

Ex:

Simple HTML DOM: http://simplehtmldom.sourceforge.net/
XPath: http://docs.jquery.com/DOM/Traversing/Selectors#XPath_Selectors
regex....

Any other suggestions are welcome.

UPDATE: I have used XPath, Simple HTML DOM and regex. For automating the scraping of a particular website and building an easy configurator (an interface used to configure the scraping rules), Simple HTML DOM is the best. Unfortunately, XPath is far from useful in 90% of the cases, while Simple HTML DOM works with great success in at least 50% of the cases.

I have also recently added a regex component where I enter manually written rules, and they work very well (in at least 80% of the cases). It is just a lot of manual work.

Victor Spinei

2 Answers

It depends on the page you are trying to scrape. If it's well formatted (XHTML and/or HTML5 with the right closing tags), you might want to build it using XPath. However, I have seen lots of cases where the best (and harder) approach is to search for specific IDs, divs, and other elements and cut the string from them.
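A rough sketch of that two-step strategy, assuming a strict XML parser stands in for "well-formed markup": try an XPath-style lookup first, and fall back to cutting the string between two literal markers when the page does not parse. It is written in Python with the standard library only; the function names and marker strings are hypothetical, and `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET


def extract_with_xpath(markup, path):
    """Works only on well-formed markup; raises ET.ParseError otherwise."""
    root = ET.fromstring(markup)
    node = root.find(path)
    if node is None:
        return None
    return "".join(node.itertext()).strip()


def cut_between(markup, start_marker, end_marker):
    """Fallback: return the raw text between two literal markers, or None."""
    start = markup.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = markup.find(end_marker, start)
    if end == -1:
        return None
    return markup[start:end].strip()


def extract(markup, path, start_marker, end_marker):
    """Prefer the structured lookup; cut the string if the page is not well formed."""
    try:
        return extract_with_xpath(markup, path)
    except ET.ParseError:
        return cut_between(markup, start_marker, end_marker)
```

The string-cutting branch is brittle: it breaks as soon as the site changes attribute order or whitespace inside the marker, which is part of why it is the harder approach to maintain.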

Ido Green

HtmlDom and XPath are only useful if you know the format of the page. With scraping, that doesn't sound easy.

Regex shouldn't be used for parsing hierarchical data.
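A small illustration of why, sketched in Python with the standard library only (the sample markup and function names are made up for this example): a non-greedy pattern stops at the first closing tag it meets, so nested elements get truncated, while a parser that tracks nesting depth sees the whole element.

```python
import re
from html.parser import HTMLParser

SAMPLE = '<div class="box">outer <div>inner</div> tail</div>'


def regex_extract(html):
    """Naive approach: capture everything up to the first closing </div>."""
    match = re.search(r'<div class="box">(.*?)</div>', html, re.S)
    return match.group(1) if match else None


class BoxText(HTMLParser):
    """Collects all text inside div.box, counting nested divs to find its real end."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1
        elif "box" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def parser_extract(html):
    parser = BoxText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(repr(regex_extract(SAMPLE)))   # 'outer <div>inner'  (cut off at the inner </div>)
print(repr(parser_extract(SAMPLE)))  # 'outer inner tail'
```

A greedy pattern does not fix this either: it would run on to the last closing tag in the document instead of the matching one.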

Sjuul Janssen