I'm working on a website which parses coupon sites and lists their coupons. Some sites provide their listings as an XML file, and I have no problem with those. Other sites do not provide XML, so I'm thinking of parsing their pages and extracting the coupon information from the HTML with PHP. As an example, you can see the following site:

http://www.biglion.ru/moscow/

So, my question is: is there a relatively easy way to parse HTML in PHP and get the data for each coupon listed on that site, just as I do when parsing XML?

Thanks for the help.

cycero

3 Answers

You can always use a DOM parser, but scraping content from sites is unreliable at best.

If their layout changes ever so slightly, your app could fail. Also, scraping is against the terms of service of most sites.
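To make the DOM parser suggestion concrete, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The sample markup, the class names "coupon", "title" and "price", and the function name extractCoupons are all assumptions made up for illustration; a real site will need its own XPath queries.

```php
<?php
// Hypothetical markup: these class names are assumptions for illustration,
// not the real structure of any particular coupon site.
$html = <<<'HTML'
<div class="coupon"><a class="title" href="/deal/1">Sushi set</a><span class="price">500</span></div>
<div class="coupon"><a class="title" href="/deal/2">Spa day</a><span class="price">1200</span></div>
HTML;

// Build an XPath test that matches one class name inside a class attribute.
function hasClass(string $name): string
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
}

// Hypothetical helper: returns one array per coupon block found in $html.
function extractCoupons(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parse warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $coupons = [];
    foreach ($xpath->query('//div[' . hasClass('coupon') . ']') as $node) {
        $link  = $xpath->query('.//a[' . hasClass('title') . ']', $node)->item(0);
        $price = $xpath->query('.//span[' . hasClass('price') . ']', $node)->item(0);
        if ($link === null) {
            continue; // skip blocks without a title link
        }
        $coupons[] = [
            'title' => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
            'price' => $price !== null ? trim($price->textContent) : null,
        ];
    }
    return $coupons;
}

$coupons = extractCoupons($html);
print_r($coupons);
```

Keeping the queries in one small function means that when the layout does change, there is only one place to fix.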

0x6A75616E
  • Hi and thanks for your answer. I've just found a good one called PHP Simple HTML DOM Parser. That actually does the trick. – cycero Dec 17 '11 at 18:59

If you prefer working with PHP, the most reliable method is a DOM parser such as PHP Simple HTML DOM Parser. Here is an example that parses only the a (link) elements.

// Include the library (PHP Simple HTML DOM Parser)
include('simple_html_dom.php');

// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');

// Find all "A" tags and print their HREFs
foreach ($html->find('a') as $e) {
    echo $e->href . '<br>';
}

I am providing some more information about parsing the other html elements too. I hope that will be useful to you.

yanis

While using a DOM parser might seem like a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But if you use a smart enough regex, your code should be immune to changes that do not directly affect the part you're interested in.

One thing to remember is to include some class names in the regex when they're provided, but to assume anything can appear between the pieces of information you need. E.g.

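// Fetch the page, then capture each item's link URL (group 1) and link text (group 2)
$html = file_get_contents('http://www.biglion.ru/moscow/');
preg_match_all('#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s', $html, $matches, PREG_SET_ORDER);
print_r($matches);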
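
To show what "assume anything can be between" buys you, here is a small sketch that runs the same pattern against an inline sample instead of the live site. The sample HTML is invented for illustration (only the class name actionsItemHeadding comes from the pattern above); the first item has extra markup and extra attributes that the pattern should skip over.

```php
<?php
// Invented sample markup; only the class name is taken from the pattern above.
$html = <<<'HTML'
<div class="actionsItemHeadding">
  <span class="badge">-50%</span>
  <a class="link" href="/deals/sushi/" title="x">Sushi set for two</a>
</div>
<div class="actionsItemHeadding"><a href="/deals/spa/">Spa day</a></div>
HTML;

// The lazy .*? plus the "s" modifier lets the match cross newlines and any
// markup between the class name and the link; [^>]* skips other attributes.
$pattern = '#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

// Reshape the raw matches into something easier to work with.
$coupons = [];
foreach ($matches as $m) {
    $coupons[] = [
        'url'   => $m[1],
        'title' => trim(strip_tags($m[2])),
    ];
}
print_r($coupons);
```

Both items are picked up even though their markup differs, which is the kind of tolerance the regex is meant to provide.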
a sad dude