8

I am trying to create a simple alert app for some friends.

Basically, I want to be able to extract data ("price" and "stock availability") from a webpage like the following two:

I have already built the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the webpages (those two or any other ones) so that I can compare the price and the quantity available and alert us to place an order if a product is between some thresholds.

I have tried some regexes (found in some tutorials, but I am way too much of a n00b for this) and haven't managed to get them working. Any good tips or examples?

Andy Lester
Mike
  • 1
    You could post what you have tried so far.... – Felix Kling Jan 07 '10 at 11:38
  • 1
    **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Jun 05 '13 at 04:20

6 Answers

32
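// Fetch the product page; file_get_contents() returns false on failure.
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
if ($content === false) {
    die("Could not fetch the page\n");
}

// The price sits in the <th> cell next to the "price" label.
preg_match('#<tr><th>(.*?)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1] ?? 'n/a';

// The stock level is stored in a hidden form field.
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1] ?? 'n/a';

echo "Price: $price - Availability: $in_stock\n";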
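
To connect this to the threshold alert mentioned in the question, here is a minimal sketch that runs on an inline sample string instead of a live page. The sample markup, the helper names `extract_first` and `should_alert`, and the threshold values are all assumptions made for illustration; adjust the patterns to whatever the real page contains.

```php
<?php
// Hypothetical sample markup, loosely modelled on the patterns used above.
$content = '<tr><th>$24.95</th> <td><b>price</b></td></tr>'
         . '<input type="hidden" name="quantity_on_hand" value="17">';

/**
 * Return the first capture group of $pattern in $subject, or null if nothing matches.
 */
function extract_first(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? trim($m[1]) : null;
}

$rawPrice = extract_first('#<tr><th>(.*?)</th> <td><b>price</b></td></tr>#', $content);
$rawStock = extract_first('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content);

// Strip the currency symbol and any separators before converting to numbers.
$price = $rawPrice === null ? null : (float) preg_replace('/[^0-9.]/', '', $rawPrice);
$stock = $rawStock === null ? null : (int) $rawStock;

/**
 * Alert only when both values were found, the price is at or below $maxPrice,
 * and at least $minStock units are available.
 */
function should_alert(?float $price, ?int $stock, float $maxPrice, int $minStock): bool
{
    return $price !== null && $stock !== null
        && $price <= $maxPrice && $stock >= $minStock;
}

// Hypothetical thresholds: at most 30.00 per unit, at least 5 in stock.
$alert = should_alert($price, $stock, 30.00, 5);
echo $alert ? "Order now: $price / $stock in stock\n" : "No alert\n";
```

Returning null when a pattern does not match lets the script tell "the page layout changed" apart from "the product is out of stock", which is probably worth a separate alert.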
Matteo Riva
  • This works like a charm at first look, and is just the simple solution I was looking for! Thanks a lot – Mike Jan 07 '10 at 12:08
  • Very easily modified to get the product name and other info out of the text... WOW, thanks a lot. It's just the simplest way to get some meaningful data out of many simple websites. – Mike Jan 07 '10 at 12:12
  • 1
    You're welcome :) If you have specific needs, regular expressions can be perfectly fine to mine data from an HTML page. They break if the structure of the page changes, but so do solutions based on parsers. – Matteo Riva Jan 07 '10 at 12:30
  • The only thing that can change is different links on the page or something like that, but I check the website a lot, so I can tell if the design has changed and make the appropriate change in the regex. – Mike Jan 07 '10 at 12:43
  • No matter what, this is the answer I was looking for. For anyone looking to do this, it is worth two minutes to look into. – Mike Jan 08 '10 at 10:29
8

It's called screen scraping, in case you need to google for it.

I would suggest that you use a DOM parser and XPath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.

For example:

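$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document. Header cells live inside rows, so match <th> at any depth:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
  // A DOMNode cannot be echoed directly; print its text content instead.
  echo $node->nodeValue, "\n";
}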
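
For a version that can be tried without fetching anything, here is a small sketch that runs the same idea against an inline string. The markup is a hypothetical stand-in for a product page (a "pricing" table plus a hidden `quantity_on_hand` field), and it skips Tidy, relying instead on `libxml_use_internal_errors()` to keep parser warnings quiet.

```php
<?php
// Hypothetical markup: a "pricing" table plus a hidden stock field.
$html = <<<'HTML'
<html><body>
<table class="pricing">
  <tr><th>$24.95</th><td><b>price</b></td></tr>
</table>
<form><input type="hidden" name="quantity_on_hand" value="17"></form>
</body></html>
HTML;

// Collect libxml's complaints about imperfect HTML instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) yields the text of the first matching node, or "" if none matches.
$price = trim($xpath->evaluate('string(//table[@class="pricing"]//th)'));
$stock = $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

echo "Price: $price - Availability: $stock\n";
```

Because the queries address elements by class and attribute name rather than by exact surrounding text, whitespace or attribute-order changes in the page should not break them, although a renamed class or field still would.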
troelskn
  • 4
    A car is the best choice for general travelling, but if you need to visit your neighbour a simple walk might suffice. – Matteo Riva Jan 07 '10 at 17:24
5

Whatever you do: Don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.

Community
  • I think regular expressions are ok for very specific use cases (i.e. the markup/text is always the same). But of course not for validating HTML etc. Parsers are always a good solution but sometimes they are overkill. – Felix Kling Jan 07 '10 at 11:37
  • I thought a regex would do the trick here since I'm only trying to extract two pieces of information from the page, and the format is quite standard... – Mike Jan 07 '10 at 11:40
  • 1
    @Felix Did you read the graphic description of what happens if you try to parse HTML with regular expressions? If you are very daring, click on the first link in my answer. –  Jan 07 '10 at 11:40
  • 1
    @Mike A "standard" format sounds like an ideal opportunity to use a standard tool: a parser. –  Jan 07 '10 at 11:40
  • @lutz: I only say that if the scope is clear, regex can be a fast/easy solution. I don't say regex should be used to analyze HTML in general. – Felix Kling Jan 07 '10 at 11:48
  • 1
    -1 for linking YET AGAIN that answer. Really, give us a break. – Matteo Riva Jan 07 '10 at 12:01
2

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.

The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

Pekka
  • I only want to do this for me and my friend, so that we can have a script look through the website every hour. They do not support any web services at this time. Database exports... haha, I really don't think so. – Mike Jan 07 '10 at 11:53
  • Yes. Many sites prohibit any kind of automated browsing/downloading/parsing of their sites' contents in their terms of service. In many jurisdictions, this works and can be enforced. It's unlikely there is going to be any trouble in this case but it's still always worth noting. – Pekka Jan 07 '10 at 13:09
  • Pekka do you have some sources on that? I'm interested in this subject – Matteo Riva Jan 07 '10 at 14:19
  • Scraping data and re-publishing it is a copyright offense in most parts of the world. When it comes to scraping it for private use, the situation looks less unequivocal than I thought. I came across this Google Answers question: http://answers.google.com/answers/threadview?id=746810 It is related to India but makes a few international points, too. – Pekka Jan 07 '10 at 15:17
  • Well republishing copyright protected contents is an offense even if you do it by hand, I was interested about the illegal part of making an automated script to extract them -- not what you do with that data. – Matteo Riva Jan 07 '10 at 17:12
  • As I said, it's not as straightforward as re-publishing, and not as easy to attack. Check out the link I posted, there are some pointers there. – Pekka Jan 07 '10 at 17:15
2

First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:

  1. Use Firebug or Chrome/Safari Inspector to explore the HTML content and pattern of interesting information

  2. Test your RegEx to see if it matches. You may need to do it many times (multi-pass parsing/extraction)

  3. Write a client via cURL or, even simpler, use file_get_contents (NOTE that some hosting providers disable loading URLs with file_get_contents via the allow_url_fopen setting)
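
The multi-pass idea in hint 2 could look roughly like the sketch below: one pattern isolates the block of interest, and a second pattern runs only inside that block. The markup and the class names "related" and "pricing" are hypothetical, and this approach assumes the tables are not nested.

```php
<?php
// Hypothetical markup with two tables; only the "pricing" one is of interest.
$content = '<table class="related"><tr><th>$3.50</th></tr></table>'
         . '<table class="pricing"><tr><th>$24.95</th></tr><tr><th>$22.46</th></tr></table>';

$prices = [];

// Pass 1: isolate the block that holds the data.
if (preg_match('#<table class="pricing">(.*?)</table>#s', $content, $block) === 1) {
    // Pass 2: pull every header cell out of that block only.
    preg_match_all('#<th>(.*?)</th>#s', $block[1], $cells);
    $prices = array_map('trim', $cells[1]);
}

print_r($prices);
```

Narrowing the search area first keeps the second pattern simple and avoids picking up look-alike cells elsewhere on the page, such as the price in the "related" table here.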

Personally, I would rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of RegEx. Why? Because XHTML is not a regular language and XPath is very flexible. You can also learn XSLT to transform the result.

Good luck!

Viet
0

This is the simplest method to extract data from a website. I found that all the data I needed was inside <h3> tags, so I prepared this:

<?php
// Requires the third-party Simple HTML DOM library.
include('simple_html_dom.php');

// URL of the page to scrape; replace it with your target page.
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';

// Load the page into a DOM object for further operations.
$html = new simple_html_dom();
$html->load_file($page);

// Collect every <h3> element; change the selector as required.
$headings = array();
foreach ($html->find('h3') as $element) {
    $headings[] = $element;
}

// Print each collected element (echo outputs its HTML).
foreach ($headings as $out) {
    echo $out;
}
?>
Pang