0

I need to scrape data from an HTML page:

<div style="margin-top: 0px; padding-right: 5px;" class="lftFlt1">

    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
    <div style="display: inline; margin-right: 10px;"><a href="" onclick="rate('157204');return false;"><img src="http://icdn.raaga.com/3_s.gif" title="RATING: 3.29" style="position: relative; left: 5px;" height="10" width="60" border="0"></a></div>
    </div>

I need to extract the text "USA USA" and the number 157204 from onclick="setList1(157204);...".

mechanical_meat

5 Answers

2

You should use DOMDocument or XPath. RegEx is generally not recommended for parsing HTML.

Shubham
1

Use regex:

/setList1\(([0-9]+)\)[^>]+title="([^"]+)"/si

Apply it with preg_match() or preg_match_all().
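A minimal sketch of how the pattern above could be applied with preg_match_all(), assuming the markup from the question is already held in a string. The variable names ($html, $pattern, $results) are illustrative and not part of the original answer.

```php
<?php
// Sample markup copied from the question; in practice this would be the fetched page.
$html = <<<'HTML'
<div style="margin-top: 0px; padding-right: 5px;" class="lftFlt1">
    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
</div>
HTML;

// Group 1 captures the numeric ID, group 2 the value of the title attribute.
$pattern = '/setList1\(([0-9]+)\)[^>]+title="([^"]+)"/si';

$results = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $results[] = ['id' => $match[1], 'title' => $match[2]];
    }
}

print_r($results);
```

Note that the pattern only matches when onclick appears before title inside the same tag, so it is tied to this exact markup and will silently miss entries if the attribute order changes.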

Piotr Müller
  • 1
    Regex is not the way to parse HTML. See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and other responses on that thread. –  Jul 30 '10 at 07:40
1

Please go through my previous answers about how to handle HTML with DOM.

XPath to get the Text Content of all anchor elements:

//a/text()

XPath to get the title attribute of all anchor elements:

//a/@title

XPath to get the onclick attribute of all anchor elements:

//a/@onclick

You will have to use some string function to extract the number from the onclick text.
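The XPath approach can be sketched with PHP's built-in DOMDocument and DOMXPath classes. This is only an illustration under stated assumptions: the sample string reuses the markup from the question, the query is narrowed to the contentSubHead class, and a small preg_match() call stands in for the "string function" step. None of the variable names come from the original answer.

```php
<?php
// Sample markup copied from the question; in practice this would be the fetched page.
$html = <<<'HTML'
<div style="margin-top: 0px; padding-right: 5px;" class="lftFlt1">
    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
    <div style="display: inline; margin-right: 10px;"><a href="" onclick="rate('157204');return false;"><img src="http://icdn.raaga.com/3_s.gif" title="RATING: 3.29" height="10" width="60" border="0"></a></div>
</div>
HTML;

// Collect parser warnings internally; real-world HTML rarely validates.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$items = [];
// Restricting to class "contentSubHead" skips the rating link, which also has an onclick.
foreach ($xpath->query('//a[@class="contentSubHead"]') as $anchor) {
    $onclick = $anchor->getAttribute('onclick');
    // Pull the digits out of "setList1(157204);return false;".
    $id = preg_match('/\((\d+)\)/', $onclick, $m) ? $m[1] : null;
    $items[] = [
        'title' => $anchor->getAttribute('title'),
        'text'  => trim($anchor->textContent),
        'id'    => $id,
    ];
}

print_r($items);
```

Here the regex runs only on the short attribute value that the DOM parser has already isolated, not on the HTML itself, so it does not depend on attribute order or tag layout.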

Gordon
0

By far the best library for scraping is Simple HTML DOM. It basically uses jQuery-style selector syntax.

http://simplehtmldom.sourceforge.net/

The way you'd get the data in this example:

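include("simple_html_dom.php");
// file_get_html() loads a file or URL; str_get_html() expects the HTML markup itself
$dom = file_get_html("page.html");
// text content of the first matching anchor
$text = $dom->find(".lftFlt1 a.contentSubHead", 0)->plaintext;
// or its title attribute
$title = $dom->find(".lftFlt1 a.contentSubHead", 0)->title;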
steve
  • 1
  • That's your opinion. How many other libs have you tried? Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html) and [FluentDom](http://www.fluentdom.org). – Gordon Jul 30 '10 at 07:57
  • Yes, it helped me, with one correction: I changed the call to $dom=file_get_html("page.html"); Can you please explain the .lftFlt1 a.contentSubHead part? –  Jul 30 '10 at 08:05
  • @Ram it's a CSS Selector, which (contrary to steve's suggestion) is not a jQuery thing but a W3C standard: http://www.w3.org/TR/CSS2/selector.html – Gordon Jul 30 '10 at 08:14
  • OK, but the page has several elements with class lftFlt1. I used this code and it didn't work: $dom=file_get_html("http://www.raaga.com/channels/tamil/moviedetail.asp?mid=T0001923"); foreach($text=$dom->find(".lftFlt1 a.contentSubHead",0) as $a) { echo $a->plaintext; } –  Jul 30 '10 at 08:22
  • @Ram Obviously. With a second argument, `find` returns only the nth match as a single element, not a list you can loop over. See the docs http://simplehtmldom.sourceforge.net/manual.htm – Gordon Jul 30 '10 at 08:36
  • $str = <<Kakha Kakha HTML; $d = str_get_html($str); foreach($d->find('a') as $e) echo $e->onClick.'
    '; This returns null. Why can't I get the value of onclick?
    –  Jul 30 '10 at 09:18
0

I did it this way:

// $coll is assumed to be the parsed Simple HTML DOM document
foreach ($coll->find('div[class=lftFlt1]') as $element) {
    $text = $element->find("a[class=cursor]", 0)->onclick;
}
Kaletha