Scrape data from an HTML page in PHP

Question

I need to scrape data from an HTML page:

<div style="margin-top: 0px; padding-right: 5px;" class="lftFlt1">

    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
    <div style="display: inline; margin-right: 10px;"><a href="" onclick="rate('157204');return false;"><img src="http://icdn.raaga.com/3_s.gif" title="RATING: 3.29" style="position: relative; left: 5px;" height="10" width="60" border="0"></a></div>
    </div>

I need to extract the text "USA USA" and the number 157204 from the onclick="setList1(157204)..." attribute.

Do you want to scrap or to scrape? Keep the data or keep the HTML and throw the data away? — relet, Jul 30 '10 at 07:18
@relet: good question. I see this misspelling often, so I've edited to 'scrape'. @Ram: please rollback if 'scrap' is truly your intent. — mechanical_meat, Jul 30 '10 at 07:20
Possible duplicate of [Scrape web page contents](http://stackoverflow.com/questions/584826/scrape-web-page-contents) — John Slegers, Feb 25 '16 at 16:54

Shubham · Answer 1 · 2010-07-30T07:35:31.437

2

You should use DOMDocument or XPath. RegEx is generally not recommended for parsing HTML.

edited Jul 30 '10 at 07:35

answered Jul 30 '10 at 07:27

Shubham


score 1 · Answer 2 · answered Jul 30 '10 at 07:26

1

Use regex:

/setList1\(([0-9]+)\)[^>]+title="([^"]+)"/si

with preg_match() for the first match, or preg_match_all() for every match on the page.
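As a minimal sketch of how that pattern could be applied, the snippet below runs it against the anchor from the question. The variable names ($html, $pattern, $id, $title) are illustrative, and the pattern only works while the title attribute follows onclick inside the same tag, so it may break if the markup changes.

```php
<?php
// Sample markup copied from the question (assumed to be representative of the page).
$html = '<a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>';

// Pattern from the answer: group 1 is the numeric id, group 2 is the title.
$pattern = '/setList1\(([0-9]+)\)[^>]+title="([^"]+)"/si';

$id = null;
$title = null;
if (preg_match($pattern, $html, $m)) {
    $id = $m[1];
    $title = $m[2];
}

echo $id, ' ', $title, PHP_EOL;
```

For a page with many such links, preg_match_all() with the PREG_SET_ORDER flag would return one entry per anchor.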

answered Jul 30 '10 at 07:26

Piotr Müller



Regex is not the way to parse HTML. See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and other responses on that thread. – Jul 30 '10 at 07:40

score 1 · Answer 3 · edited May 23 '17 at 12:31

1

Please go through my previous answers about how to handle HTML with DOM.

XPath to get the Text Content of all anchor elements:

//a/text()

XPath to get the title attribute of all anchor elements:

//a/@title

XPath to get the onclick attribute of all anchor elements:

//a/@onclick

You will have to use some string function to extract the number from the onclick text.
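The XPath queries above could be combined with PHP's built-in DOM extension roughly as follows. This is a sketch under assumptions: the sample markup is a trimmed copy of the question's snippet, the variable names are illustrative, and a small regex stands in for the "string function" that pulls the number out of the onclick text.

```php
<?php
// Trimmed copy of the markup from the question.
$html = <<<'HTML'
<div style="margin-top: 0px; padding-right: 5px;" class="lftFlt1">
    <a href="" onclick="setList1(157204);return false;" class="contentSubHead" title="USA USA">USA USA</a>
</div>
HTML;

$doc = new DOMDocument();
// Real pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$results = [];
// Visit every anchor that has an onclick attribute.
foreach ($xpath->query('//a[@onclick]') as $a) {
    // The regex is applied to the attribute value only, not to the HTML itself.
    if (preg_match('/setList1\((\d+)\)/', $a->getAttribute('onclick'), $m)) {
        $results[] = [
            'id'    => $m[1],
            'title' => $a->getAttribute('title'),
            'text'  => trim($a->textContent),
        ];
    }
}

print_r($results);
```

Anchors whose onclick does not call setList1 (such as the rating link in the question) are skipped by the regex check.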

edited May 23 '17 at 12:31

Community


answered Jul 30 '10 at 07:32

Gordon


score 0 · Answer 4 · answered Jul 30 '10 at 07:56

0

By far the best library for scraping is Simple HTML DOM. It basically uses jQuery selector syntax.

http://simplehtmldom.sourceforge.net/

The way you'd get the data in this example:

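include "simple_html_dom.php";
// file_get_html() loads a file or URL; str_get_html() expects the HTML itself as a string
$dom = file_get_html("page.html");
// text of the first matching anchor
$text = $dom->find(".lftFlt1 a.contentSubHead", 0)->plaintext;
// or its title attribute
$text = $dom->find(".lftFlt1 a.contentSubHead", 0)->title;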

answered Jul 30 '10 at 07:56

steve


That's your opinion. How many other libs have you tried? Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html) and [FluentDom](http://www.fluentdom.org). – Gordon Jul 30 '10 at 07:57
Yes, it helped, with one correction: I changed the line to `$dom=file_get_html("page.html");`. Can you please explain what `.lftFlt1 a.contentSubHead` means? – Jul 30 '10 at 08:05
@Ram it's a CSS Selector, which (contrary to steve's suggestion) is not a jQuery thing but a W3C standard: http://www.w3.org/TR/CSS2/selector.html – Gordon Jul 30 '10 at 08:14
OK, but the page has several elements with the class lftFlt1. I used this code and it didn't work: $dom=file_get_html("http://www.raaga.com/channels/tamil/moviedetail.asp?mid=T0001923"); foreach($text=$dom->find(".lftFlt1 a.contentSubHead",0) as $a) { echo $a->plaintext; } – Jul 30 '10 at 08:22
@Ram obviously. With a second argument, `find` returns only the nth match (counting from 0) instead of an array of all matches, so there is nothing to loop over. Drop the `0` to iterate over every match. See the docs http://simplehtmldom.sourceforge.net/manual.htm – Gordon Jul 30 '10 at 08:36
$str = <<Kakha Kakha HTML; $d = str_get_html($str); foreach($d->find('a') as $e) echo $e->onClick.'
'; This returns null. Why can't I get the value of onclick? – Jul 30 '10 at 09:18

score 0 · Answer 5 · answered Apr 07 '11 at 07:05

0

I did it this way

// $coll is the parsed document; loop over every matching div
foreach ($coll->find('div[class=lftFlt1]') as $element) {
    $text = $element->find("a[class=cursor]", 0)->onclick;
}

answered Apr 07 '11 at 07:05

Kaletha

