
I'm trying to get all the news link URLs from a certain div on this site.

When I view the page source to find the links, there is nothing there, yet the data is displayed on the page.

Could anyone who understands PHP, arrays, and JS help me, please?

This is my code to get the content:

$html = file_get_contents("https://qc.yahoo.com/");
if ($html === FALSE) {
    die("?");
}
echo $html;
sikuda
  • I'm having a hard time understanding. It would help if you showed us a sample `$html` input, and what you would like to have when you're done processing. Just a small sample, enough that we understand what you're trying to do. – BeetleJuice Jul 15 '16 at 09:15
  • Hi @BeetleJuice, have you checked http://stackoverflow.com/a/38396700/6516181? That's what I mean. Sorry, I'm not advanced in coding or the terminology. Please help. – sikuda Jul 16 '16 at 02:15

3 Answers

$html = new DOMDocument();
// suppress warnings from malformed HTML while loading the page
@$html->loadHtmlFile('https://qc.yahoo.com/');
$xpath = new DOMXPath($html);
// select the href attribute of every <a> inside the target div
$nodelist = $xpath->query("//div[@id='news_moreTopStories']//a/@href");
foreach ($nodelist as $n) {
    echo $n->nodeValue . "\n";
}

You can get all the links from the divs you specify; make sure you put the target div's id in `id='news_moreTopStories'`. You're using XPath to query the divs, so you don't need a ton of code, just this portion. A variant that also grabs the link text is sketched below.
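
As a small variation (an illustration, not from the original answer), here is a sketch that queries the anchor elements themselves instead of their @href attribute nodes, so you can read both the URL and the visible link text:

// query the <a> elements rather than the @href attribute nodes
$anchors = $xpath->query("//div[@id='news_moreTopStories']//a");
foreach ($anchors as $a) {
    // a DOMElement gives access to both the attribute and the text
    echo $a->getAttribute('href') . " => " . trim($a->textContent) . "\n";
}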

http://php.net/manual/en/class.domxpath.php

unixmiah

Assuming you want to extract all anchor tags with their hyperlinks from the given page.

Now there are certain problems with doing file_get_contents on that URL:

  1. Content compression, i.e. gzip
  2. SSL verification of the URL

So, to overcome the first problem (gzip compression), we'll use cURL as @gregn3 suggested in his answer. But he missed cURL's ability to automatically decompress gzipped content.

For the second problem, you can either follow this guide or disable SSL verification through cURL's curl_setopt options.

Now the code which extracts all the links from the given page is:

<?php

$url = "https://qc.yahoo.com/";

# download resource
$c = curl_init($url);
curl_setopt($c, CURLOPT_HTTPHEADER, ["Accept-Encoding: gzip"]);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
# let curl itself decompress the gzipped response
curl_setopt($c, CURLOPT_ENCODING, "gzip");
curl_setopt($c, CURLOPT_VERBOSE, 1);
# disable SSL verification, as discussed above
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($c, CURLOPT_SSL_VERIFYHOST, 0);
$content = curl_exec($c);

curl_close($c);

# find links
$links = preg_match_all("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars($url) . "<br>";
echo "links found (" . count($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link)
{
    $n++;
    echo "$n: " . htmlspecialchars($link) . "<br>";
}

But if you want to do more advanced HTML parsing, then you'll need to use PHP Simple HTML DOM Parser. In PHP Simple HTML DOM you can select the div using jQuery-like selectors and fetch the anchor tags. Here are its documentation & API manual.
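
For illustration, a minimal sketch of that approach, assuming the library's simple_html_dom.php file is available and reusing the `$content` already fetched by the cURL code above (so the gzip and SSL issues are handled); the div id comes from the accepted answer:

// assumes simple_html_dom.php from PHP Simple HTML DOM Parser is on the include path
include_once 'simple_html_dom.php';

// parse the already-downloaded page instead of fetching it again
$dom = str_get_html($content);

// jQuery-like selector: every <a> inside the div with this id
foreach ($dom->find('div#news_moreTopStories a') as $a) {
    echo $a->href . "<br>";
}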

Deepak Chaudhary
  • Thanks @Deepak, I was not very familiar with cURL, but now I know about this too. :) – gregn3 Jul 18 '16 at 05:51
  • No, I like this; it makes me understand more. Thank you for the explanation and the knowledge. By the way, what social media do you have? I'd like to add you. – sikuda Jul 18 '16 at 05:51
  • :) Sorry, I don't know what *socmed* is. – Deepak Chaudhary Jul 18 '16 at 06:01
  • @DeepakChaudhary Social media, sir. – sikuda Jul 18 '16 at 08:36
  • Ahh. :D I'm not that active on social media. – Deepak Chaudhary Jul 18 '16 at 08:44
  • Hi @DeepakChaudhary, when I try it now, the result is not from the URL I set; it defaults straight to the yahoo.com "US version". How do I get the qc.yahoo.com results? – sikuda Aug 25 '16 at 08:10
  • If you go through yahoo.com, Yahoo will redirect you to the country-specific version (us.yahoo.com) based on the IP you're requesting from, so in order to get qc.yahoo.com results you need to request that explicitly. – Deepak Chaudhary Aug 25 '16 at 08:30
  • @DeepakChaudhary Sorry, I don't understand. I know the PHP request gets redirected to the "US" version, so is it possible or not to get data from qc.yahoo.com? How, sir? Please help. :) – sikuda Aug 25 '16 at 09:46
  • If you want the US version, then use us.yahoo.com instead of qc.yahoo.com. – Deepak Chaudhary Aug 25 '16 at 09:48
  • @DeepakChaudhary What I want is to get all the links on qc.yahoo.com, but the result displays the "US" version. I think Yahoo has protection that redirects all subdomains to the "US" version. Sorry for the earlier misunderstanding; I'm using Google Translate. Here is an image of my attempt: [![2016-08-26_053733.png](https://s3.postimg.org/y3zqjjp9f/2016_08_26_053733.png)](https://postimg.org/image/7ix7nzmvz/) – sikuda Aug 25 '16 at 22:35
  • If Yahoo is redirecting you forcefully, then you can use a French proxy server to do the `cURL` and you may get lucky. For more info see this Stack Overflow answer, [how to use proxy server in php cURL](http://stackoverflow.com/a/9247672/5022546), and [France Proxy Servers List](http://spys.ru/free-proxy-list/FR/) for the proxy server. – Deepak Chaudhary Aug 26 '16 at 09:31

To find all links in the HTML you could use preg_match_all().

$links = preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches);

The URL https://qc.yahoo.com/ uses gzip compression, so you have to detect that and decompress it using the function gzdecode(). (It must be available in your PHP version.)

The gzip compression is indicated by the Content-Encoding: gzip HTTP header. You have to check that header, so you must use curl or a similar method to retrieve the headers. (file_get_contents() will not give you the HTTP headers; it only downloads the gzip-compressed content. You need to detect that it is compressed, but for that you have to read the headers.)

Here is a complete example:

<?php

$url = "https://qc.yahoo.com/";

# download resource
$c = curl_init ($url);
curl_setopt ($c, CURLOPT_HEADER, true);
curl_setopt ($c, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec ($c);
$hsize = curl_getinfo ($c, CURLINFO_HEADER_SIZE);
curl_close ($c);

# separate headers from content
$headers = substr ($content, 0, $hsize);
$content = substr ($content, $hsize);

# check if content is compressed with gzip
$gzip = 0;
$headers = preg_split ('/\r?\n/', $headers);
foreach ($headers as $h)
{
    $pieces = preg_split("/:/", $h, 2);  # split "Header-Name: value"
    $hasValue = (count($pieces) > 1);
    $enc = $hasValue && preg_match("/content-encoding/i", $pieces[0]);
    $gz = $hasValue && preg_match("/gzip/i", $pieces[1]);
    if ($enc && $gz)
    {
        $gzip = 1;
        break;
    }
}

# unzip content if gzipped
if ($gzip)
{
    $content = gzdecode ($content);
}


# find links
$links = preg_match_all ("/href=\"([^\"]+)\"/i", $content, $matches);

# output results
echo "url = " . htmlspecialchars ($url) . "<br>";
echo "links found (" . count ($matches[1]) . "):" . "<br>";
$n = 0;
foreach ($matches[1] as $link)
{
    $n++;
    echo "$n: " . htmlspecialchars ($link) . "<br>";
}
gregn3
  • Hey @gregn3, thank you for understanding my post even though I didn't know the right keywords. After I use your code I get an error. I checked: my PHP is 5.6.23, gzdecode is OK, the zlib extension is loaded, **but** PHP Fatal error: Call to undefined function gzip_inflate() is generated. Why? Please help. – sikuda Jul 16 '16 at 02:14
  • By the way, sorry, I wanted to give an upvote, **but** "Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score." My reputation is bad. T.T – sikuda Jul 16 '16 at 02:18
  • For example, if I open the original site there are 10 links, **but** when I cURL the site it displays only 5 links. How do I display all the links? – sikuda Jul 16 '16 at 06:18
  • @ane Hi ane, to get all links on the page you could try to tweak the regex used. Maybe this is not matching all of them: `"/href=\"([^\"]+)\"/i"` – gregn3 Jul 17 '16 at 15:01
  • @ane You are right, there is only a **gzdecode** or gzinflate function. I'll update my post to use gzdecode. – gregn3 Jul 17 '16 at 15:21
  • Setting the Accept-Encoding header to an empty string (if the server respects that) will give you a plain-text response, with no need to gzdecode it. But the data will not be compressed and hence will consume more bandwidth. – Deepak Chaudhary Jul 17 '16 at 16:13
  • @gregn3 You're a guru; that works, hehe. Thank you so much for the help. Can I have your social media account? – sikuda Jul 18 '16 at 03:00
  • @DeepakChaudhary Please give example code; I want to understand your tutorial. Please help. – sikuda Jul 18 '16 at 03:01
  • @ane, writing that in a separate answer. – Deepak Chaudhary Jul 18 '16 at 04:44
  • @DeepakChaudhary I have tried to set the Accept-Encoding header, but this server doesn't seem to respect it. – gregn3 Jul 18 '16 at 06:06
  • Then adding the curl option `curl_setopt($c, CURLOPT_ENCODING , "gzip");` will do the task. After that, curl itself will decompress the response. – Deepak Chaudhary Jul 18 '16 at 06:41
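
For reference, a minimal sketch of that last suggestion (an illustration, not code from the thread): with CURLOPT_ENCODING set, curl advertises the encoding in the request and transparently decompresses the response, so the manual header check and gzdecode() call above are not needed.

<?php

$c = curl_init("https://qc.yahoo.com/");
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
# "gzip" requests gzip and auto-decompresses; an empty string ""
# would advertise every encoding this curl build supports instead
curl_setopt($c, CURLOPT_ENCODING, "gzip");
$content = curl_exec($c);
curl_close($c);

# $content is already plain HTML here, ready for preg_match_all()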