
I can't figure out why I'm unable to extract a URL from a website's source code with preg_match. Maybe I'm doing it wrong; I've tried many approaches and none of them worked.

The problem is that I'm trying to extract only the URL from source code that looks like this:

<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>

So what I want to store in a variable is http://www.website.com/index.php

I was doing something like this:

preg_match_all('/<h2><a href=".*">/',$text,$m) ;

$text is the source code. It's a really long page, so I only want the href values from <a> tags that are inside <h2> tags. I hope you can help.

2 Answers


You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use DOM for this:

$html = <<<DATA
<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>
<h2><a href="http://www.example.com">Example site</a></h2>
<h1><a href="http://www.bar.com">Bar</a></h1>
<a href="http://www.foo.com">foo</a>
DATA;

$dom = new DOMDocument;
$dom->loadHTML($html); // Load your HTML data..

$xpath = new DOMXPath($dom);

$links = []; // Initialize so print_r() still works when nothing matches
foreach ($xpath->query("//h2/a") as $tag) {
   $links[] = $tag->getAttribute('href');
}

print_r($links);

Output

Array
(
    [0] => http://www.website.com/index.php
    [1] => http://www.example.com
)
hwnd
  • Indeed Reg ex is not the right tool +1 :) http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Valentin Mercier Jul 28 '14 at 00:52
  • I get the source code like this: $text = file_get_contents("http://www.website-to-extract-source.com"); So instead of doing << – user3882717 Jul 28 '14 at 00:59
  • Excellent example, I had no idea about `DOMXPath`. – Havenard Jul 28 '14 at 01:02
  • @hwnd already tried that, it gives me "Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 9 in /home/nyox/public_html/index.php on line 87" Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 9 in /home/nyox/public_html/index.php on line 87 Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 16 in /home/nyox/public_html/index.php on line 87 Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 16 in /home/nyox/public_html/index.php on line 87 – user3882717 Jul 28 '14 at 01:04
  • @hwnd I remember trying `DOMDocument` about 5 years ago for a project, and I remember it being remarkably slow. The `loadHTML()` call alone took about 1 second to load the orkut page. I had to opt for another solution. Do you know if they fixed this performance issue? – Havenard Jul 28 '14 at 01:07
  • It's working, thank you very much :D I really didn't know DOM existed :o – user3882717 Jul 28 '14 at 01:11
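
The "Tag header invalid" and "Tag nav invalid" warnings quoted in the comments appear when libxml's HTML parser meets HTML5 elements it does not recognize. The document is usually still parsed, so one common workaround is to collect those errors internally with `libxml_use_internal_errors(true)` instead of letting them surface as PHP warnings. Below is a minimal sketch of that idea; the page in `$text` and its URLs are made up for illustration, and in practice `$text` could come from `file_get_contents()` as the asker describes.

```php
<?php
// Hypothetical HTML5 page; stands in for the result of file_get_contents($url).
$text = <<<HTML
<!DOCTYPE html>
<html><body>
<header><nav><a href="http://www.nav.example/">Nav</a></nav></header>
<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>
<footer>Footer</footer>
</body></html>
HTML;

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($text);

// Discard the collected errors and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('//h2/a') as $tag) {
    $links[] = $tag->getAttribute('href');
}

print_r($links);
```

Only the link inside the `<h2>` is collected; the link inside `<nav>` is ignored because the XPath query requires an `h2` parent.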

Try this:

<?php
$string = '<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>';
$url = preg_replace('#.*href="([^\"]+)".*#', '\1', $string);
print_r($url);
?>

Output:

http://www.website.com/index.php
imbondbaby
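
The `preg_replace` call above returns a single string, so it suits a snippet that contains exactly one link. If a regular expression is used on a whole page anyway, `preg_match_all` with a capture group can collect every `href` that directly follows an `<h2>`. The sketch below is one possible pattern, not a robust HTML parser: it assumes double-quoted attributes and an `<a>` immediately inside the `<h2>`, and the sample markup is illustrative. The key difference from the `.*` in the question is `[^"]+`, which stops at the first closing quote instead of running to the last one on the line.

```php
<?php
// Illustrative input: two links inside <h2>, one inside <h1>.
$text = <<<HTML
<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>
<h2><a href="http://www.example.com">Example site</a></h2>
<h1><a href="http://www.bar.com">Bar</a></h1>
HTML;

// <h2>, optional whitespace, then an <a ...> tag whose href value is captured.
// [^>]*? skips any attributes that come before href without leaving the tag.
preg_match_all('#<h2>\s*<a\s[^>]*?href="([^"]+)"#i', $text, $m);

// $m[1] holds the contents of the first capture group for every match.
$urls = $m[1];

print_r($urls);
```

Here `$urls` ends up with the two `<h2>` links; the `<h1>` link does not match the pattern.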