Get a URL with preg_match_all (simple)

Question

I can't work out why I'm unable to extract a URL from a website's source code with preg_match. Maybe I'm doing it wrong; I've tried many approaches and none of them work.

The problem: I'm trying to extract only the URL from source code that looks like this:

<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>

What I want to end up with in a variable is http://www.website.com/index.php

I was doing something like this:

preg_match_all('/<h2><a href=".*">/',$text,$m) ;

$text is the source code. It's a very long page, so I only want the href values from <a> tags that are inside <h2> tags. I hope you can help me.

Not sure what that comment means. What output does your `preg_match_all` currently generate? In other words, what does it do that is wrong? — Avery, Jul 28 '14 at 00:49

hwnd · Accepted Answer · 2014-07-28T00:59:07.200

4

You've asked for a regular expression here, but it's not the right tool for parsing HTML. Use DOM for this:

$html = <<<DATA
<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>
<h2><a href="http://www.example.com">Example site</a></h2>
<h1><a href="http://www.bar.com">Bar</a></h1>
<a href="http://www.foo.com">foo</a>
DATA;

$dom = new DOMDocument;
$dom->loadHTML($html); // Load your HTML data..

$xpath = new DOMXPath($dom);

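$links = []; // Initialize so print_r() still works when nothing matches
foreach ($xpath->query("//h2/a") as $tag) {
    $links[] = $tag->getAttribute('href'); // Collect each href inside an <h2>
}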

print_r($links);

Output

Array
(
    [0] => http://www.website.com/index.php
    [1] => http://www.example.com
)

edited Jul 28 '14 at 00:59

answered Jul 28 '14 at 00:49


Indeed, regex is not the right tool. +1 :) http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Valentin Mercier Jul 28 '14 at 00:52
I get the source code like this: $text = file_get_contents("http://www.website-to-extract-source.com"); So instead of doing << – user3882717 Jul 28 '14 at 00:59
Excellent example, I had no idea about `DOMXPath`. – Havenard Jul 28 '14 at 01:02
@hwnd already tried that, it gives me:
    Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 9 in /home/nyox/public_html/index.php on line 87
    Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 9 in /home/nyox/public_html/index.php on line 87
    Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity, line: 16 in /home/nyox/public_html/index.php on line 87
    Warning: DOMDocument::loadHTML(): Tag footer invalid in Entity, line: 16 in /home/nyox/public_html/index.php on line 87
– user3882717 Jul 28 '14 at 01:04
@hwnd I remember trying `DOMDocument` about 5 years ago for a project, and I remember it being remarkably slow. The `loadHTML()` call alone took about 1 second to load the orkut page. I had to opt for another solution. Do you know if they fixed this performance issue? – Havenard Jul 28 '14 at 01:07
It's working, thank you very much :D I really didn't know DOM existed :o – user3882717 Jul 28 '14 at 01:11
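The "Tag header invalid" and "Tag nav invalid" warnings quoted in the comments above come from libxml, which predates HTML5 elements. A common way to silence them is to switch libxml to internal error handling around the `loadHTML()` call. The following is a minimal sketch of that idea, not necessarily what the commenter ended up doing; the `$text` sample is hypothetical and stands in for the string returned by `file_get_contents()`.

```php
<?php
// Hypothetical sample standing in for a fetched page. It uses HTML5 tags
// (header, nav, footer) of the kind that trigger "Tag ... invalid" warnings.
$text = <<<HTML
<!DOCTYPE html>
<html><body>
<header><nav><a href="http://www.foo.com">foo</a></nav></header>
<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>
<h2><a href="http://www.example.com">Example site</a></h2>
<footer><a href="http://www.bar.com">Bar</a></footer>
</body></html>
HTML;

// Collect parse errors internally instead of emitting them as PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($text);

// Drop the collected errors and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('//h2/a') as $tag) {
    $links[] = $tag->getAttribute('href');
}

print_r($links);
```

Only the two links inside `<h2>` end up in `$links`; the anchors inside `<nav>` and `<footer>` are ignored by the `//h2/a` query.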

imbondbaby · Answer 2 · 2014-07-28T00:49:42.687

0

Try this:

<?php
$string = '<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>';
$url = preg_replace('#.*href="([^\"]+)".*#', '\1', $string);
print_r($url);
?>

Output:

http://www.website.com/index.php
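For completeness, here is a hedged sketch of how the original `preg_match_all` attempt could be adjusted. The pattern in the question has no capture group, and its greedy `.*` runs to the last `">` on the line, so the match includes the `h="..."` attribute as well. The version below assumes the markup always has the exact shape `<h2><a href="...">` with double-quoted attributes; it breaks on anything else, which is why a DOM parser is the more robust choice. The sample `$text` is taken from the accepted answer's test data.

```php
<?php
$text = <<<HTML
<h2><a href="http://www.website.com/index.php" h="ID=SERP,5085.1">Website name</a></h2>
<h2><a href="http://www.example.com">Example site</a></h2>
<h1><a href="http://www.bar.com">Bar</a></h1>
<a href="http://www.foo.com">foo</a>
HTML;

// ([^"]+) captures everything up to the closing quote of href,
// unlike the greedy .* in the question, which runs past it.
preg_match_all('#<h2>\s*<a\s+href="([^"]+)"#i', $text, $m);

// $m[0] holds the full matches; $m[1] holds the captured URLs.
$urls = $m[1];

print_r($urls);
```

Here `$urls` contains only the two links that sit directly inside an `<h2>`.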

edited Jul 28 '14 at 00:49

answered Jul 28 '14 at 00:43

