Please help me get the link and the text from this tag. The `<h3 class="post-title entry-title">` has to be part of the match, because I only want the links inside that specific tag.

<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>

My work so far is:

<?php

$string = file_get_contents('http://www.domain.com');

$regex_pattern = "";

unset($matches);
preg_match_all($regex_pattern, $string, $matches);


foreach ($matches[0] as $paragraph) {
echo $paragraph;
echo "<br>";
}
?> 

Thank you in advance

EnexoOnoma
  • possible duplicate of [Regular expression for grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/regular-expression-for-grabbing-the-href-attribute-of-an-a-element) – Gordon Mar 24 '11 at 23:08
  • additional usage examples: http://stackoverflow.com/questions/3893375/how-can-i-scrape-a-website-with-invalid-html/3894558#3894558 – Gordon Mar 24 '11 at 23:13
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Mar 24 '11 at 23:14

4 Answers

Don't use regex to parse HTML. It's a bad idea. Use an HTML/XML parser. Since you are using PHP, you can try using PHP Tidy or DOMDocument. It will make your life much easier.
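
As a rough illustration of the DOMDocument route, here is a minimal sketch that reads both the link and its text from the markup in the question. It assumes the class attribute is exactly `post-title entry-title`; the variable names (`$html`, `$anchors`, `$results`) are made up for this example, and on a real page `$html` would come from `file_get_contents()`.

```php
<?php
// Sample markup copied from the question.
$html = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Every <a> that is a direct child of an <h3> with exactly this class value.
$anchors = $xpath->query('//h3[@class="post-title entry-title"]/a');

$results = [];
foreach ($anchors as $a) {
    $results[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

foreach ($results as $row) {
    echo $row['href'], ' => ', $row['text'], "\n";
}
```

The exact-match test on `@class` is a simplification: it would miss an h3 whose class list is in a different order or has extra classes.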

Vivin Paliath
  • I don't think Tidy is really appropriate here -- it offers only the most basic DOM traversal. DOMDocument + XPath is preferable, especially in the context of the OP's requirements. He's essentially described an xpath query something like `//h3[@class="post-title entry-title"]/a/@href` – Frank Farmer Mar 24 '11 at 23:16
  • @Frank I think it depends on what he wants to use. XPath would, of course, be the best. But I find Tidy useful if I'm getting HTML from some other source. I use Tidy to clean up the HTML before I parse it. – Vivin Paliath Mar 24 '11 at 23:18
  • Hi there, I managed to get the link, thank you! But how can I get the text as well? – EnexoOnoma Mar 24 '11 at 23:45
  • @Giannis What are you using? Tidy or `DOMDocument`? If you're using `DOMDocument`, you should be able to obtain a reference to a `DOMNode` object. This object has a `$textContent` property which should give you what you need. See [`DOMNode`](http://www.php.net/manual/en/class.domnode.php). – Vivin Paliath Mar 24 '11 at 23:52
  • I did it :) One last question: why can't I do the same on this link? http://feeds.feedburner.com/blogspot/hyMBI Does it have any protection? – EnexoOnoma Mar 25 '11 at 00:22
  • @JBCurious I'm not sure what you mean by "protections"? What are you trying to do with that site? – Vivin Paliath Mar 25 '11 at 15:27
  • @vivin -- DomDocument does pretty well with your average untidy tag-soup HTML, surprisingly; the only bad HTML I remember it struggling with had a tag right smack in the middle of it (worst. html. ever.). – Frank Farmer Mar 25 '11 at 19:46
  • @Frank Ah, I didn't know that. I guess I'm just paranoid about the structure! – Vivin Paliath Mar 25 '11 at 20:53

I would recommend using DOMDocument and XPath to extract the URL from the page instead of a regexp.

This tutorial is a good starting point for using XPath and the DOM: http://www.merchantos.com/blog/makebeta/php/scraping-links-with-php#php_dom

Edit: if you use the Firebug add-on in Firefox, you can inspect the element on the page and copy its XPath.

heldt

The regex:

(?<=href=").+(?=")

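This should match the value between the quotes of an href attribute. Note that `.+` is greedy: if the line contains more double quotes after the URL, the match runs up to the last one. Using `[^"]+` in place of `.+` stops at the first closing quote. Also note that it matches every href on the page, not only those inside the h3.

You can test this in RegexStorm.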
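
To see how the lookaround pattern behaves when a tag carries more than one quoted attribute, here is a small sketch. The sample anchor and its `title` attribute are invented for illustration, and the `[^"]+` variant is a suggested adjustment, not part of the pattern above.

```php
<?php
// Hypothetical anchor with a second quoted attribute after href.
$html = '<a href="http://www.google.com/" title="Google">More Text</a>';

// Greedy .+ keeps going until the last double quote on the line.
preg_match('/(?<=href=").+(?=")/', $html, $greedy);

// A negated character class cannot cross a quote, so it stops at the end of the URL.
preg_match('/(?<=href=")[^"]+(?=")/', $html, $bounded);

echo $greedy[0], "\n";  // http://www.google.com/" title="Google
echo $bounded[0], "\n"; // http://www.google.com/
```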

Bodman

Following your example, this regex will find "http://mymplogk.blogspot.com/2011/03/h_25.html" and "Text":

$regex_pattern = '/<h3[^>]+class\s*=\s*[\'"]post-title entry-title[\'"][^>]*>.*?<a[^>]+href\s*=\s*"([^"]+)"[^>]*>([^<]*)</s';

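This accepts single or double quotes around the h3 class value, allows additional attributes in the h3 tag, and tolerates optional whitespace around the `=` sign. The href value itself must be double-quoted. The first capture group holds the link and the second holds the text. It also matches multiple times in $string, e.g.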

$string = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';
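
To get the link and text out of each match, the pattern can be run through `preg_match_all` with `PREG_SET_ORDER`. The following is a minimal sketch that reuses the pattern and sample string from this answer; `$matches` and `$links` are just illustrative names.

```php
<?php
$regex_pattern = '/<h3[^>]+class\s*=\s*[\'"]post-title entry-title[\'"][^>]*>.*?<a[^>]+href\s*=\s*"([^"]+)"[^>]*>([^<]*)</s';

$string = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';

// PREG_SET_ORDER groups captures per match: [0] whole match, [1] href, [2] link text.
preg_match_all($regex_pattern, $string, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'text' => $m[2]];
    echo $m[1], ' => ', $m[2], "<br>\n";
}
```

With the sample string this yields two entries, one for each h3 block, regardless of which quote style the class attribute uses.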
xn.
  • Hi there, thanks! It can't display any results, but I think this is because the tag has single quotes in the page source, not double. Could you please modify your pattern? I can't escape the characters correctly myself. This is what I had before: $regex_pattern = "/ ([^`]*?)<\/h3>/"; – EnexoOnoma Mar 24 '11 at 23:41
  • (I don't know if this is a good answer, but +1 for polishing the edits!) – Arjan Mar 25 '11 at 11:18