
I'm a complete newbie with regular expressions, and I feel bad about it now that I need some serious advice on how to extract the link text from an <a href> tag, e.g.:

<a href="article.html?id=1999874">This article is cool</a>

I need to extract "This article is cool", bearing in mind that the "article.html?id=" part MUST be part of the match. I tried it with

preg_match_all('/<a href="article.html?id=([0-9])">([^<]*)<\/a>/', $webpage, $match);

and what I get back is just

Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) )

Thanks for any valuable advice!

Andy Lester
Sates
  • Related: http://stackoverflow.com/a/1732454/1415038 – woz Dec 11 '13 at 18:25
  • `[0-9]+` because the id contains multiple digits, also escape `?` – nice ass Dec 11 '13 at 18:27
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Dec 11 '13 at 19:02
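
As an illustration of the parser-based approach recommended in the comment above, here is one possible sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). It is not taken from the linked resources; the sample markup and the names $webpage, $links and $titles are assumptions made for the example, and it presumes the dom extension is enabled.

```php
<?php
// Hypothetical page fragment; the second link deliberately does not match.
$webpage = '<p><a class="x" href="article.html?id=1999874">This article is cool</a> '
         . '<a href="about.html">About</a></p>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($webpage);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every <a> whose href begins with the fixed prefix.
$links = $xpath->query('//a[starts-with(@href, "article.html?id=")]');

$titles = [];
foreach ($links as $a) {
    $titles[] = trim($a->textContent);
}

print_r($titles);
```

Because the query looks only at the href attribute, extra attributes such as class or target on the tag should not affect the result.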

2 Answers


Onetrickpony got to the heart of what's wrong with your regex: your numeric ID has multiple digits, but your regex only matches a single digit.
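
A minimal sketch of that difference, using a made-up sample string and hypothetical variable names ($html, $single, $multi); the patterns are cut down to the id part so that only the quantifier differs:

```php
<?php
// Hypothetical sample input, modelled on the link in the question.
$html = '<a href="article.html?id=1999874">This article is cool</a>';

// [0-9] matches exactly one digit, so the closing quote is expected right after "1".
$single = preg_match('/id=([0-9])"/', $html, $m1);

// [0-9]+ matches one or more digits, so the whole id is captured.
$multi = preg_match('/id=([0-9]+)"/', $html, $m2);

var_dump($single);  // int(0)
var_dump($m2[1]);   // string(7) "1999874"
```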

There are some other things I'm going to throw out there for your consideration. First, if there are other attributes in your <a> tag, your regex will fail. For example, if there's a target="_blank" attribute, it will mess up your regex. Fortunately, there's an easy way around that:

preg_match_all('/<a .*?href="article\.html\?id=([0-9]+)".*?>(.*?)<\/a>/',
    $webpage, $match);

Essentially, I just padded the href attribute with .*? on both sides. The question mark makes the match lazy (instead of the default, greedy), which prevents it from consuming more than you want. I also replaced your [^<]* with a lazy match, because I generally find it a little cleaner.
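
To make the lazy versus greedy difference concrete, here is a small sketch. The two-link input string and the names $lazy and $greedy are assumptions for illustration; the first pattern is the one shown above, and the second differs only in using a greedy (.*) for the link text.

```php
<?php
// Hypothetical input: two links on one line, the first with an extra attribute.
$html = '<a target="_blank" href="article.html?id=1">First</a> <a href="article.html?id=22">Second</a>';

// Lazy .*? stops at the first possible </a>, so each link is matched separately.
preg_match_all('/<a .*?href="article\.html\?id=([0-9]+)".*?>(.*?)<\/a>/', $html, $lazy);

// Greedy .* runs on to the last </a>, so the text group swallows both links.
preg_match_all('/<a .*?href="article\.html\?id=([0-9]+)".*?>(.*)<\/a>/', $html, $greedy);

print_r($lazy[2]);    // the two link texts: "First" and "Second"
print_r($greedy[2]);  // a single entry spanning both links
```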

UPDATE: As demonking correctly pointed out, the period and question mark in article.html?id= need to be escaped. The period doesn't matter so much, except that leaving it in there would match article_html or anything else, which is probably not a concern. However, not escaping the question mark is trouble. It makes the l in html optional, but then there's nothing to actually match the question mark, which is probably why my uncorrected solution was failing. Thanks, demonking!
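
The effect of the unescaped characters can be checked in isolation. This is only a sketch; the test strings and the names $unescaped and $escaped are made up for the demonstration.

```php
<?php
// Unescaped: "l?" means "an optional l", and "." matches any character.
$unescaped = '/article.html?id=/';
// Escaped: "\." and "\?" match a literal dot and a literal question mark.
$escaped = '/article\.html\?id=/';

var_dump(preg_match($unescaped, 'article.html?id=1'));  // int(0): nothing consumes the literal "?"
var_dump(preg_match($unescaped, 'article_htmid=1'));    // int(1): any char for ".", and the "l" is skipped
var_dump(preg_match($escaped, 'article.html?id=1'));    // int(1)
```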

Ethan Brown
  • Thank you for your response, but still, what I receive back with $match is: Array ( [0] => Array ( ) [1] => Array ( ) [2] => Array ( ) ) – Sates Dec 11 '13 at 18:40
  • The problem here is that you have to escape the ? and the . (dot), because these are regex metacharacters – demonking Dec 11 '13 at 19:00
  • Oh, you're right about the ?. Escaping the dot is not as important, but you're technically correct. Good call, demonking. – Ethan Brown Dec 11 '13 at 19:02

Your regex should look something like this:

<a.?href="article\.html\?id=([0-9]+?)">(.+)?<\/a>

The problem is that if someone adds a class or id attribute to the <a> tag, the regex will no longer match.

Example:

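<?php

// Sample input: a single link in the expected format.
$str = '<a href="article.html?id=1999874">This article is cool</a>';

$matches = array();

// Group 1 captures the numeric id, group 2 captures the link text.
preg_match_all('/<a.?href="article\.html\?id=([0-9]+?)">(.+)?<\/a>/', $str, $matches);

var_dump($matches);

?>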

Output:

array(3) {
  [0]=>
  array(1) {
    [0]=>
    string(58) "<a href="article.html?id=1999874">This article is cool</a>"
  }
  [1]=>
  array(1) {
    [0]=>
    string(7) "1999874"
  }
  [2]=>
  array(1) {
    [0]=>
    string(20) "This article is cool"
  }
}
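
The limitation with extra attributes mentioned above can be seen with a quick check. This is a sketch only; the $withClass string is a hypothetical variant of the link from the example, and the variable names are made up.

```php
<?php
$pattern = '/<a.?href="article\.html\?id=([0-9]+?)">(.+)?<\/a>/';

// Hypothetical variants of the link from the example above.
$plain     = '<a href="article.html?id=1999874">This article is cool</a>';
$withClass = '<a class="hot" href="article.html?id=1999874">This article is cool</a>';

var_dump(preg_match($pattern, $plain));      // int(1)
var_dump(preg_match($pattern, $withClass));  // int(0): ".?" allows at most one character before href
```
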
demonking