I want to extract the value of t (for example, 558246017) from the sample below.
My preg_match_all call fails to match it.

$str = '<a target="frameleft" href="Home.aspx?t=558246017">START</a>';
preg_match_all('/<a target="frameleft" href="Home.aspx?t=\d+">(.*?)<\/a>/si', $str, $matches);
print_r($matches);

Please help me resolve this problem.

Atur
DolDurma
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester May 16 '14 at 14:37

2 Answers

HTML is not a regular language and can't be reliably parsed using regular expressions. I'd suggest you use a DOM parser instead. PHP has a built-in class (DOMDocument) that excels at this sort of task. The advantage of an HTML parser over regular expressions is that it works on the structure of the markup rather than its exact text, so the results are more dependable. A regex-based solution might break when the format of the markup changes in the future, whereas a solution based on a DOM parser is much less likely to.

You can use DOMDocument to load the string and get the href attribute value first. Then use parse_url() and parse_str() to get the required parameter:

$str = '<a target="frameleft" href="Home.aspx?t=558246017">START</a>';

$dom = new DOMDocument;
$dom->loadHTML($str);

foreach ($dom->getElementsByTagName('a') as $tag) {
    $querystr = parse_url($tag->getAttribute('href'), PHP_URL_QUERY);
    parse_str($querystr, $params);
    echo $params['t'] . PHP_EOL;
}

Output:

558246017
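
The same loop can be made a little more defensive for markup where some links carry no query string or no t parameter. This is only a sketch: the function name extractTValues and the second sample link are made up for illustration, and it assumes that such links should simply be skipped.

```php
<?php
// Hypothetical helper: collect the "t" query parameter of every <a> element.
function extractTValues(string $html): array
{
    // Keep libxml quiet about imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        // parse_url() returns null when the href has no query string.
        $query = parse_url($tag->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // assumption: links without a query string are skipped
        }
        parse_str($query, $params);
        if (isset($params['t']) && is_string($params['t'])) {
            $values[] = $params['t'];
        }
    }
    return $values;
}

// The second link is an invented example of an href without a query string.
$html = '<a target="frameleft" href="Home.aspx?t=558246017">START</a>'
      . '<a href="About.aspx">ABOUT</a>';

print_r(extractTValues($html));
```

Only the first link contributes a value here, since the second href has no query string.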

Amal Murali
  • @amal-murali Thank you very much. With your code I cannot get href values such as ` – DolDurma May 16 '14 at 15:21
  • @TuxWorld: What do you want as the output in such cases where a query string may not be present? Please update the question to include more details and the expected output. – Amal Murali May 16 '14 at 18:45

HTML is not a regular language, so you should not use regular expressions to parse it. Use a DOM parser like DOMDocument instead. However, for the sake of learning, I will show what was wrong with your expression.

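Your problem is that ? and . are metacharacters. An unescaped ? makes the preceding token optional, so aspx?t= matches "aspt=" or "aspxt=" but never a literal question mark, and an unescaped . matches any character rather than only a dot. Escape them using \: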

<a target="frameleft" href="Home\.aspx\?t=\d+">(.*?)<\/a>
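
A small check of the ? behaviour, as a sketch (the shortened patterns and variable names are only for illustration, not taken from the question):

```php
<?php
$href = 'Home.aspx?t=558246017';

// Unescaped: "x?" means "an optional x", so a literal "?" is never matched.
$unescaped = preg_match('/aspx?t=/', $href);

// Escaped: "\?" matches the literal question mark in the URL.
$escaped = preg_match('/aspx\?t=/', $href);

var_dump($unescaped); // int(0)
var_dump($escaped);   // int(1)
```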

Also, the s modifier means dot-matches-newline. So, unless you expect the links to have line breaks in them, it is unnecessary.
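
To illustrate what the s modifier changes, here is a minimal sketch with a made-up two-line string:

```php
<?php
$text = "a\nb";

// Without s, "." does not match a newline.
$withoutS = preg_match('/a.b/', $text);

// With s, "." also matches the newline between "a" and "b".
$withS = preg_match('/a.b/s', $text);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```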


I also just noticed that you wanted the "t" value. Currently you are capturing the contents of the link ((.*?)); instead, you want to capture the value of t (\d+). Modify the expression to:

<a target="frameleft" href="Home\.aspx\?t=(\d+)">.*?<\/a>
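
Plugged back into the code from the question, that expression could be used roughly like this (a sketch; the variable names $pattern and $tValues are just for illustration, and the s modifier has been dropped):

```php
<?php
$str = '<a target="frameleft" href="Home.aspx?t=558246017">START</a>';

// "." and "?" are escaped, and the capture group now wraps \d+.
$pattern = '/<a target="frameleft" href="Home\.aspx\?t=(\d+)">.*?<\/a>/i';

preg_match_all($pattern, $str, $matches);

// $matches[0] holds the full matches, $matches[1] the captured t values.
$tValues = $matches[1];
print_r($tValues);
```

With the sample string, $tValues contains the single entry 558246017.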
Sam