Parsing an HTML string in PHP using a regular expression

Question

I want to parse an HTML string using PHP (simple number matching).

<i>1002</i><i>999</i><i>344</i><i>663</i>

and I want the result as an array, e.g. [1002,999,344,663,...]. I tried this:

<?php
    $html="<i>1002</i><i>999</i><i>344</i><i>663</i>";
    if(preg_match_all("/<i>[0-9]*<\/i>/",$html, $matches,PREG_SET_ORDER))
        foreach($matches as $match) {
            echo strip_tags($match[0])."<br/>";
        }
?>

and I got exactly the output I wanted.

But when I run the same code with a small change to the regular expression, I get a different result.

Like this:

<?php
    $html="<i>1002</i><i>999</i><i>344</i><i>663</i>";
    if(preg_match_all("/<i>.*<\/i>/",$html, $matches,PREG_SET_ORDER))
        foreach($matches as $match) {
            echo strip_tags($match[0])."<br/>";
        }
?>

Output :

1002999344663

(The regular expression matched the entire string.)

Why do I get this result? What is the difference between using .* (zero or more of any character) and [0-9]* (zero or more digits)?

K. So what is '?' there. – Vishal Vijay, Feb 19 '13 at 22:02
@VishalVijay: I'll explain that in an answer :P – gen_Eric, Feb 19 '13 at 22:03

score 1 · Accepted Answer · answered Feb 19 '13 at 22:05

The .* in your regex matches any character, including the < and > of the tags. [0-9]* only matches digits, and < isn't a digit, so it has to stop at the end of each number. The regex /<i>.*<\/i>/ matches:

<i>1002</i><i>999</i><i>344</i><i>663</i>
^ from here ------------------- to here ^

since the whole string starts with <i> and ends with </i>.

This is because * is greedy: it takes the maximum number of characters it can match.
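As a rough illustration of that greedy behaviour, the sketch below runs both patterns from the question against the same input and counts the matches. The variable names ($greedyCount, $digitCount, and so on) are made up for this example.

```php
<?php
// Same input as in the question.
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";

// Greedy: .* runs to the end of the string, then backtracks
// only as far as the LAST </i>, so there is a single match.
$greedyCount = preg_match_all('/<i>.*<\/i>/', $html, $greedy);

// Digit class: [0-9]* cannot consume "<", so it stops at the end
// of each number and every <i>...</i> element matches on its own.
$digitCount = preg_match_all('/<i>[0-9]*<\/i>/', $html, $digits);

echo $greedyCount, "\n";   // 1
echo $greedy[0][0], "\n";  // the whole input string
echo $digitCount, "\n";    // 4
```

With the default flag (PREG_PATTERN_ORDER), index 0 of the result array holds the list of full matches, which is why the single greedy match is read from $greedy[0][0].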

To fix your problem, use .*? instead. The ? makes the quantifier lazy, so it takes the minimum number of characters it can match.

The regex /<i>.*?<\/i>/ will work as you want.
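A minimal sketch of the lazy version, producing the array asked for in the question. The capture group, the intval conversion and the $numbers variable are additions for this example rather than part of the original code, which used strip_tags on each full match instead.

```php
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";

// Lazy .*? stops at the FIRST </i> it can reach; the parentheses
// capture only the text between the tags.
preg_match_all('/<i>(.*?)<\/i>/', $html, $matches);

// $matches[1] holds the captured strings; intval turns them into integers.
$numbers = array_map('intval', $matches[1]);

print_r($numbers); // [1002, 999, 344, 663]
```

Capturing the inner text avoids the separate strip_tags call. If only digits are ever expected, keeping [0-9]+ inside the group is stricter than .*? and would reject non-numeric content.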
