I have a regex pattern that is meant to capture the src and the height (which may be in either the height attribute or the style attribute) from some <img> HTML elements. Here is my pattern:

/img[^\>]*(?:height="([\d]+)")?[^\>]*src="([^"]+)"[^\>]*(?:style\="height:([\d]+)px;?[^"]+")?[^\>]*/i

I use the preg_match_all function to search for the following string:

<img alt="" height="200" src="http://www.example.com/example.png" width="1500" style="height:200px;" />

The src is captured without a problem, but the height groups are never captured. What is wrong with my regex pattern?

micmia
  • This is called parsing. Don't use Regular Expressions for parsing HTML documents. Use a DOM parser instead. – revo Mar 12 '18 at 12:09
  • Because the `height` group is followed by `?` it becomes optional. The `[^\>]*` subexpression in front of it is greedy and matches everything until `src=`. Btw, `>` is not a special regex character, it doesn't need to be escaped. The same for `=`. Read about [meta characters](http://php.net/manual/en/regexp.reference.meta.php) and [repetition](http://php.net/manual/en/regexp.reference.repetition.php) in PHP PCRE, then get rid of the `regex` (it won't match if the attributes are in a different order) and [use a DOM parser to parse HTML fragments](https://stackoverflow.com/a/1732454/4265352). – axiac Mar 12 '18 at 12:16
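
The greedy behaviour described in the comment above can be reproduced directly. The following is a minimal sketch that runs the pattern and the string exactly as posted in the question; the variable names are illustrative only.

```php
<?php
// Pattern and subject string copied verbatim from the question.
$pattern = '/img[^\>]*(?:height="([\d]+)")?[^\>]*src="([^"]+)"[^\>]*(?:style\="height:([\d]+)px;?[^"]+")?[^\>]*/i';
$subject = '<img alt="" height="200" src="http://www.example.com/example.png" width="1500" style="height:200px;" />';

preg_match($pattern, $subject, $m);

// The first [^\>]* is greedy: it swallows height="200" on its way to src=,
// and the optional height group is then satisfied by matching nothing.
var_dump($m[1]);        // empty string: group 1 did not capture anything
var_dump($m[2]);        // the src value is captured as expected
// The optional style group at the end is skipped in the same way.
// preg_match() drops unmatched groups at the end of the pattern.
var_dump(isset($m[3])); // false
```

Because both height groups are optional, the engine never has to backtrack far enough to fill them, which is why only src shows up in the result.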

2 Answers

If JavaScript is an option for you, you could use the browser's DOM instead of a regex to get the src and the height:

var div = document.createElement('div');
div.innerHTML = '<img alt="" height="200" src="http://www.example.com/example.png" width="1500" style="height:200px;" />';
var elm = div.firstChild;
console.log(elm.src);
console.log(elm.height);
console.log(elm.style.height);
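
Since the question uses PHP's preg_match_all, the same DOM-based idea might look roughly like this in PHP. This is a minimal sketch, assuming the bundled dom extension (DOMDocument) is enabled; the variable names and the small regex used to read the inline style are illustrative and not part of the original code.

```php
<?php
$html = '<img alt="" height="200" src="http://www.example.com/example.png" width="1500" style="height:200px;" />';

// Collect libxml warnings internally instead of printing them for HTML fragments.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$results = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // The DOM treats the style attribute as plain text, so a small regex
    // (an assumption of this sketch) pulls the pixel height out of it.
    $style = $img->getAttribute('style');
    $styleHeight = preg_match('/(?:^|;)\s*height\s*:\s*(\d+)px/i', $style, $m) ? $m[1] : null;

    $results[] = [
        'src'          => $img->getAttribute('src'),
        'height'       => $img->getAttribute('height'), // empty string when the attribute is absent
        'style_height' => $styleHeight,
    ];
}

print_r($results);
```

Attribute order and quoting style no longer matter here, because the parser handles the markup and the regex only has to deal with the contents of a single attribute.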
The fourth bird

If you choose to parse with a regex, it is better to capture the information step by step:

  1. First, capture the img elements.

  2. Then, inside each element, capture the src, height, and style height attributes.

This way you don't need to worry if the order of the attributes changes in the future. Code example:

$str = '<img alt="" height="210" src="http://www.example.com/example1.png" width="1500" style="height:220px;" />
        <img alt="" src="http://www.example.com/example2.png" height="230" width="1500" style="height:240px;" />';

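// Step 1: capture every <img> tag.
preg_match_all('#<img[^>]*>#mui', $str, $images, PREG_SET_ORDER);

foreach ($images as $img) {
    // Step 2: search the tag for each attribute separately.
    preg_match('#src="[^"]+"#mui', $img[0],            $m_src);
    preg_match('#height="\d+"#mui', $img[0],           $m_height);
    preg_match('#style="height:\d+px;?"#mui', $img[0], $m_st_height);

    // A failed match leaves the array empty, so fall back to null (PHP 7+).
    var_dump('<pre>', $m_src[0] ?? null, $m_height[0] ?? null, $m_st_height[0] ?? null, '<hr></pre>');
}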
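
The code above returns the whole attribute text, such as src="...". If only the values are needed, capture groups can be added to each lookup. The following is a sketch of that variation; the helper name extract_img_info and the array keys are hypothetical, and the style pattern is an assumption that only handles a pixel height.

```php
<?php
// Hypothetical helper: returns the src, the height attribute and the inline
// style height (each null when missing) for every <img> tag found in $html.
function extract_img_info(string $html): array
{
    $found = [];
    preg_match_all('#<img\b[^>]*>#i', $html, $images);

    foreach ($images[0] as $tag) {
        // Each lookup is independent, so the order of the attributes does not matter.
        $found[] = [
            'src'          => preg_match('#\ssrc="([^"]+)"#i', $tag, $m) ? $m[1] : null,
            'height'       => preg_match('#\sheight="(\d+)"#i', $tag, $m) ? $m[1] : null,
            'style_height' => preg_match('#\sstyle="(?:[^"]*;)?\s*height:\s*(\d+)px#i', $tag, $m) ? $m[1] : null,
        ];
    }

    return $found;
}

$str = '<img alt="" height="210" src="http://www.example.com/example1.png" width="1500" style="height:220px;" />
        <img alt="" src="http://www.example.com/example2.png" height="230" width="1500" style="height:240px;" />';

print_r(extract_img_info($str));
```

Requiring whitespace before each attribute name keeps the height lookup from matching inside names such as data-height.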

Agnius Vasiliauskas