I would like to find a regular expression that could find (in given HTML) the following images:

  • Those captured in: src=""
  • Those captured in: src=''
  • Those captured in: background=""
  • Those captured in: background=''
  • Those captured in: url("")
  • Those captured in: url('')
  • Those captured in: url()

So far I came up with:

preg_match_all("/src=((\"|'|)?(.*\.(png|gif|jpg))(\"|'|))/Ui", $strHTML, $arrMatches);

preg_match_all("/background=((\"|'|)?(.*\.(png|gif|jpg))(\"|'|))/Ui", $strHTML, $arrMatches);

preg_match_all("/url\((\"|'|)?((.*\.(png|gif|jpg))(\"|'|))\)/Ui", $strHTML, $arrMatches);

But those are incomplete in that the captured results don't include the prefix (src/background/url). Also, security-wise I think they can be improved further, to prevent somebody from entering src="http://somesite.com/someurl.exe?ext=jpg"

Any help in the right direction is appreciated.

edit:

I think I got it, although the code can surely be improved, possibly even combined and/or optimized :)

/* match CSS url() links */

preg_match_all("/(url\((\"|'|)(.*\.(png|gif|jpg|jpeg))(\"|'|)\))/Ui", $strHTML, $arrMatches);

Array
(
    [0] => Array
        (
            [0] => url('test1.gif')
            [1] => url(test2.gif)
            [2] => url("test3.gif")
        )

    [1] => Array
        (
            [0] => url('test1.gif')
            [1] => url(test2.gif)
            [2] => url("test3.gif")
        )

    [2] => Array
        (
            [0] => '
            [1] => 
            [2] => "
        )

    [3] => Array
        (
            [0] => test1.gif
            [1] => test2.gif
            [2] => test3.gif
        )

    [4] => Array
        (
            [0] => gif
            [1] => gif
            [2] => gif
        )

    [5] => Array
        (
            [0] => '
            [1] => 
            [2] => "
        )

)

/* match img links */
preg_match_all("/(src=(\"|'|)(.*\.(png|gif|jpg|jpeg))(\"|'|))/Ui", $strHTML, $arrMatches);

/* match background links */
preg_match_all("/(background=(\"|'|)(.*\.(png|gif|jpg|jpeg))(\"|'|))/Ui", $strHTML, $arrMatches);
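
The three patterns could probably be folded into one. This is only a rough sketch: the sample HTML is made up, and the character class assumes image paths contain no quotes, parentheses, whitespace or ">". Group 1 is the prefix, group 2 the (possibly empty) quote, and group 3 the image path.

```php
<?php
// Hypothetical sample input, not taken from a real page.
$strHTML = '<td background="bg.png"><img src=\'a.gif\'><div style="background:url(b.jpeg)"></div></td>';

// One pattern for src=, background= and url( ).
// \2 requires the closing quote to be the same as the opening one (or absent).
$pattern = '/(src=|background=|url\()(["\']?)([^"\'()\s>]+\.(?:png|gif|jpe?g))\2\)?/i';

preg_match_all($pattern, $strHTML, $arrMatches);

print_r($arrMatches[0]); // full matches, prefix included
print_r($arrMatches[3]); // image paths only
```

Because the extension has to sit directly before the closing quote, `someurl.exe?ext=jpg` is not matched, but a query string such as `?x=.jpg` would still slip through, so this is not a security check on its own.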
Alan Moore
Gilles
  • Can you clarify by posting your expected output? I'm not sure what you mean by "they don't include the prefix". Also, it's difficult to give advice on security with no context for how the code is used. But I can say that you should not rely on a regular expression to prevent malicious code from being injected into your application. – Casey Kinsey Feb 11 '12 at 09:09
  • possible duplicate of [Grabbing the href attribute of an A element](http://stackoverflow.com/questions/3820666/grabbing-the-href-attribute-of-an-a-element) – Gordon Feb 11 '12 at 09:22
  • possible duplicate of [Parse Inline CSS Values with Regex](http://stackoverflow.com/questions/4432334/parse-inline-css-values-with-regex) – Gordon Feb 11 '12 at 09:24
  • The reason I am asking is that I would like to find all images in the HTML using the above tags and replace them with "cid:". I've got that part covered already. I would like to replace `src="/relative/path/img.jpg"` but NOT `src="http://somesite/relative/path/img.jpg"` and all its variants. Therefore, in the expected result I would prefer an array which contains not only the URL (e.g. /relative/path/img.jpg) but also `src="/relative/path/img.jpg"` – Gilles Feb 11 '12 at 09:50
  • so I can replace that while leaving the HTTP URL alone. Since they share '/relative/path/img.jpg', I would otherwise end up with `src="http://somesite/cid:..."`, which obviously won't work. – Gilles Feb 11 '12 at 09:52

1 Answer

If you're sure about those names (src, url and background)...

$arr = array(
    'url("http://somesite.com/someurl.exe?src=jpg")',
    'url(http://somesite.com/someurl.exe?src=jpg)',
    'src="http://somesite.com/someurl.exe?src=jpg"',
    'src="http://somesite.com/someurl.exe?ext=jpg"',
    'background="http://somesite.com/someurl.exe?src=jpg"'
);
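foreach ($arr as $str) {
    // The lookbehind anchors the match right after src=, background= or url(.
    // The optional opening quote is captured as group 1; the lazy "image" group
    // stops at the first closing quote matching group 1 or at the first ")".
    preg_match_all('/(?<=src=|background=|url\()(\'|")?(?<image>.*?)(?=\1|\))/i', $str, $matches);
    echo $str;
    foreach ($matches['image'] as $img) {
        echo "\nimage: <b>$img</b>\n";
    }
    echo "\n";
}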
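
Going by the comments on the question, you also want the prefix kept in the match and absolute URLs left alone. Here is a hedged sketch of one way to do that with `preg_replace_callback` and `parse_url`. The helper name `isLocalImage`, the sample HTML, and the use of the file name as the Content-ID are all assumptions for illustration, so substitute whatever cid scheme you already have.

```php
<?php
// Hypothetical helper: true when the URL is relative and its *path*
// (not its query string) ends in an allowed image extension.
function isLocalImage(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || isset($parts['scheme']) || isset($parts['host'])) {
        return false; // malformed, absolute (http://...) or protocol-relative (//host/...)
    }
    $ext = strtolower(pathinfo($parts['path'] ?? '', PATHINFO_EXTENSION));
    return in_array($ext, ['png', 'gif', 'jpg', 'jpeg'], true);
}

// Made-up input covering the three cases discussed above.
$html = '<img src="/relative/path/img.jpg"> <img src="http://somesite/relative/path/img.jpg">'
      . ' <img src="/someurl.exe?ext=jpg">';

$result = preg_replace_callback(
    '/(src=|background=|url\()(["\']?)([^"\'()\s>]+)\2/i',
    function (array $m): string {
        // $m[1] = prefix, $m[2] = quote (may be empty), $m[3] = URL
        if (!isLocalImage($m[3])) {
            return $m[0]; // leave the original text untouched
        }
        // Assumption: the Content-ID is simply the file name.
        return $m[1] . $m[2] . 'cid:' . basename($m[3]) . $m[2];
    },
    $html
);

echo $result, "\n";
```

Since the whole `src="..."` token is matched and rebuilt, the absolute URL is never partially rewritten, and the extension is checked on the URL path only, so `someurl.exe?ext=jpg` is skipped. As the first comment says, this should still not be your only line of defence.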
inhan