I have a string with HTML, and I target image URLs like this:

$regex = '#([a-z,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#i';

Works fine with:

https://example.com/image.jpg

But when a URL has a special character, like:

https://example.com/ストスト.jpg

It doesn't match.

How do I alter the regex so that it also matches URLs containing these special characters?

Henrik Petterson

3 Answers

In the character class you don't have to escape the , and the :. You also don't have to escape the / if you use a different delimiter like #.

You could shorten the pattern to

[\w,=/:.-]+\.(?:jpe?g|png|gif)

If you want to find the href from the anchors, I suggest using a parser instead.

The pattern including the u unicode flag:

$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';
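
As a rough sketch of how this pattern could be applied directly to a string of HTML, the snippet below runs it with preg_match_all(). The sample markup and the example.com URLs are assumptions made up for illustration; they are not taken from the question's real input.

```php
<?php
// Hypothetical sample input; the markup and URLs are illustrative only.
$html = '<img src="https://example.com/image.jpg"> '
      . '<img src="https://example.com/ストスト.jpg">';

// Pattern from the answer: i = case-insensitive, u = treat pattern and subject as UTF-8.
$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';

// preg_match_all() returns the number of full matches and stores them in $matches[0].
$count = preg_match_all($regex, $html, $matches);

foreach ($matches[0] as $url) {
    echo $url, PHP_EOL;
}
```

With the u flag, \w is no longer limited to ASCII, so both URLs should be picked up as whole matches.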

For example (using anchors ^ and $ to prevent getting partial matches)

$input = <<<HTML
<a href="https://e...content-available-to-author-only...e.com/example1.jpg">
<a href="https://e...content-available-to-author-only...e.com/ストスト.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.bak">
HTML;

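// Parse the HTML with DOM instead of running the regex over the raw markup.
$dom = new DOMDocument();
// Convert to HTML entities so loadHTML() keeps the UTF-8 characters intact
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2).
$dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', "UTF-8"));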

$anchors = $dom->getElementsByTagName("a");
$regex = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';

foreach ($anchors as $anchor) {
    $res = $anchor->getAttribute("href");
    if (preg_match($regex, $res)) {
        echo "Valid url: $res" . PHP_EOL;
    } else {
        echo "Invalid url: $res" . PHP_EOL;
    }
}

Output

Valid url: https://e...content-available-to-author-only...e.com/example1.jpg
Valid url: https://e...content-available-to-author-only...e.com/ストスト.jpg
Valid url: https://e...content-available-to-author-only...e.com/example3.jpg
Invalid url: https://e...content-available-to-author-only...e.com/example3.bak

The fourth bird

You could add the Unicode flag to the regex and list those characters in the character class, like this:

$regex = '#([a-zストスト,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu';

Notice the u at the end of the regex (it enables Unicode/UTF-8 mode).

You can also add a range covering your alphabet, if one is supported,

e.g. like this: ス-ト

Another approach would be to add the complete alphabet to the character class, right after a-z.
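
To illustrate the range idea, here is a minimal sketch that assumes the whole Katakana block (U+30A0 to U+30FF) is wanted and writes it with PCRE's \x{...} escapes; the choice of that block and the sample URLs are assumptions, not something stated in the question.

```php
<?php
// Assumed range: the Katakana block U+30A0..U+30FF. Code points above 255
// in \x{...} only work when the u flag is set.
$regex = '#[a-z0-9\x{30A0}-\x{30FF},:=_/.-]*\.(?:jpe?g|png|gif)#iu';

// Katakana file name: the whole URL is matched.
preg_match($regex, 'https://example.com/ストスト.jpg', $katakana);
echo $katakana[0], PHP_EOL;

// Cyrillic file name: those letters are outside the range, so only the
// trailing ".png" is matched, not the full URL.
preg_match($regex, 'https://example.com/фото.png', $cyrillic);
echo $cyrillic[0], PHP_EOL;
```

The second case shows the limit of this approach: every script needs its own range, which is why a generic class is usually easier to maintain.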

Hope it helps!

EDIT:

Based on your comment about matching any foreign character, the best approach I can think of is to use \w, which matches any word character, and to add the u flag at the end of your regex. With the u flag, \w also matches non-ASCII letters and digits.

This means that it could be $regex = '#([\w,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu';

With this regex, your 2 examples work fine. Waiting for your response :)
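
A small sketch of why the u flag matters here: the same \w-based pattern is run with and without it. The pattern is a simplified, anchored variant written for this illustration, and the URLs are assumed sample values.

```php
<?php
$ascii   = 'https://example.com/image.jpg';
$unicode = 'https://example.com/ストスト.jpg';

// Same pattern twice; only the u flag differs. Anchors force a full match.
$withoutU = '#^[\w,:=/.-]+\.(?:jpe?g|png|gif)$#i';
$withU    = '#^[\w,:=/.-]+\.(?:jpe?g|png|gif)$#iu';

// The ASCII URL matches either way.
$asciiWithoutU = preg_match($withoutU, $ascii);
$asciiWithU    = preg_match($withU, $ascii);

// With u, \w covers Unicode letters, so the Japanese file name matches.
// Without u, the subject is read byte by byte, and the UTF-8 bytes are
// normally not word characters, so a full match is not expected.
$unicodeWithoutU = preg_match($withoutU, $unicode);
$unicodeWithU    = preg_match($withU, $unicode);

var_dump($asciiWithoutU, $asciiWithU, $unicodeWithoutU, $unicodeWithU);
```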

Alcaeus D
  • But my target is not specifically these characters `ストスト` but all foreign language characters. Can you please edit your answer on how I would do that using my regex code as reference? – Henrik Petterson Feb 27 '20 at 15:42
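'#([\p{L},:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu'

\p{L} matches any letter in any script, so this works for all foreign-language characters. Note the u flag: without it the subject is not read as UTF-8, and multibyte characters will not match. Hope this helps.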
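
A minimal sketch of the \p{L} approach, assuming the u flag is set so that the subject is read as UTF-8. The three URLs are made-up samples in different scripts.

```php
<?php
// \p{L} = any Unicode letter. The u flag makes PCRE decode the subject as UTF-8.
$regex = '#([\p{L},:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu';

// Hypothetical sample URLs (Katakana, Cyrillic, Han).
$urls = [
    'https://example.com/ストスト.jpg',
    'https://example.com/фото.png',
    'https://example.com/图片.gif',
];

$results = [];
foreach ($urls as $url) {
    // $m[0] holds the full match, $m[2] the captured extension.
    if (preg_match($regex, $url, $m)) {
        $results[] = $m[0];
        echo "Matched: {$m[0]} (extension: {$m[2]})", PHP_EOL;
    }
}
```

Since \p{L} covers letters only, digits and punctuation still have to be listed explicitly in the class, as the pattern above does.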