Target URLs with special characters

Question

I have a string with HTML, and I target image URLs like this:

$regex = '#([a-z,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#i';

Works fine with:

https://example.com/image.jpg

But when a URL has a special character, like:

https://example.com/ストスト.jpg

It doesn't match. See test!

How do I alter the regex so it matches URLs that have these special characters?

There is no need to escape meta characters inside a character class: https://stackoverflow.com/questions/19976018/does-a-dot-have-to-be-escaped-in-a-character-class-square-brackets-of-a-regula — nice_dev, Feb 27 '20 at 15:26
You need to look for everything, including Unicode characters, something like https://regex101.com/r/wdabX7/1 — waterloomatt, Feb 27 '20 at 15:27
@waterloomatt Can you post an answer using my regex code as a base? — Henrik Petterson, Feb 27 '20 at 15:28
@njank That only matches the file name and not the whole URL. — Henrik Petterson, Feb 27 '20 at 15:37
Since you tagged the question with PHP you might want to try checking the URL with an [endsWith()](https://stackoverflow.com/questions/834303/startswith-and-endswith-functions-in-php) function. — Cray, Feb 27 '20 at 16:05
Use the `u` flag and use `$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';` — The fourth bird, Feb 27 '20 at 17:08
Does this answer your question? [PHP regex to accept Japanese and english languages](https://stackoverflow.com/questions/50586145/php-regex-to-accept-japanese-and-english-languages) — Mike Doe, Feb 27 '20 at 17:46

The fourth bird · Accepted Answer · 2020-02-27T17:44:59.480

In the character class you don't have to escape the , and the :. You also don't have to escape the / if you use a different delimiter like #.

You could shorten the pattern to

[\w,=/:.-]+\.(?:jpe?g|png|gif)
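As a quick sanity check, the sketch below compares the character class from the question with the shortened one over every ASCII character. The variable names are mine, and the comparison only covers ASCII input with the i flag, so treat it as an illustration rather than a proof.

```php
<?php
// Character class from the question, including the redundant escapes.
$original  = '#^[a-z,:=\-_0-9\/\:\.]$#i';
// Shortened class: \w covers a-z, 0-9 and _; the hyphen is last, so it is literal.
$shortened = '#^[\w,=/:.-]$#i';

// Collect every ASCII code for which the two classes disagree.
$differences = [];
for ($code = 0; $code < 128; $code++) {
    $char = chr($code);
    if (preg_match($original, $char) !== preg_match($shortened, $char)) {
        $differences[] = $code;
    }
}

echo count($differences) === 0
    ? "Both classes accept the same ASCII characters" . PHP_EOL
    : "Classes differ for codes: " . implode(', ', $differences) . PHP_EOL;
```

An empty $differences array means the unescaped version is a drop-in replacement for ASCII input.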


If you want to find the href from the anchors, I suggest using a parser instead.

The pattern including the u unicode flag:

$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';
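To see what the u modifier changes, here is a small sketch that runs the same pattern with and without it. The $html string and the variable names are made up for illustration, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical HTML fragment with one ASCII and one non-ASCII image URL.
$html = '<img src="https://example.com/image.jpg"> '
      . '<img src="https://example.com/ストスト.jpg">';

$withoutU = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#i';
$withU    = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';

// Collect all matches for each variant of the pattern.
preg_match_all($withoutU, $html, $bytewise);
preg_match_all($withU, $html, $unicode);

echo "Without u: " . implode(', ', $bytewise[0]) . PHP_EOL;
echo "With u:    " . implode(', ', $unicode[0]) . PHP_EOL;
```

Without the modifier the subject is handled byte by byte and \w only covers ASCII word characters, so the second URL is skipped. With it, the subject is read as UTF-8 and \w also matches the Japanese letters.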

For example, using the anchors ^ and $ to prevent partial matches:

$input = <<<HTML
<a href="https://e...content-available-to-author-only...e.com/example1.jpg">
<a href="https://e...content-available-to-author-only...e.com/ストスト.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.bak">
HTML;

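// Parse the HTML with a DOM parser instead of matching the anchors with a regex.
$dom = new DOMDocument();
// Convert non-ASCII characters to entities so loadHTML() reads them correctly.
// Note: the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2.
$dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', "UTF-8"));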

$anchors = $dom->getElementsByTagName("a");
$regex = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';

foreach ($anchors as $anchor) {
    $res = $anchor->getAttribute("href");
    if (preg_match($regex, $res)) {
        echo "Valid url: $res" . PHP_EOL;
    } else {
        echo "Invalid url: $res" . PHP_EOL;
    }
}

Output

Valid url: https://e...content-available-to-author-only...e.com/example1.jpg
Valid url: https://e...content-available-to-author-only...e.com/ストスト.jpg
Valid url: https://e...content-available-to-author-only...e.com/example3.jpg
Invalid url: https://e...content-available-to-author-only...e.com/example3.bak
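The effect of the anchors can be sketched with a made-up URL that merely contains an image extension. Both the URL and the variable names below are hypothetical.

```php
<?php
$unanchored = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';
$anchored   = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';

// Hypothetical URL: it contains ".jpg" but actually ends in ".bak".
$url = 'https://example.com/photo.jpg.bak';

// Without anchors the pattern may stop at the first usable extension.
$partial = preg_match($unanchored, $url, $m) === 1 ? $m[0] : null;
// With anchors the whole string has to fit the pattern.
$full = preg_match($anchored, $url) === 1;

echo "Unanchored match: " . var_export($partial, true) . PHP_EOL;
echo "Anchored match:   " . var_export($full, true) . PHP_EOL;
```

The unanchored pattern reports the leading part of the URL up to .jpg, while the anchored one rejects the string. Anchoring only makes sense when each URL is tested on its own, as in the loop above, not when searching inside a larger HTML string.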

Alcaeus D · Answer 2 · 2020-02-27T16:11:47.370

You could try adding the Unicode flag to the regex and see whether those characters are matched, like this:

$regex = '#([a-zストスト,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu';

Notice the u at the end of the regex (it enables Unicode matching).

Obviously, you can add the range of your alphabet if it's supported, e.g. like this: ス-ト

Another approach could be to add the complete alphabet to the regex, right after your a-z range. Check this answer also.

Hope it helps!

EDIT:

Based on your comment, which refers to any foreign character, the best thing I can think of is to use \w, which matches every word character, and add the u flag at the end of your regex.

This means that it could be $regex = '#([\w,:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu';

With this regex, your 2 examples work fine. Waiting for your response :)

But my target is not specifically these characters `ストスト` but all foreign language characters. Can you please edit your answer on how I would do that using my regex code as reference? — Henrik Petterson, Feb 27 '20 at 15:42

amamou nesrine · Answer 3 · Feb 27 '20 at 15:49

'#([\p{L},:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#i'

This works for all foreign language characters. Hope this helps.

Doesn't seem to work, please [see this](https://ideone.com/U1MH2A)... – Henrik Petterson Feb 27 '20 at 15:56
$domain is Undefined – amamou nesrine Feb 27 '20 at 16:09
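
One likely reason the \p{L} pattern above did not work in the test is the missing u modifier: without it, PHP passes the subject to PCRE byte by byte, so a multi-byte UTF-8 character is not seen as a single letter. A minimal sketch of the difference, assuming the source file is saved as UTF-8 and using the sample URL from the question (the variable names are mine):

```php
<?php
$url = 'https://example.com/ストスト.jpg';

// Pattern from the answer, and the same pattern with the u modifier added.
$withoutU = '#([\p{L},:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#i';
$withU    = '#([\p{L},:=\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))#iu';

preg_match($withoutU, $url, $bytes);
preg_match($withU, $url, $chars);

// Element 0 holds the full match, if there was one.
$matchWithoutU = $bytes[0] ?? null;
$matchWithU    = $chars[0] ?? null;

echo "Without u: " . var_export($matchWithoutU, true) . PHP_EOL;
echo "With u:    " . var_export($matchWithU, true) . PHP_EOL;
```

With the u modifier the whole URL is returned; without it the match no longer covers the whole URL.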
