4

I'm having trouble with (German) special characters in URIs and want to resolve it with a RegEx route and the PCRE pattern modifier for UTF-8, `u`.

'router' => array(
    'routes' => array(
        // ...
        'city' => array(
            'type'  => 'regex',
            'options' => array(
                'regex' => '/catalog/(?<city>[a-zA-Z0-9_-äöüÄÖÜß]*)\/u',
                'defaults' => array(
                    'controller' => 'Catalog\Controller\Catalog',
                    'action'     => 'list-sports',
                ),
                'spec'  => '/catalog/%city%',
            ),
            'may_terminate' => true,
        ),
    ),
),

But when I set it, the route stops working entirely (error 404), both for URIs with special characters and for URIs without them.

How do I set the modifier correctly?

Community
automatix
  • Actually, I think there's a problem with the regex router not allowing the modifier – Andrew Mar 27 '13 at 12:23
  • Didn't you say yesterday ([here](http://stackoverflow.com/a/15636028/2019043)) that I need to use the UTF-8 modifier in order to get URIs with special chars working? – automatix Mar 27 '13 at 12:36
  • Yes, as your regex wouldn't work otherwise, but I've just looked at the code inside the router and I think the modifier I supplied is not going to work with the regex router :( – Andrew Mar 27 '13 at 13:00
  • The problem as I see it is that the path evaluated by the Regex route is rawurlencoded, and isn't rawurldecoded until *after* it's evaluated against the regex, so it's never going to match your character list. In addition, you can't pass a modifier to be appended to the regex, so the result of your regex above will be `preg_match('(^/catalog/(?<city>[a-zA-Z0-9_-äöüÄÖÜß]*)\/u$)', $path, $matches);` which is not what you wanted. You may have to write your own regex route handler to solve this. – Crisp Mar 27 '13 at 13:15
  • Exactly. I've just debugged the RegEx Route (Zend\Mvc\Router\Http\Regex). It's the [line 118](https://github.com/zendframework/zf2/blob/master/library/Zend/Mvc/Router/Http/Regex.php#L118), where the matching takes place. The first argument passed to `preg_match(...)` is `'(^' . $this->regex . '$)'` and `$this->regex` is equal to the string the `regex` option is set to. – automatix Mar 27 '13 at 13:52
  • see if my answer below resolves it, it's working locally for me – Crisp Mar 27 '13 at 13:55
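
The wrapping described in the comments above can be reproduced outside the framework. This is a minimal sketch, assuming only that the route builds its pattern as `'(^' . $regex . '$)'`; the variable names and sample paths are made up for illustration.

```php
<?php
// The 'regex' option exactly as configured in the question.
$regex = '/catalog/(?<city>[a-zA-Z0-9_-äöüÄÖÜß]*)\/u';

// Assumption: the route wraps the option like this, with "(" and ")" as
// delimiters, so the trailing "\/u" becomes literal pattern text rather
// than a modifier.
$pattern = '(^' . $regex . '$)';

// A normal URI no longer matches, because the pattern now demands a
// literal "/u" at the end of the path.
$plain = preg_match($pattern, '/catalog/Berlin', $m1);

// Only a path that really ends in "/u" matches.
$withSuffix = preg_match($pattern, '/catalog/Berlin/u', $m2);

var_dump($plain);       // int(0)
var_dump($withSuffix);  // int(1)
var_dump($m2['city']);  // string(6) "Berlin"
```

This would explain why the route returns a 404 for every URI, not only for those containing umlauts.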

2 Answers

6

Since I already had this open, here's a handler that solves the problem.

<?php
namespace Application\Mvc\Router\Http;

use Zend\Mvc\Router\Http\Regex;
use Zend\Mvc\Router\Http\RouteMatch;
use Zend\Stdlib\RequestInterface as Request;

class UnicodeRegex extends Regex
{
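    /**
     * match(): defined by RouteInterface interface.
     *
     * Decodes the request path first, then matches it with the PCRE
     * UTF-8 modifier (u) enabled.
     *
     * @param  Request $request
     * @param  integer $pathOffset
     * @return RouteMatch|null
     */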
    public function match(Request $request, $pathOffset = null)
    {
        if (!method_exists($request, 'getUri')) {
            return null;
        }

        $uri  = $request->getUri();
        // path decoded before match
        $path = rawurldecode($uri->getPath());

        // regex with u modifier appended after the closing delimiter
        if ($pathOffset !== null) {
            // 0 = no flags; passing null here is deprecated in newer PHP versions
            $result = preg_match('(\G' . $this->regex . ')u', $path, $matches, 0, $pathOffset);
        } else {
            $result = preg_match('(^' . $this->regex . '$)u', $path, $matches);
        }

        if (!$result) {
            return null;
        }

        // length of the full match, measured on the decoded path
        $matchedLength = strlen($matches[0]);

        // keep only non-empty named captures; values are already decoded
        foreach ($matches as $key => $value) {
            if (is_numeric($key) || is_int($key) || $value === '') {
                unset($matches[$key]);
            }
        }

        return new RouteMatch(array_merge($this->defaults, $matches), $matchedLength);
    }
}

Assuming you place the file in Application/Mvc/Router/Http/UnicodeRegex, your route definition should look like this:

'router' => array(
    'routes' => array(
        // ...
        'city' => array(
            'type'  => 'Application\Mvc\Router\Http\UnicodeRegex',
            'options' => array(
                'regex' => '/catalog/(?<city>[\p{L}]+)',
                // or if you prefer, your original regex should work too
                // 'regex' => '/catalog/(?<city>[a-zA-Z0-9_-äöüÄÖÜß]*)',
                'defaults' => array(
                    'controller' => 'Catalog\Controller\Catalog',
                    'action'     => 'list-sports',
                ),
                'spec'  => '/catalog/%city%',
            ),
            'may_terminate' => true,
        ),
    ),
),
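
To see why both changes in the handler matter (decoding the path and adding the `u` modifier), here is a minimal standalone sketch that runs without ZF2. The function `matchCity()` and the sample paths are hypothetical; it only mimics the `(^…$)` wrapping used in the handler above.

```php
<?php
// Hypothetical helper, not part of ZF2: wraps the configured pattern the
// same way the handler does, optionally decoding the path and appending
// the u modifier.
function matchCity($rawPath, $decode, $unicode)
{
    $regex   = '/catalog/(?<city>[\p{L}]+)';
    $path    = $decode ? rawurldecode($rawPath) : $rawPath;
    $pattern = '(^' . $regex . '$)' . ($unicode ? 'u' : '');

    return preg_match($pattern, $path, $m) === 1 ? $m['city'] : null;
}

// "Nürnberg" as it arrives in a request URI (UTF-8, percent-encoded).
$raw = '/catalog/N%C3%BCrnberg';

// Encoded path, no modifier: "%" is not a letter, so nothing matches.
var_dump(matchCity($raw, false, false)); // NULL

// Decoded path with the u modifier: \p{L} matches "ü" as one letter.
var_dump(matchCity($raw, true, true)); // string(9) "Nürnberg"

// Pure ASCII names match either way.
var_dump(matchCity('/catalog/Berlin', false, false)); // string(6) "Berlin"
```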
Crisp
  • Having currently a very similar issue with the `Segment` route. As your solution worked so well, could you please take a look. Thank you in advance! – automatix Apr 19 '13 at 09:14
  • I'd like to make it possible to also use a % character. What do you advise adding to the regex? Simply adding % or \% before or after \p{L} doesn't work. – ivodvb Jan 28 '15 at 20:38
  • Hmm, I did not need the % char to match after all. I wanted to match a URL-encoded whitespace, but ZF2 lets the regex match against the urldecoded string. – ivodvb Jan 28 '15 at 20:48
  • I've filed this [as a bug in the ZF2 issue tracker](https://github.com/zendframework/zf2/issues/7335). Subscribing there might help get this fixed upstream. – Caleb Mar 17 '15 at 13:09
1

I think you can solve this the same way as many others who have had this problem. Take a look at one of those questions:

UTF-8 in * regular expressions

The answers there use escapes such as \s and \p{L} together with the u modifier. I hope that helps. Good luck!

Edit

See my own test:

<?php

    $toss_the_dice = utf8_decode ("etc/catalog/Nürnberg");
    preg_match ('/\/catalog\/([\\s\\p{L}]*)/m', $toss_the_dice, $dice);
    echo utf8_encode ($dice[1]);

// Now it prints
// Nürnberg

?>

Do you see how it works?

Edit 2

This version, which uses the u modifier directly, may suit you better:

<?php
    $toss_the_dice = "etc/catalog/Nürnberg";
    preg_match ('/\/catalog\/([\\s\\p{L}]*)/u', $toss_the_dice, $dice);
    echo $dice[1];

// Now it also prints
// Nürnberg

?>
Community
  • Thank you, but what I've asked about is: how to set it **in this concrete context of ZF2 RegEx Route settings** -- and not in general! I have set it like this: `'regex' => '/catalog/(?<city>[a-zA-Z0-9_-äöüÄÖÜß]*)\/u',`, it hasn't worked, so my question is: how do I add the UTF-8 modifier to my `'/catalog/(?<city>[a-zA-Z0-9_-äöüÄÖÜß]*)'` so that it works? – automatix Mar 27 '13 at 12:28
  • You're welcome, mate. I think it can be solved without anything Zend-specific. What I meant is that you should replace the literal "äöüÄÖÜß" with a proper character property. It all comes down to the RegEx pattern. Did you try applying **\p{L}** to it? – Seiji Manoan Seo Mar 27 '13 at 12:48
  • I've tried this out: `'regex' => '/catalog/(?<city>[\p{L}]*)',` It works for "Berlin", but doesn't work for "Nürnberg". – automatix Mar 27 '13 at 13:16