17

I am building a Swedish website, and the Swedish alphabet includes the letters å, ä, and ö.

I need to make a string entered by a user URL-safe with PHP.

Basically, I need to convert all characters to underscores, EXCEPT these:

 A-Z, a-z, 1-9

and all Swedish letters should be converted like this:

'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).

The rest should become underscores as I said.

I'm not good at regular expressions, so I would appreciate the help, guys!

Thanks

NOTE: NOT urlencode. I need to store it in a database, etc., so urlencode won't work for me.

George Claghorn

9 Answers

26

This function should be useful; it handles almost all cases by converting each accented character to its HTML entity name and keeping only the base letter.

function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
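
As a rough sketch, this is how the function could be combined with the underscore replacement the question asks for. The urlSafe() name, the html_entity_decode() step (so that a leftover entity such as &amp; does not leak its name into the result) and the inclusion of 0 in the allowed set are my own assumptions, not part of the function above. The source file is assumed to be saved as UTF-8.

```php
<?php
// Same technique as Unaccent() above: encode to HTML entities, keep the base letter.
function unaccent(string $string): string
{
    return preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i',
        '$1',
        htmlentities($string, ENT_COMPAT, 'UTF-8')
    );
}

// Hypothetical helper: everything outside A-Z, a-z, 0-9 becomes an underscore.
function urlSafe(string $string): string
{
    // Decode entities that unaccent() did not consume (e.g. &amp; back to &).
    $plain = html_entity_decode(unaccent($string), ENT_COMPAT, 'UTF-8');
    return preg_replace('/[^A-Za-z0-9]/', '_', $plain);
}

echo urlSafe('räksmörgås och köttbullar'), "\n"; // raksmorgas_och_kottbullar
```

Without the decode step, an input like 'a&b' would come out as 'a_amp_b' rather than 'a_b'.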
user1518659
23

Use iconv to convert strings from a given encoding to ASCII, then replace non-alphanumeric characters using preg_replace:

$input = 'räksmörgås och köttbullar'; // UTF8 encoded
$input = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
$input = preg_replace('/[^a-zA-Z0-9]/', '_', $input);
echo $input;

Result:

raksmorgas_och_kottbullar
Pär Wieslander
  • 1
    You should use "UTF-8" like this : `$data = iconv('UTF-8', 'ASCII//TRANSLIT', $data);` - otherwise you might encounter this notice: "Wrong charset, conversion from `UTF8' to `ASCII//TRANSLIT' is not allowed" – Hirnhamster Nov 05 '12 at 22:08
  • Please update your answer to include @Hirnhamster's suggestion. Your missing hyphen in 'UTF-8' is affecting other people. – Timo Mar 11 '14 at 09:35
13
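// decompose accented characters using PHP's *intl* extension
// (the default form, NFC, would leave the accent marks in place)
$data = normalizer_normalize($data, Normalizer::FORM_D);

// strip the combining marks that the decomposition separated from their letters
$data = preg_replace('/\p{Mn}+/u', '', $data);

// replace everything NOT in the sets you specified with an underscore
$data = preg_replace("#[^A-Za-z1-9]#","_", $data);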
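
If you can't be sure the intl extension is loaded, a guarded version might look like the sketch below. The stripAccents() name and the fallback map (which only covers the Swedish letters) are assumptions of this sketch, and I've allowed 0 as well as 1-9.

```php
<?php
// Hypothetical helper: remove accent marks, using intl when it is loaded.
function stripAccents(string $data): string
{
    if (function_exists('normalizer_normalize')) {
        // NFD splits "ä" into "a" plus a combining diaeresis (U+0308).
        $decomposed = normalizer_normalize($data, Normalizer::FORM_D);
        if ($decomposed !== false) {
            // \p{Mn} matches the combining marks that NFD separated out.
            return preg_replace('/\p{Mn}+/u', '', $decomposed);
        }
    }
    // Fallback (assumption of this sketch): map only the Swedish letters.
    return strtr($data, [
        'å' => 'a', 'ä' => 'a', 'ö' => 'o',
        'Å' => 'A', 'Ä' => 'A', 'Ö' => 'O',
    ]);
}

$data = preg_replace('#[^A-Za-z0-9]#', '_', stripAccents('räksmörgås och köttbullar'));
echo $data, "\n"; // raksmorgas_och_kottbullar
```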
Chris Forrence
Jeremy L
  • 7
    Please mention that `normalizer_normalize()` is part of the _intl_ PHP extension, which is not always active. This extension has been bundled with PHP since 5.3, but in most Linux distributions it isn't active by default. For instance, in Debian it is in the separate package _php5-intl_. If you can't install/activate it, try _ext/iconv_ instead. – Mytskine Aug 21 '11 at 01:11
  • 1
    @Mytskine I've added the comment. Thanks for pointing that out: it was on by default for me so I didn't give it a second thought. – Jeremy L Dec 22 '11 at 03:08
8

and all Swedish letters should be converted like this:

'å' to 'a' and 'ä' to 'a' and 'ö' to 'o' (just remove the dots above).

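Use normalizer_normalize() with Normalizer::FORM_D to split each letter from its diacritical mark, then strip the marks (Unicode category Mn) with preg_replace(). On its own, normalizer_normalize() does not remove anything.

The rest should become underscores as I said.

Use preg_replace() with a pattern of \W (in other words, any character that isn't a letter, digit or underscore) to replace them with underscores.

Final result should look like:

$data = preg_replace('/\p{Mn}/u', '', normalizer_normalize($data, Normalizer::FORM_D));
$data = preg_replace('/\W/', '_', $data);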
BalusC
5

If the intl PHP extension is enabled, you can use Transliterator like this:

protected function removeDiacritics($string)
{
    $transliterator = \Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC;');
    return $transliterator->transliterate($string);
}

To also convert other special characters that are not just letters with diacritics, such as 'æ':

protected function removeDiacritics($string)
{
    $transliterator = \Transliterator::createFromRules(
        ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;',
        \Transliterator::FORWARD
    );
    return $transliterator->transliterate($string);
}
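
The methods above are written as class members. As a standalone sketch (assuming the intl extension is loaded), the first variant could be combined with the underscore replacement from the question like this. The stripMarks() name and the error handling are my own additions, and 0 is allowed alongside 1-9.

```php
<?php
// Hypothetical standalone version of the first removeDiacritics() variant.
function stripMarks(string $string): string
{
    $transliterator = Transliterator::create('NFD; [:Nonspacing Mark:] Remove; NFC');
    if ($transliterator === null) {
        throw new RuntimeException('Could not create the transliterator');
    }
    $result = $transliterator->transliterate($string);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed');
    }
    return $result;
}

// Everything outside A-Z, a-z, 0-9 becomes an underscore.
$slug = preg_replace('/[^A-Za-z0-9]/', '_', stripMarks('räksmörgås och köttbullar'));
echo $slug, "\n"; // raksmorgas_och_kottbullar
```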
Rey0bs
4

If you're just interested in making things URL safe, then you want urlencode.

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.

If you really want to strip all non A-Z, a-z, 1-9 (what's wrong with 0, by the way?), then you want:

$mynewstring = preg_replace('/[^A-Za-z1-9]/', '', $str);
Dominic Rodger
  • 1
    If you want to make it safe, then you do want urlencode. The fact you want to store it in a database is beside the point (other than that you will want to escape it for your SQL insertion query in addition to making it URL-safe). – Quentin Nov 20 '09 at 13:03
  • You just don't understand. He wants it to be safe to use as a URL, but not THAT safe. He would prefer it fails on a space or ampersand. – JohnFx Nov 20 '09 at 13:41
2

It's as simple as:

 $str = str_replace(array('å', 'ä', 'ö'), array('a', 'a', 'o'), $str); 
 $str = preg_replace('/[^a-z0-9]+/', '_', strtolower($str));

assuming you use the same encoding for your data and your code.
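
One thing to watch: strtolower() does not reliably lowercase Å, Ä and Ö in a UTF-8 string, so with the two lines above the capital letters would end up as underscores. A small sketch that maps them explicitly (the swedishSlug() name is just for illustration, and the file is assumed to be saved as UTF-8):

```php
<?php
// Hypothetical helper: map the Swedish letters first, then collapse the rest.
function swedishSlug(string $str): string
{
    // The array form of strtr() replaces whole keys, so multi-byte letters are safe.
    $str = strtr($str, [
        'å' => 'a', 'ä' => 'a', 'ö' => 'o',
        'Å' => 'a', 'Ä' => 'a', 'Ö' => 'o',
    ]);
    // Each run of other characters becomes a single underscore.
    return preg_replace('/[^a-z0-9]+/', '_', strtolower($str));
}

echo swedishSlug('Öl & Räksmörgås'), "\n"; // ol_raksmorgas
```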

user187291
  • 1
    '/[^a-z0-9]+/i' or '/[^A-Za-z0-9]+/' to ignore case – Salman A Nov 20 '09 at 13:07
  • 1
    strtr is more convenient to "translate" sets of characters, like: $str = strtr($str,"aëïöü","aeiou"); it doesn't use arrays – danii Nov 20 '09 at 13:09
  • 2
    Arrays are cumbersome to maintain for the thousands of characters with diacritical marks known to the human world. Just use `normalizer`. – BalusC Nov 20 '09 at 13:13
1

One simple solution is to use the str_replace function with arrays of search and replacement letters.

Mihail Dimitrov
0

You don't need fancy regexps to filter the Swedish chars; just use the strtr function to "translate" them, like:

$your_URL = "www.mäåö.com";
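// use the array form: the two-string form works byte by byte and corrupts multi-byte UTF-8 letters
$good_URL = strtr($your_URL, array("ä" => "a", "å" => "a", "ö" => "o", "ë" => "e" /* etc... */));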
echo $good_URL;

->output: www.maao.com :)

danii
  • 1
    It is only a maintenance nightmare to cover the thousands of such characters known to the human world. – BalusC Nov 20 '09 at 13:26