12

I'm trying to translate the following slugify method from PHP to C#: http://snipplr.com/view/22741/slugify-a-string-in-php/

Edit: For the sake of convenience, here is the code from the link above:

/**
 * Modifies a string to remove all non-ASCII characters and spaces.
 */
static public function slugify($text)
{
    // replace non letter or digits by -
    $text = preg_replace('~[^\\pL\d]+~u', '-', $text);

    // trim
    $text = trim($text, '-');

    // transliterate
    if (function_exists('iconv'))
    {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }

    // lowercase
    $text = strtolower($text);

    // remove unwanted characters
    $text = preg_replace('~[^-\w]+~', '', $text);

    if (empty($text))
    {
        return 'n-a';
    }

    return $text;
}

I had no problem coding the rest, except that I cannot find the C# equivalent of the following line of PHP code:

$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);

Edit: The purpose of this is to transliterate non-ASCII characters, e.g. to turn Reformáció Genfi Emlékműve Előtt into reformacio-genfi-emlekmuve-elott

Trav L

3 Answers

14

I would also like to add that //TRANSLIT removes the apostrophes and that @jxac's solution doesn't address that. I'm not sure why, but encoding the string to Cyrillic bytes first and then decoding those bytes as ASCII gives behavior similar to //TRANSLIT.

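// Requires: using System.Text;
var str = "éåäöíØ";
// Encode to the "Cyrillic" code page, then decode those bytes as ASCII.
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));

// => "eaaoiO"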
Jonas Elfström
  • Thanks so much for this solution! I have been looking for a way to replace non-US-ASCII characters with an ASCII equivalent for an old mainframe system that can't handle these characters. – Annagram Oct 14 '10 at 17:58
  • This just removes accents and does not do actual transliteration. It will lose all non-accented letters in the process. – Egor Pavlikhin Feb 22 '12 at 12:14
  • 2
    I'm not sure what you mean by actual transliteration but it sure does not drop the non-accented letters. `Reformáció Genfi Emlékműve Előtt` => `Reformacio Genfi Emlekmuve Elott` – Jonas Elfström Feb 22 '12 at 12:21
  • 3
    "Привет" however becomes just an empty string. Which is what I said, it drops non-accented non-latin letters. In your example it only removed accents and the rest of the letters are already latin, so no transliteration takes place. – Egor Pavlikhin Apr 03 '12 at 07:22
  • Testing this in .NET 4.6.1, it seems to work well with Nordic characters `åäöæø`. Russian and Japanese/Chinese characters become question marks (?), and it keeps slashes, which you cannot use in a slug. So you would need to replace or remove question marks, slashes and other invalid URL characters, and the result will depend on which languages you need. – OZZIE Feb 21 '22 at 09:02
9

There is a .NET library for transliteration on CodePlex: unidecode. It generally does the trick, using Unidecode tables ported from Python.

CharlesB
ikutsin
1

conversion to string:

byte[] unicodeBytes = Encoding.Unicode.GetBytes(str);
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, unicodeBytes);
string asciiString = Encoding.ASCII.GetString(asciiBytes);

conversion to bytes:

byte[] ascii = Encoding.ASCII.GetBytes(str);

@Thomas Levesque is right, it will get encoded by the output stream...

to remove the diacritics (accent marks), you can use the String.Normalize function, as detailed here:

http://www.siao2.com/2007/05/14/2629747.aspx
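
A minimal sketch of the String.Normalize idea, in case the link goes away. The class and method names (`DiacriticsDemo`, `RemoveDiacritics`) are made up for illustration and are not taken from the linked post. It decomposes the string (form D) and drops the combining marks:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Hypothetical helper: split each character into base letter + combining
    // marks (FormD), then keep everything except the combining marks.
    public static string RemoveDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left so the result is in the usual form C.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // Prints: Reformacio Genfi Emlekmuve Elott
        Console.WriteLine(RemoveDiacritics("Reformáció Genfi Emlékműve Előtt"));
    }
}
```

Characters that have no decomposition, such as Ø, pass through unchanged with this approach.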

that should take care of most of the cases (where the glyph is really a character plus an accent mark). for an even more aggressive char matching (to take care of cases like the Scandinavian slashed o [Ø], digraphs, and other exotic glyphs), there's the table approach:

http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

this includes around 1,000 symbol mappings in addition to the normalization.
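
The table approach can be sketched roughly like this. The handful of entries below are only illustrative and are my own picks, not the table from the linked article, and `FoldingTableDemo` / `FoldWithTable` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FoldingTableDemo
{
    // Illustrative entries only (an assumption); a real table would be much larger.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'Ø', "O" }, { 'ø', "o" },
        { 'Æ', "AE" }, { 'æ', "ae" },
        { 'Ł', "L" }, { 'ł', "l" },
        { 'ß', "ss" },
    };

    // Replace every character that has a table entry; leave the rest alone.
    public static string FoldWithTable(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (Map.TryGetValue(c, out string replacement))
            {
                sb.Append(replacement);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Prints: Oresund aeble Strasse
        Console.WriteLine(FoldWithTable("Øresund æble Straße"));
    }
}
```

A lookup like this would typically run in addition to the normalization step, since it only covers the characters that do not decompose.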

(note, all punctuation is removed by the regex replace in your example)
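
Putting it together, a rough port of the PHP function could look like the sketch below. It assumes that stripping combining marks is an acceptable stand-in for iconv's //TRANSLIT, which is not an exact equivalent; `SlugDemo` and `Slugify` are just names chosen for the example:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class SlugDemo
{
    public static string Slugify(string text)
    {
        // Replace runs of anything that is not a letter or digit with '-', then trim.
        text = Regex.Replace(text, @"[^\p{L}\d]+", "-").Trim('-');

        // Stand-in for iconv //TRANSLIT (assumption): decompose, drop combining marks.
        text = new string(text.Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());

        // Lowercase.
        text = text.ToLowerInvariant();

        // Remove everything except ASCII letters, digits, '_' and '-'.
        text = Regex.Replace(text, @"[^a-z0-9_-]+", "");

        return text.Length == 0 ? "n-a" : text;
    }

    public static void Main()
    {
        // Prints: reformacio-genfi-emlekmuve-elott
        Console.WriteLine(Slugify("Reformáció Genfi Emlékműve Előtt"));
    }
}
```

Non-Latin input such as "Привет" is stripped entirely by the last regex and falls through to "n-a", so real transliteration of other scripts would still need a table or a library.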

Community
user262976