
I have a name "Göran" and I want it to be converted to "Goran", which means I need to unaccent the word. But what I have tried doesn't unaccent all the words.

This is the code I've used to unaccent:

private function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}

The cases where it is not working (incorrect matching), i.e. where it does not give the expected result on the right-hand side:

JÃŒrgen => Juergen
InÚs => Ines

The cases where it is working (correct matching):

Göran => Goran
Jørgen Ole => Jorgen
Jérôme => Jerome

What could be the reason? How can I fix it? Do you have a better approach that handles all cases?

Martijn Pieters
user1518659

2 Answers


This might be what you are looking for

How to convert special characters to normal characters?

but use "utf-8" instead.

$text = iconv('utf-8', 'ascii//TRANSLIT', $text);

http://us2.php.net/manual/en/function.iconv.php
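A minimal sketch of how the iconv call above could be wrapped, assuming PHP 7 or later. The helper name `unaccentWithIconv` is hypothetical. The result of `//TRANSLIT` depends on the iconv implementation and on the `LC_CTYPE` locale (depending on the setup, ö may come out as "o", "oe" or "?"), so the locale names used here are assumptions about what is installed on the machine.

```php
<?php
// Hypothetical helper, not part of the original answer.
function unaccentWithIconv(string $text): string
{
    // Remember the current LC_CTYPE locale so it can be restored afterwards.
    $previous = setlocale(LC_CTYPE, '0');

    // Assumption: at least one of these UTF-8 locales exists on the system.
    // If none does, setlocale() returns false and the locale stays unchanged.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

    // iconv() returns false when it cannot convert the input.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    // Fall back to the untouched input rather than returning false.
    return $ascii === false ? $text : $ascii;
}

// "\u{00F6}" is ö; written as an escape so the source file encoding does not matter.
echo unaccentWithIconv("G\u{00F6}ran"), PHP_EOL;
```

This only helps once the input really is valid UTF-8 text; a badly formatted name goes in and comes out just as wrong.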

Community
ohmusama

Short answer

You have two problems:

First, these names are not accented. They are badly formatted.

It seems that you had a UTF-8 file but were working with it as ISO-8859-1. This happens, for example, if you tell your editor to use ISO-8859-1 and then copy-paste the text into a text area in a browser that uses UTF-8. Then you saved the badly formatted names in the database. I have seen many such problems arising from copy-paste.

Once the names are correctly formatted, you can solve your second problem: unaccenting them. There is already a question treating this: How to convert special characters to normal characters?

Long answer (focuses on the badly formatted accented letters only)

Why did you get GÃ¶ran when you wanted Göran?

Let's begin with Unicode: the letter ö is LATIN SMALL LETTER O WITH DIAERESIS in Unicode. Its code point is F6 hexadecimal or, equivalently, 246 decimal. See this link to the Unicode database.

In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.

UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.

So, what does UTF-8 do with code points from 128 upward? There is a staggered set of encodings that use more bytes as the code points get bigger. For code points up to 2047 (11 bits), two bytes suffice. They are encoded like this: (see this bit schema)

xxx xxxx xxxx => 110xxxxx 10xxxxxx

Let's encode the small letter o with diaeresis in UTF-8. Its bits are 000 1111 0110, which get encoded as 11000011 10110110. This is nice.
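The two-byte scheme can be checked by hand in PHP. This is only an illustrative sketch: the function name `utf8EncodeTwoBytes` is made up for this example, and it covers only the two-byte range (code points 128 to 2047), not UTF-8 in general.

```php
<?php
// Hypothetical helper: encode one code point from the two-byte range as UTF-8.
function utf8EncodeTwoBytes(int $codePoint): string
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('Only code points 128..2047 take two bytes.');
    }

    // Lead byte: marker bits 110, then the upper 5 bits of the code point.
    $lead = 0xC0 | ($codePoint >> 6);

    // Continuation byte: marker bits 10, then the lower 6 bits.
    $continuation = 0x80 | ($codePoint & 0x3F);

    return chr($lead) . chr($continuation);
}

// ö is code point F6 hex.
echo bin2hex(utf8EncodeTwoBytes(0xF6)), PHP_EOL; // c3b6
```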

However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is capital A with tilde (Ã), and B6 is the paragraph sign (¶). Both characters are valid, and no software can detect this misunderstanding by just looking at the bits.

It definitely takes a person who knows what names look like. GÃ¶ran is just not a name: there is an uppercase letter smack in the middle of it, and the paragraph sign is not a letter at all. Sadly, this misunderstanding does not stop here. Because all characters are valid, they can be copy-pasted and re-rendered, and in this process the misunderstanding can be repeated. Let's do this with Göran. We already misunderstood it once and got a badly formatted GÃ¶ran. The capital A with tilde and the paragraph sign each (!) take two bytes in UTF-8, and those four bytes are then interpreted as four characters of gobbledygook, something like GÃƒÂ¶ran.
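The repeated misunderstanding can be reproduced in a few lines. In this sketch the function name `misreadAsLatin1` is hypothetical, and it assumes the wrong encoding was exactly ISO-8859-1 (a Windows-1252 or ISO-8859-15 mix-up gives slightly different garbage). Hex dumps are printed instead of the strings themselves so the result does not depend on the terminal encoding.

```php
<?php
// Hypothetical helper: treat the given bytes as ISO-8859-1 and re-encode them
// as UTF-8. This is one round of the misunderstanding described above.
function misreadAsLatin1(string $bytes): string
{
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}

$good  = "G\xC3\xB6ran";           // "Göran" as correct UTF-8 bytes
$once  = misreadAsLatin1($good);   // misunderstood once
$twice = misreadAsLatin1($once);   // misunderstood twice

echo bin2hex($good), PHP_EOL;  // 47c3b672616e
echo bin2hex($once), PHP_EOL;  // 47c383c2b672616e
echo bin2hex($twice), PHP_EOL; // 47c383c283c382c2b672616e
```

Each round doubles the number of bytes used for the damaged letter: two, then four, then eight.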

Poor Jürgen! The umlaut ü got mistreated twice and we have JÃŒrgen.

We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from his customer. This happened to me once: I got mixed data: well formatted, badly formatted once, twice and thrice in the same file. It's extremely frustrating.
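For mixed data like this, one possible repair is to undo the misunderstanding round by round until undoing it would no longer give valid UTF-8. The following is a heuristic sketch, not a guaranteed fix: the name `repairMojibake` and the limit of three rounds are assumptions, it only reverses ISO-8859-1 mix-ups, and it would also "repair" a correctly formatted string that genuinely contains a sequence such as Ã¶.

```php
<?php
// Hypothetical helper: reverse up to $maxRounds rounds of
// "UTF-8 bytes read as ISO-8859-1 and saved as UTF-8 again".
function repairMojibake(string $text, int $maxRounds = 3): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        // Pure ASCII cannot be damaged this way.
        if (preg_match('/[\x80-\xFF]/', $text) !== 1) {
            break;
        }

        // Undo one round: map every character back to a single byte.
        // Fails (false) if a character has no ISO-8859-1 equivalent.
        $candidate = @iconv('UTF-8', 'ISO-8859-1', $text);
        if ($candidate === false) {
            break;
        }

        // If the result is not valid UTF-8, $text was already correct.
        if (preg_match('//u', $candidate) !== 1) {
            break;
        }

        $text = $candidate;
    }

    return $text;
}

// Once and twice badly formatted "Göran" both come back as 47c3b672616e.
echo bin2hex(repairMojibake("G\xC3\x83\xC2\xB6ran")), PHP_EOL;
echo bin2hex(repairMojibake("G\xC3\x83\xC2\x83\xC3\x82\xC2\xB6ran")), PHP_EOL;
```

Only after such a repair does it make sense to unaccent the names. Checking the repaired values by eye is still advisable, since no software can tell a damaged name from an unusual one with certainty.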

Community
nalply
  • Help me on how to fix the problem and unaccent. @nalply – user1518659 Oct 11 '12 at 06:23
  • the viewing type has no effect on the internal data PHP is dealing with. That is a browser issue. – ohmusama Oct 11 '12 at 06:24
  • @ohmusama: No, that's not true. If you configure your editor with ISO-8859-1, then you get these badly formatted names. – nalply Oct 11 '12 at 06:25
  • Actually the thing I am doing was that I am unaccenting the word and looking for the exact match with the word on the right hand side and I am not getting exact match for the words I mentioned in my question. @nalply – user1518659 Oct 11 '12 at 06:33