How to replace umlaut characters or Unaccent in PHP?

Question

I have a name "GÃ¶ran" and I want to convert it to "Goran", which means I need to unaccent the word. But what I have tried doesn't seem to unaccent all the words.

This is the code I've used to unaccent:

private function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}

The cases where it is not working (incorrect matching), meaning it does not give the expected result shown on the right-hand side:

JÃƒÅ’rgen => Juergen
InÃƒÅ¡s => Ines

The cases where it is working (correct matching):

GÃ¶ran => Goran
JÃ¸rgen Ole => Jorgen
JÃ©rÃ´me => Jerome

What could be the reason? How can I fix it? Is there a better approach that handles all cases?

found this on the web, useful? http://snipplr.com/view/65596/unaccent-a-string/ — MarcDefiant, Oct 11 '12 at 06:22
Stop! JÃ©rÃ´me is not a name, it is badly formatted. Jérôme is correct. — nalply, Oct 11 '12 at 06:30

score 6 · Answer 1 · edited May 23 '17 at 11:48

This might be what you are looking for:

How to convert special characters to normal characters?

but use "utf-8" as the input encoding instead.

$text = iconv('utf-8', 'ascii//TRANSLIT', $text);

http://us2.php.net/manual/en/function.iconv.php
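
As a minimal sketch of how that call could be wrapped, assuming the input is correctly encoded UTF-8: the helper name `unaccentWithIconv` is hypothetical, and the replacement characters that `//TRANSLIT` produces depend on the iconv implementation and the current locale, so the output is not shown here.

```php
<?php
/**
 * Hypothetical helper (not part of the original answer): transliterate a
 * correctly encoded UTF-8 string to ASCII with iconv.
 * Returns false when iconv cannot convert the input.
 */
function unaccentWithIconv(string $text)
{
    // //TRANSLIT asks iconv to approximate characters that are missing in
    // the target charset. The exact replacements depend on the iconv
    // implementation and on the current locale (LC_CTYPE).
    return @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
}

// "G\xC3\xB6ran" is "Göran" written as explicit UTF-8 bytes, so the example
// does not depend on the encoding this file is saved in.
var_dump(unaccentWithIconv("G\xC3\xB6ran"));
```

This only helps once the names are stored as proper UTF-8; badly formatted input will be transliterated character by character and stay wrong.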

answered Oct 11 '12 at 06:21

ohmusama


This answer is not going to help the OP. – nalply Oct 11 '12 at 08:42

score 2 · Accepted Answer · edited May 23 '17 at 12:00

Short answer

You have two problems:

First, these names are not accented; they are badly formatted.

It seems that you had a UTF-8 file but were working with it as ISO-8859-1. For example, you told your editor to use ISO-8859-1 and copy-pasted the text into a text area in a browser using UTF-8, then saved the badly formatted names in the database. I have seen many such problems arising from copy-paste.

Once the names are correctly formatted, you can solve your second problem: unaccenting them. There is already a question covering this: How to convert special characters to normal characters?

Long answer (focuses on the badly formatted accented letters only)

Why did you get GÃ¶ran when you want Göran?

Let's begin with Unicode: the letter ö is LATIN SMALL LETTER O WITH DIAERESIS. Its code point is F6 hexadecimal, or 246 decimal. See this link to the Unicode database.

In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.

UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.
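
To make the byte-level difference concrete, here is a small sketch. It assumes the iconv extension is available and writes ö as explicit bytes so the result does not depend on how the script file itself is saved.

```php
<?php
// "ö" (U+00F6) as explicit bytes.
$utf8   = "\xC3\xB6";                           // UTF-8: two bytes
$latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);  // ISO-8859-1: one byte

printf("UTF-8:      %d byte(s), hex %s\n", strlen($utf8), bin2hex($utf8));
printf("ISO-8859-1: %d byte(s), hex %s\n", strlen($latin1), bin2hex($latin1));

// ASCII code points (0 to 127) are byte-identical in both encodings.
$ascii = 'Goran';
var_dump(iconv('UTF-8', 'ISO-8859-1', $ascii) === $ascii);
```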

So, what does UTF-8 do with code points from 128 upward? There is a staggered set of encodings that use more bytes as the code points get bigger. For code points up to 2047, two bytes suffice. They are encoded like this (see this bit schema):

xxx xxxx xxxx => 110xxxxx 10xxxxxx

Let's encode small letter o with diaeresis in UTF-8. The bits are 000 1111 0110, and they get encoded to 11000011 10110110. This is nice.
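
The same arithmetic can be sketched in PHP. The two-byte form carries 11 payload bits: five in the first byte and six in the second. The function name `encodeTwoByteUtf8` is made up for this illustration and covers only the two-byte range.

```php
<?php
/**
 * Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes.
 * Hypothetical helper for illustration only.
 */
function encodeTwoByteUtf8(int $codePoint): string
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('Code point needs a different byte count.');
    }
    $lead = 0xC0 | ($codePoint >> 6);    // 110xxxxx: the upper 5 bits
    $cont = 0x80 | ($codePoint & 0x3F);  // 10xxxxxx: the lower 6 bits
    return chr($lead) . chr($cont);
}

$bytes = encodeTwoByteUtf8(0xF6);        // U+00F6, small letter o with diaeresis
echo bin2hex($bytes), "\n";              // prints "c3b6"
printf("%08b %08b\n", ord($bytes[0]), ord($bytes[1])); // prints "11000011 10110110"
```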

However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is capital A with tilde, and B6 is the paragraph sign. Both characters are valid, and no software can detect this misunderstanding just by looking at the bits.
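
This misreading can be reproduced in a few lines. The sketch below assumes the iconv extension and simulates the mistake by declaring correct UTF-8 bytes to be ISO-8859-1 and converting them to UTF-8 again.

```php
<?php
// "Göran" as correct UTF-8, written as explicit bytes.
$correct = "G\xC3\xB6ran";

// Misreading: treat every byte as one ISO-8859-1 character and re-encode
// the result as UTF-8. iconv cannot notice the mistake because every byte
// value is a valid ISO-8859-1 character.
$mangledOnce = iconv('ISO-8859-1', 'UTF-8', $correct);

echo bin2hex($correct), "\n";      // prints "47c3b672616e"
echo bin2hex($mangledOnce), "\n";  // prints "47c383c2b672616e"
```

The two bytes of ö have become four: C3 83 for capital A with tilde and C2 B6 for the paragraph sign.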

It definitely takes people who know what names look like. GÃ¶ran is just not a name. There is an uppercase letter smack in the middle of the name, and the paragraph sign is not a letter at all. Sadly, this misunderstanding does not stop here. Because all characters are valid, they can be copy-pasted and re-rendered. In this process the misunderstanding can be repeated. Let's do this with Göran. We already misunderstood it once and got a badly formatted GÃ¶ran. Capital A with tilde and the paragraph sign each take two bytes in UTF-8 (!), and these are then interpreted as four bytes of gobbledygook, something like GÃƒÅ.ran.

Poor Jürgen! The umlaut ü got mistreated twice and we have JÃƒÅ’rgen.

We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from their customer. This happened to me once: I got mixed data in the same file, some of it well formatted and some of it badly formatted once, twice and even thrice. It's extremely frustrating.
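
One possible way to untangle such mixed data is sketched below. It is a heuristic under stated assumptions, not a definitive fix: `repairMojibake` is a hypothetical helper, it assumes the damage came from UTF-8 being read as ISO-8859-1, and it can misfire on text that legitimately contains such character pairs. The samples in the question include characters such as ƒ that exist in Windows-1252 but not in ISO-8859-1, so real data may need that charset instead.

```php
<?php
/**
 * Hypothetical helper: undo repeated "UTF-8 read as ISO-8859-1" damage.
 * Heuristic only; it can misfire on text that legitimately contains a
 * capital A with tilde followed by a paragraph sign, and similar pairs.
 */
function repairMojibake(string $text, int $maxRounds = 3): string
{
    for ($round = 0; $round < $maxRounds; $round++) {
        // Reverse one misreading: map each character back to a single byte.
        $candidate = @iconv('UTF-8', 'ISO-8859-1', $text);

        // Stop when the text has characters outside ISO-8859-1, when
        // nothing changed (pure ASCII), or when the bytes we got back are
        // not valid UTF-8 (the input was already correct).
        if ($candidate === false || $candidate === $text
            || preg_match('//u', $candidate) !== 1) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

$once  = iconv('ISO-8859-1', 'UTF-8', "G\xC3\xB6ran"); // mangled once
$twice = iconv('ISO-8859-1', 'UTF-8', $once);           // mangled twice

var_dump(repairMojibake($once)  === "G\xC3\xB6ran");    // bool(true)
var_dump(repairMojibake($twice) === "G\xC3\xB6ran");    // bool(true)
```

Because correctly encoded and plain ASCII names pass through unchanged, the same function can be run over a file that mixes all of these cases, but the results should still be checked by someone who knows what the names look like.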

the viewing type has no effect on the internal data PHP is dealing with. That is a browser issue. — ohmusama, Oct 11 '12 at 06:24
@ohmusama: No, that's not true. If you configure your editor with ISO-8859-1, then you get these badly formatted names. — nalply, Oct 11 '12 at 06:25
Actually the thing I am doing was that I am unaccenting the word and looking for the exact match with the word on the right hand side and I am not getting exact match for the words I mentioned in my question. @nalply — user1518659, Oct 11 '12 at 06:33
