"â€™" showing on page instead of " ' "

Question

â€™ is showing on my page instead of '.

I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

In addition, my browser is set to Unicode (UTF-8):

So what's the problem, and how can I fix it?

See "Mojibake" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored — Rick James, Feb 11 '20 at 06:12

BalusC · Answer 1 · 2023-04-08T12:19:28.337

So what's the problem,

It's a ’ (RIGHT SINGLE QUOTATION MARK - U+2019) character which is being decoded as CP-1252 instead of UTF-8. If you check the Encodings table of this character at FileFormat.Info, then you see that this character is in UTF-8 composed of bytes 0xE2, 0x80 and 0x99.

And if you check the CP-1252 code page layout at Wikipedia, then you'll see that the hex bytes E2, 80 and 99 stand for the individual characters â, € and ™.
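The byte-level explanation above can be reproduced in a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example and are not from the question.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK
quote = "\u2019"

# UTF-8 encodes it as three bytes.
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 99

# Decoding those same bytes as CP-1252 yields three separate characters.
misread = utf8_bytes.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00E2', 'U+20AC', 'U+2122']
```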

and how can I fix it?

Use UTF-8 instead of CP-1252 to read, write, store, and display the characters.

I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

This only instructs the client which encoding to use to interpret and display the characters. It doesn't tell your own program which encoding to use to read, write, store, and display them. The exact answer depends on the server-side platform, database, and programming language used. Do note that the encoding set in the HTTP response header takes precedence over the HTML meta tag. The HTML meta tag is then only used when the page is opened from the local disk file system via a file:// URL instead of from the web via an http(s):// URL.

In addition, my browser is set to Unicode (UTF-8):

This only forces the client to use that encoding to interpret and display the characters. But the actual problem is that you're already sending the exact characters â€™ (encoded in UTF-8) to the client instead of the character ’. The client is basically correctly displaying â€™ using the UTF-8 encoding. If the client were misinstructed to use, for example, ISO-8859-1 to display them, then you would likely have seen Ã¢â¬â¢ instead.

I am using ASP.NET 2.0 with a database.

This is most likely where your problem lies. You need to verify with an independent database tool what the data looks like.

If the ’ character is correctly there, then you are most likely not correctly connecting to the database from your program. You basically need to reconfigure the database connector to use UTF-8. How to do that depends on the database being used.

Or if your database already contains â€™, then it's your database that's messed up. Most probably the tables aren't configured to use UTF-8. Instead, they use the database's default encoding, which varies depending on the configuration. If this is your issue, then usually just altering the table to use UTF-8 is sufficient. If your database doesn't support that, you'll need to recreate the tables. It is good practice to set the encoding of the table when you create it.

You're most likely using SQL Server, but here is some MySQL code (copied from this article):

CREATE DATABASE db_name CHARACTER SET utf8;
CREATE TABLE tbl_name (...) CHARACTER SET utf8;

If your table is already UTF-8, however, then you need to take a step back. Who or what put the data there? That's where the problem is. One example would be HTML form submitted values which are incorrectly encoded/decoded.
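As a rough illustration of how a form value can be decoded incorrectly on the server, the sketch below uses Python's standard `urllib.parse.parse_qs`. The field name `comment` and the sample value are assumptions made up for the example; the real fix depends on the framework that parses the request.

```python
from urllib.parse import parse_qs

# A form field containing U+2019, percent-encoded from its UTF-8 bytes.
body = "comment=It%E2%80%99s"

# Decoding with the charset the browser actually used gives the right text.
good = parse_qs(body, encoding="utf-8")["comment"][0]

# Decoding the same bytes as CP-1252 produces the mojibake from the question.
bad = parse_qs(body, encoding="cp1252")["comment"][0]

print(good == "It\u2019s")             # True
print(bad == "It\u00e2\u20ac\u2122s")  # True
```

If the `bad` value is then stored in a UTF-8 table, the database is working correctly and still ends up holding â€™.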

Here are some more links to learn more about the problem:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), from our own Joel.
Unicode - How to get the characters right?, with more concise and practical information, solutions are targeted on Java environments.
How to setup your PHP site to use UTF8, targeted on PHP environments.

If you have broken content like this saved somewhere eg in a mysql database, http://stackoverflow.com/a/9407998/117647 has the trick you need to convert the characters to utf-8 — Steve, Jun 01 '16 at 08:18
TL;DR; **Use UTF-8 to read, write, store, and display the characters.** — c0degeas, Nov 22 '18 at 10:24
Note that the iso-8859-1 and Windows-1252 tables overlap, so some "strange characters combinations" are common to both (e.g. "Ã©" for "é"). — Skippy le Grand Gourou, Feb 19 '19 at 20:20
Our third party sends requests to our webservice (email content) with headers that claims the information is UTF-8, but I'm finding characters like these. Is there a way to resolve this issue? — Tristen Woodruff, Jul 13 '23 at 18:47

kennytm · Accepted Answer · 2014-02-13T09:22:31.283

Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.

Or use the HTML entity &rsquo; (’).

answered Mar 19 '10 at 13:06

No, it is not solved. There's still an inconsistency in character encoding in your application. You will re-encounter the same problem in the future for other non-CP1252 characters. And there's quite a lot of them ... – BalusC Mar 19 '10 at 13:51
Examples of characters that you'll continue to encounter: http://www.i18nqa.com/debug/utf8-debug.html – Zoot Jan 28 '14 at 16:38

Remy Lebeau · Answer 3 · 2015-06-19T00:07:50.453

’ (Unicode codepoint U+2019 RIGHT SINGLE QUOTATION MARK) is encoded in UTF-8 as bytes:

0xE2 0x80 0x99.

â€™ (Unicode codepoints U+00E2 U+20AC U+2122) is encoded in UTF-8 as bytes:

0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.

These are the bytes your browser is actually receiving in order to produce â€™ when processed as UTF-8.

That means that your source data is going through two charset conversions before being sent to the browser:

1. The source ’ character (U+2019) is first encoded as UTF-8 bytes:

   0xE2 0x80 0x99

2. Those individual bytes are then misinterpreted and decoded to Unicode codepoints U+00E2 U+20AC U+2122 by one of the Windows-125X charsets (1252, 1254, 1256, and 1258 all map 0xE2 0x80 0x99 to U+00E2 U+20AC U+2122), and then those codepoints are encoded as UTF-8 bytes:

   0xE2 -> U+00E2 -> 0xC3 0xA2
   0x80 -> U+20AC -> 0xE2 0x82 0xAC
   0x99 -> U+2122 -> 0xE2 0x84 0xA2

You need to find where the extra conversion in step 2 is being performed and remove it.
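The two conversions can be replayed in Python to confirm the byte sequences listed above. This is a sketch for verification only; it uses Python's `cp1252` codec as a stand-in for whichever Windows-125X charset is actually involved, which is an assumption.

```python
original = "\u2019"  # the character the author typed

# Step 1: encode as UTF-8.
step1 = original.encode("utf-8")
assert step1 == b"\xe2\x80\x99"

# Step 2 (the bug): misread those bytes as Windows-1252, then encode to UTF-8 again.
step2 = step1.decode("cp1252").encode("utf-8")
assert step2 == b"\xc3\xa2\xe2\x82\xac\xe2\x84\xa2"

# A browser decoding step2 as UTF-8 shows the three-character mojibake.
shown = step2.decode("utf-8")

# Undoing step 2 recovers the original character.
recovered = shown.encode("cp1252").decode("utf-8")
assert recovered == original
```

The reversal in the last lines only repairs data that is already damaged. The real fix is still to remove the extra conversion at its source.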

score 18 · Answer 4 · edited Feb 24 '15 at 02:55

I have some documents where … was showing as â€¦ and ê was showing as Ãª. This is how it got there (python code):

# Adam edits original file using windows-1252
windows = b'\x85\xea'  # bytes literal, so .decode() also works on Python 3
# that is HORIZONTAL ELLIPSIS, LATIN SMALL LETTER E WITH CIRCUMFLEX

# Beth reads it correctly as windows-1252 and writes it as utf-8
utf8 = windows.decode("windows-1252").encode("utf-8")
print(utf8)

# Charlie reads it *incorrectly* as windows-1252 and writes a twingled utf-8 version
twingled = utf8.decode("windows-1252").encode("utf-8")
print(twingled)

# detwingle by reading as utf-8 and writing as windows-1252 (it's really utf-8)
detwingled = twingled.decode("utf-8").encode("windows-1252")

assert utf8==detwingled

To fix the problem, I used python code like this:

with open("dirty.html","rb") as f:
    dt = f.read()
ct = dt.decode("utf8").encode("windows-1252")
with open("clean.html","wb") as g:
    g.write(ct)

(Because someone had inserted the twingled version into a correct UTF-8 document, I actually had to extract only the twingled part, detwingle it and insert it back in. I used BeautifulSoup for this.)

It is far more likely that you have a Charlie in content creation than that the web server configuration is wrong. You can also force your web browser to twingle the page by selecting windows-1252 encoding for a utf-8 document. Your web browser cannot detwingle the document that Charlie saved.

Note: the same problem can happen with any other single-byte code page (e.g. latin-1) instead of windows-1252.

score 18 · Answer 5 · answered Jul 15 '16 at 09:05

This sometimes happens when a string is converted from Windows-1252 to UTF-8 twice.

We had this in a Zend/PHP/MySQL application where characters like that were appearing in the database, probably due to the MySQL connection not specifying the correct character set. We had to:

Ensure Zend and PHP were communicating with the database in UTF-8 (was not by default)

Repair the broken characters with several SQL queries like this...

UPDATE MyTable SET 
MyField1 = CONVERT(CAST(CONVERT(MyField1 USING latin1) AS BINARY) USING utf8),
MyField2 = CONVERT(CAST(CONVERT(MyField2 USING latin1) AS BINARY) USING utf8);

Do this for as many tables/columns as necessary.
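When many tables and columns are affected, the statements can be generated rather than typed by hand. The helper below is a hypothetical sketch (the name `build_repair_sql` is not part of any library) that only builds the SQL text shown above. It does not escape identifiers, so it assumes trusted table and column names, and the output should be tried on a backup first.

```python
def build_repair_sql(table, columns):
    """Return one UPDATE statement that applies the
    latin1 -> binary -> utf8 round trip to each listed column."""
    assignments = ",\n".join(
        f"  {col} = CONVERT(CAST(CONVERT({col} USING latin1) AS BINARY) USING utf8)"
        for col in columns
    )
    return f"UPDATE {table} SET\n{assignments};"


print(build_repair_sql("MyTable", ["MyField1", "MyField2"]))
```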

You can also fix some of these strings in PHP if necessary. Note that because characters have been encoded twice, we actually need to do a reverse conversion from UTF-8 back to Windows-1252, which confused me at first.

mb_convert_encoding('â€™', 'Windows-1252', 'UTF-8');    // returns ’

Awsome. !! I tried whole internet nothing work, only this :) thanks bro... — Shurvir Mori, Sep 11 '21 at 09:02
Great and many thanks ! I was about to become crazy about this encoding issue ! — Philippe, Jul 05 '22 at 16:36

score 11 · Answer 6 · edited Apr 25 '16 at 15:41

You have a mismatch in your character encoding; your string is encoded in one encoding (UTF-8) and whatever is interpreting this page is using another (say ASCII).

Always specify your encoding in your http headers and make sure this matches your framework's definition of encoding.

Sample http header:

Content-Type: text/html; charset=utf-8
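To show the idea of keeping the declared charset and the actual bytes in agreement, here is a minimal WSGI sketch in Python. It is not the ASP.NET configuration the question needs; the function name `app` and the sample text are assumptions for illustration.

```python
def app(environ, start_response):
    """Minimal WSGI app whose declared charset matches the bytes it sends."""
    body = "It\u2019s working".encode("utf-8")  # encode explicitly as UTF-8
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),  # declare the same encoding
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

It can be served locally with the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.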

Setting encoding in asp.net

<configuration>
  <system.web>
    <globalization
      fileEncoding="utf-8"
      requestEncoding="utf-8"
      responseEncoding="utf-8"
      culture="en-US"
      uiCulture="de-DE"
    />
  </system.web>
</configuration>

Setting encoding in jsp

score 8 · Answer 7 · edited Dec 28 '13 at 06:59

If your content type is already UTF-8, then it is likely the data is already arriving in the wrong encoding. If you are getting the data from a database, make sure the database connection uses UTF-8.

If this is data from a file, make sure the file is encoded correctly as UTF-8. You can usually set this in the "Save as..." dialog of the editor of your choice.

If the data is already broken when you view it in the source file, chances are that it used to be a UTF-8 file but was saved in the wrong encoding somewhere along the way.
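One way to check what is really on disk, independent of any editor, is to try a strict UTF-8 decode of the raw bytes. The function name `is_valid_utf8` below is a made-up helper for this sketch.

```python
def is_valid_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return True
```

A file saved as Windows-1252 with a ’ in it fails this check. Note the limit of the test: text that was already double-encoded is still valid UTF-8, so a passing result does not prove the content is free of mojibake.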

score 6 · Answer 8 · answered Mar 08 '16 at 09:13

If someone gets this error on WordPress website, you need to change wp-config db charset:

define('DB_CHARSET', 'utf8mb4_unicode_ci');

instead of:

define('DB_CHARSET', 'utf8mb4');

Goran Jakovljevic

Thanks Mr. Life Savior – Agent K Aug 31 '21 at 16:29

score 4 · Answer 9 · answered Apr 16 '21 at 13:12

If the other answers haven't helped, you might want to check whether your database is actually storing the mojibake characters. I was viewing the text in utf-8, but I was still seeing the mojibake and it turned out that, due to a database upgrade, the text had been permanently "mojibaked".

In this case, one option is to "fix" the text with Python's ftfy package (or its JavaScript version).
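To check whether stored text is affected before reaching for a repair tool, a crude standard-library heuristic can flag suspicious rows. This sketch is not ftfy and is far less thorough; the function name `looks_double_encoded` and the sample rows are assumptions for illustration.

```python
def looks_double_encoded(text):
    """Heuristic: True if text survives a CP-1252 -> UTF-8 round trip and changes."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not representable in CP-1252, or the bytes are not valid UTF-8.
        return False
    return repaired != text


rows = ["It\u00e2\u20ac\u2122s broken", "It\u2019s fine", "plain ASCII"]
print([looks_double_encoded(r) for r in rows])  # [True, False, False]
```

The check can miss damaged text that mixes clean and broken characters in one string, which is the kind of case a dedicated library handles better.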

joe

I really needed this answer 5-odd years ago when I was writing a poor copy of the ftfy library. – Michael Aug 30 '21 at 15:22

score 1 · Answer 10 · answered Oct 23 '19 at 00:09

In DBeaver (or other editors), the script file you're working on may prompt you to save it as UTF-8, and that will change the characters:

â€“

into

ÃƒÂ¢Ã¢â€šÂ¬Ã¢â‚¬Å“

or

Ã¢â‚¬â€œ

Jeremy Thompson

score 0 · Answer 11 · answered Sep 04 '15 at 10:41

You must have copied and pasted text from a Word document. Word documents use smart quotes. You can replace the quote with the special character (’) or simply type a straight apostrophe (') in your HTML editor.

I'm sure this will solve your problem.

Kaushal Panchal

score -3 · Answer 12 · edited Oct 14 '13 at 09:08

The same thing happened to me with the '–' character (long minus sign).
I used this simple replace to resolve it:

htmlText = htmlText.Replace('–', '-');

answered Oct 14 '13 at 08:49

TomerB

The OP's problem is mojibake, not similar Unicode characters. – Cole Tobin Dec 28 '13 at 07:04
