5

I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で (roughly, "under website maintenance") into Expression Blend, the text appears verbatim in the code. I think that is bad for compatibility and that it should be replaced by proper HTML entities.

I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx, but it returns the same Japanese text unchanged.

  1. Is it correct, from the point of view of browser compatibility, to paste that Japanese text directly into the source code of an HTML page?
  2. If not, what is the correct HTML encoding of that text? Better yet, is there a tool (ideally free and online) that I can use to convert non-ASCII characters to HTML entities?
usr-local-ΕΨΗΕΛΩΝ
  • 26,101
  • 30
  • 154
  • 305
  • 3
    How about encoding and serving Unicode HTML pages? That way, you don't have to worry about HTML entities. – Andrew Sep 28 '12 at 23:46
  • On a complete sidenote... that Japanese text is slightly strange. Maybe something more like ただいまメンテナンス中です or 只今メンテナンス中です (roughly, "we are currently under maintenance")? – jkerian Sep 16 '15 at 14:24
  • I don't even remember what that text was for or why I pasted it into Blend :-) – usr-local-ΕΨΗΕΛΩΝ Sep 16 '15 at 14:30

3 Answers

2

I think it's bad for compatibility and should be replaced by proper HTML entities.

Quite the opposite, actually: your preference should be not to use HTML entities, but rather to correctly declare the document encoding as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not use it, since it is a well-supported and widely adopted standard.

Some of those points have been summarised previously:

UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.

UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.

[For example] Wikipedia... actually goes through articles and converts character entities to their corresponding real characters for the sake of user-friendliness and searchability.
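Declaring the encoding that the points above rely on is a one-liner in the document head. A minimal sketch of such a page, using the maintenance string from the question (the file itself must be saved as UTF-8 for this to work):

```html
<!DOCTYPE html>
<html lang="ja">
<head>
  <!-- Tells the browser to decode the raw bytes as UTF-8,
       so the Japanese characters below render correctly -->
  <meta charset="utf-8">
  <title>メンテナンス</title>
</head>
<body>
  <p>ウェブサイトのメンテナンスの下で</p>
</body>
</html>
```

Serving the page with a `Content-Type: text/html; charset=utf-8` HTTP header achieves the same thing and takes precedence over the meta tag.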

Community
  • 1
  • 1
Oleg
  • 24,465
  • 8
  • 61
  • 91
1

As long as you mark your web page as UTF-8, either in the HTTP headers or the meta tags, having foreign characters in your web pages should be a non-issue. Alternatively, you could encode/decode these strings using the encodeURI/decodeURI functions in JavaScript:

encodeURI('ウェブサイトのメンテナンスの下で')
// returns "%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"

decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
// returns "ウェブサイトのメンテナンスの下で"

If you are looking for a tool to convert a bunch of static strings, you can simply run the encodeURI/decodeURI functions from a web page's developer console (e.g. Firebug for Mozilla Firefox). Note that these produce percent-encoded URI escapes, which are suitable for URLs, not HTML entities for document text. Hope this helps!
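If what the question actually needs is HTML entities rather than URI escapes, a few lines in the same console can do the conversion. A sketch, assuming a modern JavaScript engine; `toHtmlEntities` is a made-up helper name, not a built-in:

```javascript
// Convert every character outside the ASCII range to a numeric HTML entity.
// Array.from iterates by code point, so characters outside the Basic
// Multilingual Plane (surrogate pairs) are handled as single units.
function toHtmlEntities(str) {
  return Array.from(str)
    .map(ch => {
      const cp = ch.codePointAt(0);
      return cp > 0x7f ? `&#${cp};` : ch; // keep plain ASCII as-is
    })
    .join('');
}

console.log(toHtmlEntities('€10')); // "&#8364;10"
console.log(toHtmlEntities('ウェブサイトのメンテナンスの下で'));
```

Pasting the second result into an ASCII-only HTML file renders the same Japanese text as the raw characters would in a UTF-8 file.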

gogsrox
  • 66
  • 3
0

HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no way to represent "€". If you want to use that character in an ASCII-encoded HTML document, you have to encode it as &euro; (or the numeric form &#8364;) or not use it at all.

If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
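To make the euro example concrete: a numeric entity simply carries the character's Unicode code point. A sketch, assuming any JavaScript console:

```javascript
// "€" is U+20AC; its decimal code point is what the numeric entity carries.
// An ASCII-only document can express the character only via such an entity.
const euro = '\u20AC';                       // "€"
const entity = `&#${euro.codePointAt(0)};`;  // build the numeric entity
console.log(entity);                         // "&#8364;"
```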

See http://kunststube.net/frontback for some more information.

deceze
  • 510,633
  • 85
  • 743
  • 889