Convert Japanese to HTML entities

Question

When sending a form with Japanese characters through this AJAX function, the characters are sent to the server as raw Japanese text and the data is stored as ¿ in the database.

var strAction = "/_ajax/save/"+sSavePage+"?action=saveseo&intFolderID="+iFolderID+"&intPageID="+iPageID;
var frm = $("#frmSmartPage");    
var data = frm.serialize();

$.ajax({
    type: frm.attr('method'),
    url: strAction,
    data: data,
    success: function (data) {
        alert('ok');
    }
});

On the same page the form can also be posted through a regular submit. The Japanese characters are then converted to the &#<number> format.

<form method="post" target="ajax_save" autocomplete="off" name="frmSmartPage" id="frmSmartPage" action="<%=constBetaPath%>/_ajax/save/pages_save.asp?intPageID=<%=intPageID%>&intFolderID=<%=intFolderID%>&action=save" onSubmit="return validateSave()">

I would prefer to convert the Japanese characters to the &#<number> format in the AJAX call, but so far I haven't had any luck.

Things I've already tried:

var data = unescape(encodeURIComponent(frm.serialize()));
---
var data = escape(frm.serialize());
---
accepts: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
---
contentType: 'application/x-www-form-urlencoded;' 
---
contentType: 'application/x-www-form-urlencoded; charset=UTF-8'

EDIT:

HTML encoding:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

EDIT 2:

Backend code is decoding the iso-8859-1 to UTF8

'******************************************************************************************************************
'' @SDESCRIPTION:   Decodes from ISO-8859-1 to UTF8
'' @PARAM:          - s [string]: your string to be decoded
'' @RETURN:         [string] decoded string
'' @DESCRIPTION:    Useful when saving special chars from an ISO-8859-1 post to a UTF-8 page, for example via AJAX
'******************************************************************************************************************
public function DecodeUTF8(s)
  dim i
  dim c
  dim n

  s = s + " "

  i = 1
  do while i <= len(s)
    c = asc(mid(s,i,1))
    if c and &H80 then
      n = 1
      do while i + n < len(s)
        if (asc(mid(s,i+n,1)) and &HC0) <> &H80 then
          exit do
        end if
        n = n + 1
      loop
      if n = 2 and ((c and &HE0) = &HC0) then
        c = asc(mid(s,i+1,1)) + &H40 * (c and &H01)
      else
        c = 191 
      end if
      s = left(s,i-1) + chr(c) + mid(s,i+n)
    end if
    i = i + 1
  loop
  DecodeUTF8 = Left(s, Len(s)-1)
end function
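
To see why Japanese text ends up as ¿, the routine above can be traced with a rough JavaScript port. This is only a sketch: it assumes each character the VBScript code sees holds exactly one byte of the UTF-8 request body, and the name `decodeLikeBackend` is made up for illustration. Only two-byte sequences are converted; anything longer (Japanese characters take three bytes in UTF-8) falls into the `else` branch and becomes character 191, which is ¿ in ISO-8859-1.

```javascript
// Hypothetical port of the DecodeUTF8 routine, working on an array of byte values.
function decodeLikeBackend(bytes) {
    var out = [];
    var i = 0;
    while (i < bytes.length) {
        var c = bytes[i];
        var n = 1;
        if (c & 0x80) {
            // Count the continuation bytes (10xxxxxx) that follow the lead byte.
            while (i + n < bytes.length && (bytes[i + n] & 0xC0) === 0x80) {
                n++;
            }
            if (n === 2 && (c & 0xE0) === 0xC0) {
                // Two-byte sequence: fold it back into one Latin-1 value.
                c = bytes[i + 1] + 0x40 * (c & 0x01);
            } else {
                // Everything else is replaced with 191, i.e. "¿".
                c = 191;
            }
        }
        out.push(c);
        i += n;
    }
    return out;
}

// "é" is C3 A9 in UTF-8 and survives as 0xE9.
var latin = decodeLikeBackend([0xC3, 0xA9]);        // → [233]
// "日" is E6 97 A5 in UTF-8 and collapses to 191.
var japanese = decodeLikeBackend([0xE6, 0x97, 0xA5]); // → [191]
```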

SOLUTION: Thanks to Álvaro González's reply, I was able to create a workaround by building a temporary form and submitting that instead.

var strAction = "/_ajax/save/"+sSavePage+"?action=saveseo&intFolderID="+iFolderID+"&intPageID="+iPageID;
var newForm = $('<form />');
var orginalForm = $("#frmSmartPage");

newForm.append(orginalForm.clone().children());
newForm.attr('method', 'post');
newForm.attr('target', 'ajax_save');
newForm.attr('action', strAction);
newForm.css('display', 'none');

orginalForm.parent().append(newForm);

var target = $("#ajax_save");

target.one('load', function () {
    newForm.remove();  
});

newForm.submit();

Do you by chance know or can determine the encoding of any of the involved actors (source code, HTML, database...)? Text encoding is a plain simple concept but it's impossible to get right by pure guessing. — Álvaro González, Aug 10 '16 at 09:27
Are you asking to convert japanese characters to HTML Entities? If so, check out [this table](http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml) and then use [this](http://www.amp-what.com/unicode/search/30a2) to find the HTML entity for the character. — evolutionxbox, Aug 10 '16 at 09:41
For reference, since you don't seem to know, the `&#<number>;` format is known as "HTML entities". Knowing this term will make things easier for you to search for similar questions here and other info across the web. However your best solution here will be to get everything encoded using UTF-8, from the start to the end. Don't mess around trying to convert encodings or use entities; it always ends in tears and frustration. — Simba, Aug 10 '16 at 09:43
@JimmyMattsson: That's the only encoding that doesn't really matter if you are using HTML numeric character references anyway :-/ — Bergi, Aug 10 '16 at 09:44
@Simba Thanks! I knew it in the back of my head but couldn't pull it out on the whole morning — Jimmy Mattsson, Aug 10 '16 at 09:46

score 2 · Accepted Answer · edited May 23 '17 at 12:22

You have a serious root problem: the ISO-8859-1 charset (also known as Latin-1, which should already give you a clue) is designed for the Latin script used by Western European languages and simply cannot encode Japanese characters. Everywhere else you are using UTF-8, which is the only sensible encoding choice as of today and doesn't have any restriction of this kind, but ISO-8859-1 is the weak link in the chain that makes it all terribly complicated.

To make it worse, I spot some details that worry me. You are using AJAX to send the information and, since AJAX mandates UTF-8, jQuery will take care of converting it to UTF-8 automatically. However, server-side code incorrectly assumes ISO-8859-1 and will make a bogus conversion. If this code is already in Production, it has possibly been corrupting the data you already have.
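
A quick way to see the UTF-8 behaviour for yourself: `jQuery.param` (which `serialize()` relies on) percent-encodes values with `encodeURIComponent`, and that function always emits UTF-8 bytes regardless of the page charset. The sample text and the helper name `utf8PercentEncode` below are made up for illustration.

```javascript
// Hypothetical helper; it only wraps the built-in to make the point explicit.
function utf8PercentEncode(value) {
    return encodeURIComponent(value);
}

// Each Japanese character becomes three percent-encoded UTF-8 bytes.
var encoded = utf8PercentEncode("日本"); // → "%E6%97%A5%E6%9C%AC"
```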

You basically have two choices:

- Switch everything to UTF-8. This will save you all encoding issues in the future but requires a careful migration of the current codebase.
- Figure out a way to encode Japanese as ISO-8859-1 in client-side code and decode it properly in server-side code. Thankfully, browsers are already aware of the problem and (since HTML is their master language) they normally decide to use HTML entities (that's what those &#<number> sequences are and where they come from) when they have to submit a form that contains characters not supported by the document encoding.

In this case, what you need to do is to change your server-side code to:
1. Do not make any encoding conversion (data is already UTF-8)
2. Decode HTML entities (taking into account that the string is UTF-8)
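
For the client-side half of the second option, a minimal sketch of the idea follows. It assumes that "not supported by ISO-8859-1" can be approximated as "code point above 0xFF" (browsers actually treat that label as windows-1252, so this is a simplification), and both function names are invented here. The decoding function is included only to illustrate the round trip; the real decoding would have to happen in the server-side code.

```javascript
// Hypothetical helper: replace characters outside Latin-1 with numeric
// character references, roughly as a browser does on a normal form submit.
function toNumericReferences(text) {
    var out = "";
    for (var ch of text) { // for...of walks the string by code point
        var cp = ch.codePointAt(0);
        out += cp > 0xFF ? "&#" + cp + ";" : ch;
    }
    return out;
}

// Hypothetical inverse, shown for illustration only.
function fromNumericReferences(text) {
    return text.replace(/&#(\d+);/g, function (match, digits) {
        return String.fromCodePoint(parseInt(digits, 10));
    });
}

var sample = toNumericReferences("日本語 café"); // → "&#26085;&#26412;&#35486; café"
```

One possible way to use this with the AJAX call is to run each `value` from `frm.serializeArray()` through `toNumericReferences` and pass the resulting array as `data`. Note that text a user typed literally as `&#26085;` can no longer be told apart from an encoded character afterwards.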

I think "*server-side code incorrectly assumes ISO-8859-1 for UTF8 Ajax request*" is the only problem. The downstream with ISO-8859-1 encoding *and entities* is working fine as-is. — Bergi, Aug 10 '16 at 20:44
@Bergi Well, HTML entities are entirely meaningless outside HTML context. `&euro;` is by no means the same as `€`, unless you explicitly decide so. — Álvaro González, Aug 11 '16 at 06:05
Sure, it just sounded like the only downstream was HTML pages. If not, then a proper solution has to be found for those other channels as well. — Bergi, Aug 11 '16 at 07:09
