
I'm migrating the front-end of a site from an old YUI2 framework to jQuery/Backbone. The PHP/MySQL back-end hasn't changed. All is well, except that UTF-8 characters sent via Backbone save (via $.ajax) are getting mangled and I can't figure out why.

Here's what I do know:

  1. The back-end handles UTF-8 fine. It hasn't changed as part of this rebuild. I know that's true because when I change the config to load the old YUI2 front-end, UTF-8 characters work fine. They're escaped in JavaScript using escape(string), passed via YAHOO.util.Connect.asyncRequest as JSON in an XMLHttpRequest, unescaped and saved in the database as UTF-8, fully readable and nice.
  2. In the new front-end, I've added <meta charset="UTF-8"> and <meta http-equiv="content-type" content="text/html; charset=UTF-8"> to all page headers. The old front-end didn't have these settings. I only mention that because it's a difference.
  3. In the new front-end, UTF-8 characters work fine when I save them as a <form> submit.
  4. In the new front-end, the request Content-Type looks fine in the console: Content-Type: application/x-www-form-urlencoded; charset=UTF-8

How am I passing data in the new front-end?

  • Sometimes via a regular Backbone model.save(), other times passing data in options like this:

    var text = $('#input-' + targetId).val();
    
    var atts = {};
    atts['target_id'] = targetId;
    atts['user_id'] = userId;
    atts['text'] = text;
    
    var comment = new Comment(atts);
    
    comment.save(
        {},
        {
            type: 'POST',
            url: '/api/comment?',
            data: atts,
            processData: true,
            success: function(comment, response){
               // success handling
            },
            error: function(model, response){
               // error handling
            }
        }
    );
    

So, what do these mangled special characters look like?

  • As entered in the input: テクス テクサン テクス テクサン

  • When I pass completely unescaped, they look fine in the request in the console in the Form Data section: text: テクス テクサン テクス テクサン, but mangled in the database as ãã¯ã¹ ãã¯ãµã³ ãã¯ã¹ ãã¯ãµã³. Perhaps this is a clue, I don't know. I've always escaped user-entered text when passing via AJAX.

  • When I escape(text), I get text:%u30C6%u30AF%u30B9%20%u30C6%u30AF%u30B5%u30F3%20%u30C6%u30AF%u30B9%20%u30C6%u30AF%u30B5%u30F3 in the console, and テクス%20テクサン%20テクス%20テクサン in the database.

That's better, but it's different from the old front end, which uses escape(text), passes %u30C6%u30AF%u30B9%20%u30C6%u30AF%u30B5%u30F3%20%u30C6%u30AF%u30B9%20%u30C6%u30AF%u30B5%u30F3, shows in the console as text: (unable to decode value) and saves in the database unescaped as テクス テクサン テクス テクサン

  • Of course, it's 2016 now and we all know escape() should not be used. We should use encodeURIComponent() instead. So, when I encodeURIComponent(text), here's what I get in the console: text: %E3%83%86%E3%82%AF%E3%82%B9%20%E3%83%86%E3%82%AF%E3%82%B5%E3%83%B3%20%E3%83%86%E3%82%AF%E3%82%B9%20%E3%83%86%E3%82%AF%E3%82%B5%E3%83%B3 which is saved in the database as %E3%83%86%E3%82%AF%E3%82%B9%20%E3%83%86%E3%82%AF%E3%82%B5%E3%83%B3%20%E3%83%86%E3%82%AF%E3%82%B9%20%E3%83%86%E3%82%AF%E3%82%B5%E3%83%B3 That technically works, and I can always decodeURIComponent when displaying this text, but that's a real pain and it's just masking the issue.

  • I've also tried unescape(encodeURIComponent(text)) with the following result: text:ãã¯ã¹ ãã¯ãµã³ ãã¯ã¹ ãã¯ãµã³ in the console, ãÂÂã¯ã¹ ãÂÂã¯ãµã³ ãÂÂã¯ã¹ ãÂÂã¯ãµã³ in the database.

It seems that there's some sort of double-encoding going on, or perhaps the back-end was built to handle the specific format that's passed via the YUI2 Async request. I don't know.
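To make the comparison concrete, here is a minimal sketch of the three client-side encodings tried above, runnable in Node.js or a browser console. The variable names are only for illustration. It does not touch the back-end; it only shows that unescape(encodeURIComponent(text)) yields one Latin-1 character per UTF-8 byte, which is consistent with the ã-style output seen in the database (UTF-8 bytes being read as a single-byte charset somewhere along the way is an assumption, not something confirmed here).

```javascript
// Sample input taken from the text above (first word only).
var text = 'テクス';

// escape() emits non-standard %uXXXX sequences for characters above U+00FF.
var escaped = escape(text);

// encodeURIComponent() percent-encodes the UTF-8 bytes of each character.
var uriEncoded = encodeURIComponent(text);

// unescape(encodeURIComponent()) maps every UTF-8 byte to one Latin-1
// character, so three characters become a nine-character "byte string".
var byteString = unescape(encodeURIComponent(text));

// The inverse trick recovers the original text from the byte string.
var roundTrip = decodeURIComponent(escape(byteString));

console.log(escaped);           // %u30C6%u30AF%u30B9
console.log(uriEncoded);        // %E3%83%86%E3%82%AF%E3%82%B9
console.log(byteString.length); // 9
console.log(roundTrip === text); // true
```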

Any ideas for what I should try next? What are the best practices?

byron

1 Answer


Now that I've had a night to sleep on it, I've realized a few things and I think I've found a solution.

It's clear now that the old front-end wasn't passing data correctly; that's evidenced by the text: (unable to decode value) shown in the console when sending the request. Somehow, the PHP back-end was able to handle the passed text even though there was no decoding in the API or DB storage classes. That's a mystery for another day.
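A small sketch of why the console probably could not decode the old value: the %uXXXX form produced by escape() is not valid percent-encoding, so the standard decoder rejects it. Whether the browser console uses exactly this decoder is an assumption; the variable names are illustrative.

```javascript
// Value as sent by the old front-end via escape() (first word only).
var legacyValue = '%u30C6%u30AF%u30B9';

var decodeError = null;
try {
    decodeURIComponent(legacyValue);
} catch (e) {
    // "%u" is not followed by two hex digits, so a URIError is thrown.
    decodeError = e;
}

// unescape() is the built-in counterpart that understands %uXXXX.
var recovered = unescape(legacyValue);

console.log(decodeError instanceof URIError); // true
console.log(recovered);                       // テクス
```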

Here's what I did to fix the problem:

  1. Pass text from the front-end as encodeURIComponent(text)
  2. Decode the text in the PHP back-end API using $comment->set_text(urldecode(Request::get('text')));

The text is stored in the DB unescaped as readable UTF-8 characters and I don't need to do anything special on read/display. I will need to add the urldecode to all of my API endpoints on the back-end, but that feels like a solid approach, so I think it's resolved.
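One possible reason the extra urldecode is needed: with processData: true, $.ajax already percent-encodes form values, so calling encodeURIComponent first encodes the text twice. The sketch below uses URLSearchParams as a stand-in for jQuery's serializer and for the server's automatic form decoding (both stand-ins are assumptions made so it runs without jQuery or PHP). It only covers the transport encoding and does not explain the mangled result seen when sending the text unescaped.

```javascript
var raw = 'テクス';

// Library-only encoding: the value is percent-encoded exactly once.
var encodedOnce = new URLSearchParams({ text: raw }).toString();

// Manual encodeURIComponent() plus library encoding: each "%" is itself
// encoded as "%25", so the value is encoded twice.
var encodedTwice = new URLSearchParams({
    text: encodeURIComponent(raw)
}).toString();

// Simulate the single automatic decode a form-handling server performs.
var afterOnce = new URLSearchParams(encodedOnce).get('text');
var afterTwice = new URLSearchParams(encodedTwice).get('text');

console.log(encodedOnce);  // text=%E3%83%86%E3%82%AF%E3%82%B9
console.log(encodedTwice); // text=%25E3%2583%2586%25E3%2582%25AF%25E3%2582%25B9
console.log(afterOnce);    // テクス
console.log(afterTwice);   // %E3%83%86%E3%82%AF%E3%82%B9 (still needs a second decode)
```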

I'd be interested to hear thoughts on the use of encodeURIComponent on the front-end and urldecode on the back-end. Is this the best way to solve the problem?

byron
  • You actually never want to use the `escape()` JavaScript function. It's broken and has [long been deprecated](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape). For how to set up PHP on the server side, read [UTF-8 all the way through](http://stackoverflow.com/questions/279170/utf-8-all-the-way-through). On a broader note, you never want to do any manual data encoding at all. Libraries have been written on the client and the server that encode/decode data automatically for you. Use them. – Tomalak Feb 17 '16 at 19:54
  • @Tomalak, for front-end/JavaScript, which libraries are you talking about? – byron Feb 17 '16 at 22:47
  • Any library that does Ajax (or even higher abstractions, like the stuff Backbone provides) will encode values for you. You only work with data itself, the underlying library manages UTF-8 or JSON encoding. If you find yourself building your own encoding routines then it's likely that you are either missing API functions in the libraries that you already use - or that you should add a library that does it. – Tomalak Feb 18 '16 at 07:19