
I'm having trouble decoding Greek text when using ajaxed infinite scrolling. It's the first time I'm dealing with non-English data, but as far as I understand every single Greek character needs to be escaped, because otherwise Ajax breaks trying to send the characters.

I make it Ajax-friendly by escaping it with this (PHP):

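function utf8ize($d) {  // Encoding workaround
    if (is_array($d)) {
        // Recurse so that strings nested inside arrays are converted too
        foreach ($d as $k => $v) {
            $d[$k] = utf8ize($v);
        }
    } elseif (is_string($d)) {
        // utf8_encode() converts an ISO-8859-1 string to UTF-8
        return utf8_encode($d);
    }

    // Converted arrays and non-string values are returned here
    return $d;
}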

so this

Το γράμμα άλφα (ἄλφα) είναι το πρώτο γράμμα του ελληνικού αλφαβήτου.

becomes this:

Το γÏάμμα άλφα (ἄλφα) είναι το Ï€Ïώτο γÏάμμα του ÎµÎ»Î»Î·Î½Î¹ÎºÎ¿Ï Î±Î»Ï†Î±Î²Î®Ï„Î¿Ï….

which is how the text looks raw on my UK-locale database. But now I am not sure how to convert it back to Greek on the front-end.

Normally I would successfully decode non-Basic-Latin words like café, fiancé, façade using PHP's utf8_encode on the back end and then JavaScript's decodeURIComponent on the front end, but with Greek this error comes up:

URIError: URI malformed

Is there a built-in jQuery function to convert UTF-8 into another format that supports Greek on the front end?

This is how it looks on default load:

[screenshot]

And this is what happens when I try to inject the same text via Ajax:

[screenshot]

Pringles
  • This is a PHP problem, not a jQuery problem. jQuery (read "the browser") can handle Greek text just fine, you are just not *sending* it correctly. There is absolutely nothing you need to do (or can do) on the client, you must fix the server side. This means, plain and simple: Send the data as UTF-8, and *announce* it as UTF-8 (via the Content-Type header). Once this is fixed, the client will start to work. – Tomalak Jan 30 '17 at 15:00
  • This is definitely where you should start: http://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Tomalak Jan 30 '17 at 15:02
  • Works for me: https://jsfiddle.net/nhyeLu5v/ – Hackerman Jan 30 '17 at 15:04
  • @Hackerman This is a kludge. What's wrong with simply sending UTF-8 to the client? There is nothing "Ajax friendly" in the way the OP prepares the data before sending it; none of this is necessary in the first place. – Tomalak Jan 30 '17 at 15:08
  • "every single Greek character needs to be escaped, because otherwise Ajax breaks" - It's roughly the opposite: AJAX has to use UTF-8 or it'll break. If your application is not using UTF-8, you possibly won't be able to handle Greek at all. – Álvaro González Jan 30 '17 at 15:47
  • "UK-locale database"? What about the encoding? Are you getting UTF-8 back from it? – Harry Pehkonen Jan 30 '17 at 15:56
  • @Tomalak....did you read the whole question...the OP mentions that `decodeURIComponent` doesn't work...I am just pointing out that it does! – Hackerman Jan 30 '17 at 16:14
  • @HarryPehkonen By UK locale I meant utf8_unicode_ci collation, which would store "café" as "café" as opposed to some databases I've seen that store the é exactly as it looks like. – Pringles Jan 30 '17 at 16:24
  • @Koffeehaus One thing you can do to make sure you are actually dealing with UTF-8 characters in PHP is to go through each byte in the string, get its code with `ord($txt[$i])`, and error_log it. Optionally, use `dechex(ord($txt[$i]))` to get the hex. For é, you should, of course, see c3 and a9. The way I see it, the encoding is getting messed up SOMEWHERE, and now it's just a matter of finding that place. By the way, since it looks like you are using MySQL, are you doing `SET NAMES 'utf8'`? – Harry Pehkonen Jan 30 '17 at 16:35
  • @Hackerman, sorry the example I've supplied was too basic. The Greek text also has links in, which is what seems to break it https://jsfiddle.net/nhyeLu5v/3/. – Pringles Jan 30 '17 at 16:42
  • Can you post the full `greek` text? If I apply utf8_decode on that text I just get garbage-like text. – Hackerman Jan 30 '17 at 16:46
  • @Hackerman, I have finally figured it out thanks to you! The issue was not with Greek text as such, but with the way Greek hyperlinks were handled. I've posted the full Greek text in case you're still interested to have a look. – Pringles Jan 30 '17 at 18:17
  • *would store "café" as "café" as opposed to some databases I've seen that store the é exactly as it looks like* Sorry but that isn't how text storage works. Absolutely all systems store binary octets (often represented by zeroes and ones, but actually corresponding to different voltage levels or any other dual physical condition) and the character set you chose determines what those bits look like. – Álvaro González Jan 31 '17 at 08:18
  • @ÁlvaroGonzález, yes but my query wasn't about storing bytes on the physical layer, but on collation and storage of human-language data at the Presentation Tier. https://en.wikipedia.org/wiki/OSI_model – Pringles Feb 09 '17 at 14:27
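
The byte-level check suggested in the comments (looking for c3 and a9 for é) can also be tried in a browser console or in Node. This is only a minimal sketch and not something from the thread itself: `utf8Hex` is a hypothetical helper name, and it assumes an environment where `TextEncoder` is available.

```javascript
// Hypothetical helper: lists the UTF-8 bytes of a string as two-digit hex values.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

console.log(utf8Hex('\u00e9')); // é -> c3 a9
console.log(utf8Hex('\u03b1')); // α -> ce b1
```

If a string that should contain a single é shows more than these two bytes, it has probably been encoded more than once somewhere along the way.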

1 Answer


I figured out the problem thanks to the comments from @Hackerman and @HarryPehkonen.

The original problem was that the Greek text also had hyperlinks with mixed characters.

For example, Greek links have Latin-based domain names, but use Greek for semantic slugs.

[screenshot]

These look Greek in the URL bar, but are actually already URL-encoded and look like this when copy-pasted into a text editor:

https://el.wikipedia.org/wiki/%CE%95%CE%BB%CE%BB%CE%B7%CE%BD%CE%B9%CE%BA%CF%8C_%CE%B1%CE%BB%CF%86%CE%AC%CE%B2%CE%B7%CF%84%CE%BF

And it's the last part that seemed to break things.
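
To check that the last part of that link really is the Greek title in percent-encoded form, it can be decoded with `decodeURIComponent`. This is a minimal sketch: `safeDecode` is a hypothetical helper added for illustration, and the two malformed inputs are made-up examples of when a `URIError` is thrown, not the actual input that failed in the question.

```javascript
// The slug from the link above, exactly as it appears when copy-pasted.
const slug = '%CE%95%CE%BB%CE%BB%CE%B7%CE%BD%CE%B9%CE%BA%CF%8C_%CE%B1%CE%BB%CF%86%CE%AC%CE%B2%CE%B7%CF%84%CE%BF';

// Each %XX pair stands for one UTF-8 byte; decodeURIComponent reassembles the characters.
const decoded = decodeURIComponent(slug);
console.log(decoded); // Ελληνικό_αλφάβητο

// Hypothetical helper: returns null instead of throwing on malformed input.
function safeDecode(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err; // anything else is unexpected, so let it propagate
  }
}

console.log(safeDecode('%CE'));  // null: %CE is only the first byte of a two-byte sequence
console.log(safeDecode('100%')); // null: a bare % is not a valid escape
```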

So, given this sample input:

Το γράμμα <b >άλφα</b> (<i >ἄλφα</i>) είναι το πρώτο γράμμα του <a href="https://el.wikipedia.org/wiki/%CE%95%CE%BB%CE%BB%CE%B7%CE%BD%CE%B9%CE%BA%CF%8C_%CE%B1%CE%BB%CF%86%CE%AC%CE%B2%CE%B7%CF%84%CE%BF" title="Ελληνικό αλφάβητο" >ελληνικού αλφαβήτου</a>.

Trying to utf8_encode and then json_encode a string that already contains URL-encoded sections resulted in a string that was neither plain UTF-8 text nor valid URL encoding when decoded back on the front end.

Modifying my utf8ize() function to do an extra iconv('UTF-8', 'UTF-8', $d) fixed the problem.

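function utf8ize($d) {  // Encoding workaround
    if (is_array($d)) {
        // Recurse so that strings nested inside arrays are converted too
        foreach ($d as $k => $v) {
            $d[$k] = utf8ize($v);
        }
    } elseif (is_string($d)) {
        // Run the string through iconv() first, then utf8_encode() as before
        return utf8_encode(iconv('UTF-8', 'UTF-8', $d));
    }

    // Converted arrays and non-string values are returned here
    return $d;
}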
Pringles
  • Many people use utf8_encode() guessing its functionality from its name, without actually looking it up in the [documentation](http://php.net/manual/en/function.utf8-encode.php). – Álvaro González Jan 31 '17 at 08:21
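
The linked documentation describes utf8_encode() as a conversion from ISO-8859-1 to UTF-8, so applying it to text that is already UTF-8 encodes it a second time. The sketch below imitates that effect in JavaScript so it can be tried in a browser console. `simulateUtf8Encode` and `undoDoubleEncoding` are hypothetical names (not PHP or jQuery functions), and the reversal assumes every character of the damaged string is below U+0100.

```javascript
// Imitates utf8_encode() applied to already-UTF-8 text:
// each UTF-8 byte is treated as one ISO-8859-1 character.
function simulateUtf8Encode(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// Reverses the damage: maps each character back to one byte,
// then decodes the byte sequence as UTF-8.
function undoDoubleEncoding(text) {
  const bytes = Uint8Array.from(text, (ch) => ch.charCodeAt(0));
  return new TextDecoder('utf-8').decode(bytes);
}

const original = 'caf\u00e9 \u03b1\u03bb\u03c6\u03b1'; // "café αλφα"
const damaged = simulateUtf8Encode(original);

console.log(damaged.length > original.length);          // true: every non-ASCII character became two
console.log(undoDoubleEncoding(damaged) === original);  // true
```

Reversing the damage on the client is best seen as a diagnostic aid. As the comments on the question point out, the cleaner route is to send the data as UTF-8 from the server and announce it as UTF-8, rather than re-encoding text that is already UTF-8.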