I've published an app, and I find some of the comments to be like this: РекамедÑ

I have googled a lot, but I cannot decode it so that the comment displays properly. This is how it is stored in the database; it may be Cyrillic, but I could not decode it as such either. Any clue on how to make sense of this kind of comment?

Filip Luchianenco
  • It was probably doubly HTML encoded. When decoding it twice using https://mothereff.in/html-entities the result is `РекамедÑ`. That *could* be botched Unicode data – Pekka Feb 08 '16 at 15:19
  • and how come a simple user could write that from his mobile phone? – Filip Luchianenco Feb 08 '16 at 15:20
  • This is not what users have typed in, this is how your input form has screwed it. You need to find and fix the bug, not 'decode' the accidental garbage. – hamstergene Feb 08 '16 at 15:21
  • It could be that they are entering comments say in UTF-8, but in a non-western character set. Then probably a silly, misguided server-side "sanitation" routine garbles the data. – Pekka Feb 08 '16 at 15:21
  • Mismatched text encoding somewhere in your pipeline -- the user didn't botch it up. – Daniel Beck Feb 08 '16 at 15:22
  • the thing is that other comments are fine. what could this user type so that the output is this? – Filip Luchianenco Feb 08 '16 at 15:22
  • `what could this user type so that the output is this?` هذا، على سبيل المثال – Pekka Feb 08 '16 at 15:22
  • లేదా ఈ! ఏ ఆలోచన లాంగ్వేజ్గా ఉంది – Pekka Feb 08 '16 at 15:23
  • Anything at all, in a locale or language that uses a different text encoding than you're set up for. (When in doubt, just UTF-8 everything; these days that'll get you 90% of the way there at least) – Daniel Beck Feb 08 '16 at 15:24
  • Thank you guys. The problem was with Russian language input. It would be great if any of you can post a formatted answer so that I can accept it. – Filip Luchianenco Feb 08 '16 at 15:29

1 Answer

These appear to be doubly encoded HTML entities. For example, & was turned into &amp;, and that was then turned again into &amp;amp;.
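To make the double encoding concrete, here is a minimal Python sketch (Python's `html` module is only standing in for whatever your server actually runs, and the sample string is made up):

```python
import html

original = "Tom & Jerry"           # made-up sample text

once = html.escape(original)       # first pass:  & becomes &amp;
twice = html.escape(once)          # second pass: the & of &amp; is escaped again

print(once)
print(twice)

# Undoing it takes two decoding passes as well.
restored = html.unescape(html.unescape(twice))
print(restored == original)
```

A single `html.unescape` call on `twice` only gets you back to `once`, which is why the stored comments still look like entity soup after one round of decoding.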

Decoding the data twice with an online tool such as https://mothereff.in/html-entities (there are many others) gives:

РекамедÑ

That could be Unicode data, e.g. UTF-8 text in a non-Western script such as Cyrillic or Arabic, that either

  1. was misinterpreted as single-byte input, or
  2. was garbled by a misguided "sanitation" routine, possibly a call or two to PHP's htmlentities(). Before PHP 5.4 that function assumed the single-byte ISO-8859-1 encoding by default, so calling it on UTF-8 data could be the whole source of the problem.
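Here is a rough Python sketch of how that kind of garbling could arise and how it might be reversed. Everything in it is an assumption for illustration: the sample comment is made up, `entity_encode` is a hypothetical stand-in for the server-side routine (not PHP's actual implementation), and Latin-1 is assumed as the wrongly applied single-byte encoding.

```python
import html
from html.entities import codepoint2name


def entity_encode(text: str) -> str:
    """Hypothetical stand-in for a server-side entity encoder:
    replace every character that has a named HTML entity."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in text
    )


user_input = "Привет"  # made-up Cyrillic comment

# Step 1: the UTF-8 bytes are misread as a single-byte encoding (assumed Latin-1).
misread = user_input.encode("utf-8").decode("latin-1")

# Step 2: two rounds of entity encoding on the misread text.
once = entity_encode(misread)
stored = entity_encode(once)  # on this pass only the "&" characters change

print(stored)

# Reversal: undo both entity passes, then undo the wrong charset guess.
decoded = html.unescape(html.unescape(stored))
recovered = decoded.encode("latin-1").decode("utf-8")
print(recovered == user_input)
```

Whether real data can be recovered this way depends on which single-byte encoding was actually assumed (for example Windows-1252 rather than Latin-1) and on whether any bytes were lost along the way, so treat this as a diagnostic aid. The proper fix is still to stop the garbling at the source.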

The fix will likely need to be on the server side.

If you are using PHP, see UTF-8 all the way through for a handy guide.

Pekka