
I have a string that I want to convert to a human-readable format. The string is shown below, and it should decode to σταύρος. I have tried UTF-8 encoding, but it did not work.

σταÏÏος

I have tried many approaches, but it is not clear which encoding was used, so I cannot work out how to convert it to σταύρος.

Laeeq
  • How did you obtain the string? – Håkon Hægland May 05 '20 at 11:40
  • I got this string from a database, but I don't know how it was encoded. When it is loaded on the web page it displays as σταύρος, but when I fetch it from the database and display it myself, it appears in the encoded form. – Laeeq May 05 '20 at 11:47
  • By adding `use Devel::Peek; Dump( $text );` at the point where your script/package receives the value, you can see the exact data you got in the error log. – Georg Mavridis May 05 '20 at 13:07
  • What is the code page of the terminal where the string showed up encoded? What code page/encoding does your web browser use for the page where this string displays properly? (Tip: look at the source of the web page.) The **iso-8859-7** encoding is allocated to Greek; see [Encode::Supported](https://metacpan.org/pod/Encode::Supported) – Polar Bear May 05 '20 at 20:01
  • Please see the following [answer](https://stackoverflow.com/questions/1049728/how-do-i-see-what-character-set-a-mysql-database-table-column-is), which can give a clue about how to verify the character set and collation of the database, tables, and columns. – Polar Bear May 05 '20 at 20:07
  • Checking the source code of the web page can give a clue about the charset it uses -- [HTML Unicode](https://www.w3schools.com/charsets/ref_html_utf8.asp) – Polar Bear May 05 '20 at 20:18
  • How did you get it from the db? Using DBI? If so, what DBD? mysql? If so, did you use `mysql_enable_utf8mb4`? If so, what is the output of `sprintf("%vX", $s)` for the string you get from the db? Did you get `3C3.3C4.3B1.3CD.3C1.3BF.3C2`? If so, you are getting the correct string from the db, and the problem is with how you encode your output. If not, you aren't getting the correct string from the db. – ikegami May 05 '20 at 21:40

1 Answer


Your sample looks like this in my browser: [image: sample text]

When you post a question about how characters are rendered, you should always include an image. The characters may not render the same on other people's computers as they do on yours. They could even get re-encoded by the Stack Overflow server. In this answer I assume that SO is delivering the same bytes that you posted and that I am seeing the same thing that you see.

Your characters are delivered in UTF-8 by the database, but they are being rendered as Windows-1252. The first question is whether Perl knows that it is getting UTF-8 characters. `length $tring` will tell you how many characters Perl thinks it sees. If 7, then Perl knows that the data is in UTF-8 and has decoded it into characters. If 14, then Perl has not been told what it has, so it is just counting bytes. If 12, then the data has already been treated as Windows-1252 (two of your bytes, 0x8D and 0x81, being discarded as invalid characters).
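As a rough illustration of the 14-versus-7 distinction, here is a minimal sketch. It assumes the string holds the 14 UTF-8 bytes of the Greek word (the same bytes that appear as hex further down in this answer); the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The 14 UTF-8 bytes of the Greek word, written as hex escapes so the
# literal does not depend on the encoding of this source file.
my $bytes = "\xcf\x83\xcf\x84\xce\xb1\xcf\x8d\xcf\x81\xce\xbf\xcf\x82";

# Undecoded: Perl only has bytes, so length counts bytes.
my $byte_len = length $bytes;

# Decoded: Perl now holds characters, so length counts characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

print "undecoded length: $byte_len\n";   # 14
print "decoded length:   $char_len\n";   # 7
```

If your own `length` call reports 14, an explicit `decode` like this (or an equivalent setting in whatever delivers the data) is probably the missing step.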

My guess is that you'll get 14, in which case Perl passes the bytes through unchanged and whatever displays the output applies its own default encoding. Are you on a Windows machine? If you get either 12 or 14, then you need to tell Perl that the input data is in UTF-8. If you're reading from a file handle, then you just need to insert the line `binmode FH, ':encoding(UTF-8)'` right after you open the file handle. My guess is that you are using a database API package. If so, then you need to read the documentation for the package to see how to set the encoding.
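For the file-handle case, a small sketch of the idea follows. To keep it self-contained it reads from an in-memory handle instead of a real file; that stand-in and the variable names are assumptions for the example, and a database driver will have its own setting instead.

```perl
use strict;
use warnings;

# Stand-in for a real file: the 14 UTF-8 bytes of the Greek word plus a
# newline, held in a scalar and opened as an in-memory file handle.
my $raw = "\xcf\x83\xcf\x84\xce\xb1\xcf\x8d\xcf\x81\xce\xbf\xcf\x82\n";

open my $fh, '<', \$raw or die "Cannot open in-memory handle: $!";
binmode $fh, ':encoding(UTF-8)';   # decode bytes into characters on read

my $line = <$fh>;
chomp $line;
close $fh;

# length now counts characters rather than bytes.
print length($line), "\n";   # 7
```

With a real file you would open the path instead of `\$raw`; the `binmode` call stays the same.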

If `length $tring` gives 7, then Perl knows what it has, and the problem is on the output. If you want help with that, then you'll need to add details to your question about how you are viewing the output. If you are just printing to the terminal, then try `binmode STDOUT, ':encoding(UTF-8)'` before you start printing.
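A sketch of the output side, assuming you already hold the decoded 7-character string and that the terminal itself expects UTF-8 (the decode step here only exists to build such a string for the example):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Build a decoded, 7-character string from the UTF-8 bytes of the Greek word.
my $bytes = "\xcf\x83\xcf\x84\xce\xb1\xcf\x8d\xcf\x81\xce\xbf\xcf\x82";
my $chars = decode('UTF-8', $bytes);

# Encode characters back to UTF-8 on the way out. Without this layer,
# Perl warns "Wide character in print" for characters above 0xFF.
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";
```

If the terminal is set to a different encoding, such as a Windows code page, the name passed to `:encoding(...)` would need to match that instead.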

If you want to inspect the data as Perl sees it, then use `unpack 'H*', $tring`. You will get either `cf83cf84ceb1cf8dcf81cebfcf82` or `cf83cf84ceb1cfcfcebfcf82`, depending on whether Perl has already discarded the two invalid Windows-1252 bytes.
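The inspection can be sketched as follows. The hex dump applies to an undecoded byte string; for a decoded string, the `sprintf '%vX'` form suggested in the comments on the question shows code points instead. Variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Undecoded byte string: the 14 UTF-8 bytes of the Greek word.
my $bytes = "\xcf\x83\xcf\x84\xce\xb1\xcf\x8d\xcf\x81\xce\xbf\xcf\x82";

# Hex dump of the raw bytes.
my $hex = unpack 'H*', $bytes;

# After decoding, list the code points of the characters.
my $chars       = decode('UTF-8', $bytes);
my $code_points = sprintf '%vX', $chars;

print "$hex\n";           # cf83cf84ceb1cf8dcf81cebfcf82
print "$code_points\n";   # 3C3.3C4.3B1.3CD.3C1.3BF.3C2
```

Seeing the first form means the string is still raw UTF-8 bytes; seeing the second means it has been decoded correctly.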

Arnold Cross