
I have a very strange situation. Our application is built in Perl and MySQL, and we have legacy tables that use the latin1 charset. One of the users has entered Arabic data into one of these tables. When I run a query to fetch the details from this table, the data appears like this:

Ø£Ø®Ù„Ø§Ù‚ÙŠØ§Øª Ø§Ù„Ø£Ø¹Ù…Ø§Ù„ Ø§Ù„Ù…ÙˆØ¶ÙˆØ¹

In the application we have a page where this data is displayed, and there the Arabic characters appear correctly:

أخلاقيات الأعمال الموضوع

We have an ETL (Extract, Transform and Load) process which dumps this data to another DB that uses a star schema. There the data is saved in tables whose charset is UTF-8.

Now when I query this table, the same field appears like this:

أخلاقيات الأعمال الموضوع

In the UI, we have another page which displays the data from the new table. Here the Arabic characters are not appearing correctly. This is how it is displayed in the front end:

Ø£Ø®Ù„Ø§Ù‚ÙŠØ§Øª Ø§Ù„Ø£Ø¹Ù…Ø§Ù„ Ø§Ù„Ù…ÙˆØ¶ÙˆØ¹

The above text is similar or identical to the initial output that I saw when querying the MySQL latin1 table.

How do I get the Arabic characters to display correctly on the page? Please help.

Ash_and_Perl
  • You need to check your ETL process, because it dumps data into the new database in the wrong encoding. That should fix your problem. – KuKeC Oct 21 '15 at 12:54
  • As suggested by @KuKeC, I tried looking into the DB side and found this approach in one of the discussions on this website: [link](http://stackoverflow.com/questions/9407834/mysql-convert-latin1-characters-on-a-utf8-table-into-utf8). What do you think about the solution in that link? – Ash_and_Perl Oct 26 '15 at 14:39

2 Answers


In general, the rule is that you should decode text coming from an external source, and encode text being sent to an external sink. In your case you are fetching what looks like UTF-8-encoded data from a database, so you should be decoding it as such. But it looks as if it has been partially decoded already. Take a look at this program:

use utf8;
use strict;
use warnings;
use feature 'say';

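# $xx holds the text exactly as it comes back from the database (already
# mangled), while $yy starts as the correct Arabic text and is then encoded
# to UTF-8 bytes so the two can be compared code point by code point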
my $xx = 'Ø£Ø®Ù„Ø§Ù‚ÙŠØ§Øª Ø§Ù„Ø£Ø¹Ù…Ø§Ù„ Ø§Ù„Ù…ÙˆØ¶ÙˆØ¹';
my $yy = 'أخلاقيات الأعمال الموضوع';
utf8::encode($yy);

say length $xx;
say join ' ', map { sprintf '%04X', ord } split //, $xx;

say length $yy;
say join ' ', map { sprintf '%04X', ord } split //, $yy;

output

46
00D8 00A3 00D8 00AE 00D9 201E 00D8 00A7 00D9 201A 00D9 0160 00D8 00A7 00D8 00AA 0020 00D8 00A7 00D9 201E 00D8 00A3 00D8 00B9 00D9 2026 00D8 00A7 00D9 201E 0020 00D8 00A7 00D9 201E 00D9 2026 00D9 02C6 00D8 00B6 00D9 02C6 00D8 00B9
46
00D8 00A3 00D8 00AE 00D9 0084 00D8 00A7 00D9 0082 00D9 008A 00D8 00A7 00D8 00AA 0020 00D8 00A7 00D9 0084 00D8 00A3 00D8 00B9 00D9 0085 00D8 00A7 00D9 0084 0020 00D8 00A7 00D9 0084 00D9 0085 00D9 0088 00D8 00B6 00D9 0088 00D8 00B9

This shows the Unicode code points for each character in the string you are fetching directly from the database, and the Arabic text that it should represent encoded in UTF-8 bytes. As you can see, everything matches nicely except for the occasional values between 0x82 and 0x8A, which are replaced in the database text by other wide Unicode characters, like this (a quick check of the mapping is sketched just after the list)

0082 => 201A,
0084 => 201E,
0085 => 2026,
0088 => 02C6,
008A => 0160,
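
For reference, those particular substitutions are what bytes 0x82 to 0x8A mean in Windows-1252 (cp1252), so it looks as though a cp1252-style decode has happened somewhere along the way. Here is a minimal sketch to check that mapping; the byte values are taken from the list above

use strict;
use warnings;
use feature 'say';
use Encode qw(decode);

# Decode each suspect byte as Windows-1252 and print the code point it becomes
for my $byte (0x82, 0x84, 0x85, 0x88, 0x8A) {
    my $char = decode('cp1252', chr $byte);
    say sprintf '%04X => %04X', $byte, ord $char;
}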

It's clear that what you are retrieving from your database is supposed to be UTF-8-encoded text, so it should contain nothing but byte values, and I'm struggling to understand what those wide characters are doing in there.

So the short answer is that you should use Encode and call decode_utf8 on the strings that you fetch from the database. But those wide characters will break that approach, so you need to find out why you're getting them.
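
Here is a minimal sketch of that decode step, assuming the value really does arrive from DBI as a plain string of UTF-8 bytes; the connection details, table and column names are only placeholders

use strict;
use warnings;
use feature 'say';
use DBI;
use Encode qw(decode_utf8);

binmode STDOUT, ':encoding(UTF-8)';

# Placeholder connection, table and column -- adjust for your own schema
my $dbh = DBI->connect('dbi:mysql:database=star_db', 'user', 'password', { RaiseError => 1 });

my $sth = $dbh->prepare('SELECT subject FROM facts');
$sth->execute;

while ( my ($raw) = $sth->fetchrow_array ) {

    # This works only if $raw is nothing but UTF-8 bytes; the wide
    # characters shown above would break it, so check the hex dump first
    my $text = decode_utf8($raw);

    say $text;
}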

I suggest you use the line from my program that dumps a string in hex values, and apply it directly after the string is pulled from the database.

say join ' ', map { sprintf '%04X', ord } split //, $value;

That way we can see what is happening first-hand. As it is, there is a lot of encoding and decoding going on between your program, the Stack Overflow server, and my desktop, and any of those stages may be the culprit.
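
One convenient way to do that is to wrap the dump line in a small helper and call it immediately after the fetch. The helper name and the fetch shown in the comments are hypothetical

use strict;
use warnings;
use feature 'say';

# Print a label plus the code point of every character in $value, so that
# nothing is altered by any later encode or decode step
sub dump_hex {
    my ($label, $value) = @_;
    say $label, ': ', join ' ', map { sprintf '%04X', ord } split //, $value;
}

# Call it straight after the fetch, before anything else touches the value:
#   my ($raw) = $sth->fetchrow_array;
#   dump_hex('raw from legacy table', $raw);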

Borodin
  • `say join ' ', map { sprintf '%04X', ord } split //, $xx;` could be simplified somewhat using `printf "%*v04X\n", " ", $xx;` – Håkon Hægland Oct 21 '15 at 14:55
  • @Borodin Thanks for the detailed explanation. I was searching for another solution where I can handle it. As suggested by KuKeC, I tried looking into the DB side and found this approach in one of the discussions on this website: [link](http://stackoverflow.com/questions/9407834/mysql-convert-latin1-characters-on-a-utf8-table-into-utf8) – Ash_and_Perl Oct 26 '15 at 14:38
  • @Ash_and_Perl: I'm sorry if I was unclear. There are many ways that your data could have got into that state, because there are many steps on the way from your original database insert, through the internet, to my copy and paste from your question. It's possible that a simple `mysql_set_charset('utf8')` would fix things, but pointing the finger at the ETL process is a very narrow choice. I can add that the data that you show coming from your Star database looks like it has been *double-encoded* in UTF-8, as UTF-8 encoding tends to *increase* the number of bytes in a string – Borodin Oct 26 '15 at 14:54
  • @Ash_and_Perl: My suggestion was to start debugging this by displaying the text directly after the fetch from the original database using the code fragment that I showed you. Rendering the contents of the string in hex prevents it from being corrupted by any subsequent encode or decode, or transmission over the internet, so you could publish your findings -- perhaps in a further question -- with certain knowledge that we can see exactly the data you are dealing with – Borodin Oct 26 '15 at 14:57

I'm not sure if Arabic is affected by this, but MySQL's utf8 encoding is not always what one wants. After reading this I always use the utf8mb4 encoding.

If nothing else works, it might be worth trying that on a local copy.
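
If you do try it, here is a rough sketch of converting a table to utf8mb4 from Perl; the connection details and table name are placeholders, and it should be pointed only at a local copy

use strict;
use warnings;
use DBI;

# Placeholder connection details -- point this at a local copy only
my $dbh = DBI->connect('dbi:mysql:database=star_copy', 'user', 'password', { RaiseError => 1 });

# Convert an existing table (hypothetical name) to utf8mb4
$dbh->do('ALTER TABLE facts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci');

$dbh->disconnect;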

Csson