1

I have column name ad_text data type is varchar and collation is utf8_general_ci in my database in which few fields are ASCII and few are UTF-8 which i know by using this function mb_detect_encoding()

enter image description here

now when i try to display before executing this

mysql_query ("set character_set_results='utf8'");

ASCII field convert into utf-8 and display correctly but utf-8 field display like this

(ضرورت لوڈرزکوڑا کرکٹ اٹھان)

and when i don't execute above code ASCII field display as this

(??????? ??????? ????????)

and UTF-8 fields display correctly.

M Talha Afzal
  • 231
  • 1
  • 5
  • 17
  • `utf8_general_ci` is the collation, right? What is the charset? Arabic won't work as ASCII (unless converted to entities). Was this written as ASCII to start with? – chris85 Jan 29 '16 at 13:33
  • Sorry right varchar is datatype and utf8_general_ci is collation. Actually when i imported a file of urdu/arabic text since than after fields showing ASCII encoding – M Talha Afzal Jan 29 '16 at 13:36
  • The column has collation and character set, varchar is the datatype. http://stackoverflow.com/questions/341273/what-does-character-set-and-collation-mean-exactly Ascii is pretty much the latin alphabet, http://www.asciitable.com/index/asciifull.gif, so arabic characters aren't gong to display with it unless you convert them to entities. The entity characters are made up of ascii characters/ – chris85 Jan 29 '16 at 13:38
  • 1
    You shouldn't use mysql_* functions https://stackoverflow.com/questions/12859942/why-shouldnt-i-use-mysql-functions-in-php – Michas Jan 29 '16 at 13:46
  • @chris85 - `COLLATE utf8_general_ci` implies `CHARACTER SET utf8` unambiguously. – Rick James Jan 29 '16 at 22:18
  • @RickJames you can update the collation in phpmyadmin pretty easily without affecting the character set. – chris85 Jan 29 '16 at 22:21
  • Yes, but a `utf8...` collation _cannot_ apply to `ascii` or `latin1` character set, etc. Possibly phpmyadmin changes the character set when you change the collation. Do `SHOW CREATE TABLE` to see what you _really_ have. – Rick James Jan 29 '16 at 22:37

4 Answers4

3

(Further analysis implies that double encoding is the real problem. This Answer discusses the cause and cure of such. The discussion applies to any characters, not just Arabic or the é example. 'utf8' and 'utf8mb4' are interchangeable for this discussion.)

Here's what probably happened.

  • The client had characters encoded as utf8 (good); and
  • SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
  • The column in the table declared CHARACTER SET utf8 (good).

Let's walk through what happens to e-acute: é.

  1. The hex for that, in utf8 is 2 bytes: C3A9.
  2. SET NAMES latin1 saw it as 2 latin1-encoded characters à and © (hex: C3 and A9)
  3. Since the target was CHARACTER SET utf8, those 2 characters needed to be converted. Ã was converted to utf8 (hex C383) and © (hex C2A9)
  4. So, 4 bytes were stored (hex C383 C2A9 for é)

When reading it back out, the reverse steps were performed, and the end user possibly noticed nothing wrong. What is wrong:

  • The data stored is 2 times as big as it should be (3x for Asian languages).
  • Comparisions for equal, greater than, etc may not work as expected.
  • ORDER BY may not work as expected.

The fix (2 parts):

  1. Be sure to do SET NAMES utf8; (or equivalent, such as mysqli_set_charset('utf8')). Keep in mind that init_connect is not executed when you connect as root.

  2. Something like this will repair your data:

    UPDATE tbl SET col = CONVERT(BINARY( CONVERT(col USING latin1)) USING utf8);

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

you can use after retrieving data:

//string mb_convert_encoding ( string $str , string $to_encoding [, mixed $from_encoding = mb_internal_encoding() ] )

$str = mb_convert_encoding($row['ad_text'], 'UTF-8', 'ASCII');

or

$str = mb_convert_encoding($row['ad_text'], 'UTF-8', 'auto');

http://php.net/manual/en/function.mb-convert-encoding.php

here is SUPPORTED encoding http://php.net/manual/en/mbstring.supported-encodings.php

Luca Giardina
  • 478
  • 4
  • 14
0

try this one

iconv('ASCII', 'UTF-8//IGNORE', $str);

http://php.net/manual/en/function.iconv.php

Pratik Bhalodiya
  • 736
  • 7
  • 14
0

ضرورت لوڈ is "Mojibake" for 'ضرورت لوڈ'. It comes from one form of mistake.

?????? comes from a different mistake.

  • I assume the bytes in your client are utf8-encoded. (Good)
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column(s) in the table(s) were probably CHARACTER SET latin1. (It should have been utf8.)
  • The HTML output should include <meta charset=UTF-8>.

Let's check one more thing... Please do SELECT col, HEX(col) FROM tbl WHERE... to pick up that string (or something like it). You should see

D8B6 D8B1 D988 D8B1 D8AA D984 D988 DA88

(without the spaces) for the HEX for 'ضرورت لوڈ'. If you do, then it was correctly stored. If you get this, you have the dreaded "double encoding":

C398 C2B6 C398 C2B1 C399 CB86 C398 C2B1 C398 C2AA C399 E2809E C399 CB86 C39A CB86

Back to the title... "ASCII", at least the 7-bit stuff that includes digits and English letters, is a subset of utf8. So, there is never a need to "convert ascii to utf8".

Bottom line:

  • Don't use mysql_* API; switch to mysqli_* or PDO.
  • Establish utf8 on the connection: mysqli_set_charset('utf8') or equivalent.
  • Make sure the column/table is CHARACTER SET utf8. (See note below.)
  • Check the html for the meta tag.

Note: If the CHARACTER SET is not utf8, I need to know what the HEX is before advising you on how to ALTER the table without screwing things up further.

Rick James
  • 135,179
  • 13
  • 127
  • 222