11

I am working on a page that uses JavaScipt to send data to a PHP script via AJAX POST. The problem is, if the input is in a language that is not Latin based I end-up storing gibberish in the MySQL table. Latin alphabet works fine.

The page itself is capable to rendering UTF-8 characters, if they are in a data provided on page load, it's the post that I struggle with.

اختبار

and save. See the Network POST request in browser's dev tools.

The post is made through the following JS function

function createEmptyStack(stackTitle) {
    return $.ajax({
        type:'POST',
        url:'ajax.php',
        data: {
            "do": 'createEmptyStack',
            newTitle: stackTitle
        },
        dataType: "json"
    });
}

Here's my PHP code.

header('Content-Type: text/html; charset=utf-8');

$newTitle = trim($_POST['newTitle']);

$db->query("
INSERT INTO t1(project_id, label) 
VALUES (".$_SESSION['project_id'].", '".$newTitle."')");

When I check for encoding on the page like this:

mb_detect_encoding($_POST['newTitle'], "auto");

I get result: UTF-8

I also tried the following header:

header("Content-type: application/json; charset=utf-8");

MySQL table collation where the data is supposed to go is set to utf8_general_ci

I have another page that has a form where users populate the same table and it works perfectly fine with ANY language. When I check on the other page why it is capable of inserting similar data into db successfully I see the following above insert query:

mysql_query("SET NAMES utf8");

I've attempted putting the same line above my query that the data still looks gibberish. I also tried the following couple alternatives:

 mysql_query("SET CHARACTER SET utf8 ");

and

mysql_set_charset('utf8', $db);

...but to no avail. I'm stomped. Need help getting it figured out.

Environment:

PHP 5.6.40 (cgi-fcgi)

MySQL 5.6.45


UPDATE

I ran more tests.

I used a phrase "this is a test" in Arabic - هذا اختبار

It seems that ajax.php code works properly. After db insert it returns UTF-8 encoded values, that look like: "\u0647\u0630\u0627 \u0627\u062e\u062a\u0628\u0627\u0631" and the encoding is set to:"UTF-8", however the inserted data in my db table appears as: هذا اختبار

So why am I not jumping to converting my db table to different collation? Couple of reasons: it has nearly .5 mil records and it actually works properly when I go to another page that does very similar INSERT.

Turns out my other page is using ASCII encoding when inserting the data. So it's only natural I try to conver to ASCII on ajax.php. The problem I end-up with blank data. I am so confused now...

Thanks


FIXED: based on a few clues I ended-up rewriting all functions for this page to PDO and it worked!

Community
  • 1
  • 1
santa
  • 12,234
  • 49
  • 155
  • 255
  • Is MBString enabled in your server? – Prabhjot Singh Kainth Nov 19 '19 at 05:40
  • The problem might be your db collaration. – MontrealDevOne Nov 19 '19 at 05:57
  • 3
    Do _not_ use the `mysql_*` interface of php. Switch to PDO or mysqli_* – Rick James Nov 21 '19 at 15:54
  • Please show us the gibberish; it will help in debugging the issue. – Rick James Nov 21 '19 at 15:56
  • @RickJames المراكز – santa Nov 21 '19 at 16:00
  • 1
    **Warning:** You are wide open to [SQL Injections](https://php.net/manual/en/security.database.sql-injection.php) and should use parameterized **prepared statements** instead of manually building your queries. They are provided by [PDO](https://php.net/manual/pdo.prepared-statements.php) or by [MySQLi](https://php.net/manual/mysqli.quickstart.prepared-statements.php). Never trust any kind of input! Even when your queries are executed only by trusted users, [you are still in risk of corrupting your data](http://bobby-tables.com/). [Escaping is not enough!](https://stackoverflow.com/q/5741187) – Dharman Nov 23 '19 at 11:40
  • **Warning:** `mysql_*` extension is deprecated as of PHP 5.5.0, and has been removed as of PHP 7.0.0. Instead, either the [mysqli](https://www.php.net/manual/en/book.mysqli.php) or [PDO_MySQL](https://www.php.net/manual/en/book.pdo.php) extension should be used. See also the [MySQL API Overview](https://www.php.net/manual/en/mysqlinfo.api.choosing.php) for further help while choosing a MySQL API. – Dharman Nov 23 '19 at 11:41
  • If you are still using PHP 5 I strongly recommend to upgrade as soon as possible. This version is no longer supported. [Let Rasmus Lerdorf explain it to you](https://youtu.be/wCZ5TJCBWMg?t=2434) – Dharman Nov 23 '19 at 11:42

5 Answers5

5

المراكز is Mojibake, or possibly "double encoding", for المراكز -- Please do SELECT col, hex(col) ... to see which of these looks like:

Mojibake: D8A7D984D985D8B1D8A7D983D8B2
double encoding: C398C2A7C399E2809EC399E280A6C398C2B1C398C2A7C399C692C398C2B2

If Mojibake:

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.

If double-encoding: This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion.

More discussion:

Trouble with UTF-8 characters; what I see is not what I stored

Do not use the mysql_* interface in PHP; switch to mysqli_* or PDO interfaces. mysql_* was removed in PHP 5.7.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Wow, this is great. There's definitely some wires crossed somewhere. I just tried again. With INSERT as it the stored value looks like: هذا اختبار and if I wrap it in utf8_encode() it is stored as: هذا اختبار and if I try utf8_decode() I get ??? ?????? – santa Nov 22 '19 at 14:29
  • Fixed! I ended-up re-writing entire routine into PDO and it worked. – santa Nov 22 '19 at 15:47
  • `` is hidden in your answer. – Paul Spiegel Nov 22 '19 at 17:24
  • 1
    Please don't use utf8 in mysql, since it will use only 3 byte utf8, while utf8 actually supports up to 4 bytes. This means that you can only correctly handle characters up to 3 bytes. Unicode makes a lot of use of the 4th byte though, eg emojis. ALWAYS USE utf8_mb4 (multibyte 4). Also, make sure to always set the encoding correctly on each!!: table, database, connection, source files and html. – Daidon Nov 23 '19 at 23:31
  • I agree that utf8mb4 is preferred, but with 5.5 and 5.6 (which the OP is using), there are hiccups, such as with the max index size. Meanwhile, Arabic works equally well in either character set. – Rick James Nov 24 '19 at 01:00
  • @santa - that string with the tilde-A's is probably the "double encoding" of the Arabic. That is, it re-messed up the text! – Rick James Nov 24 '19 at 19:58
3

If your database is latin1, it will store unicode characters as multi-byte characters. If it's utf-8 based, it will still store multiple characters but displayed in a more "sensible" manner.

If, your ر character is represented as XYZ (3 bytes), then when you retrieve XYZ, the browser will reassemble them into a visible ر.

However, if your database is utf-8, it'll further encode each component, so that you are "reliably" seeing XYZ in the end. Let's say X is denoted as x1,x2, and Y is just y, and Z is z1,z2,z3, so instead of seeing ر, which is stored as XYZ, you now see x1x2yz1z2z3, which is shown as XYZ.

Try converting your database to latin1 to at least confirm my theory. Thanks.

Edit:

There is no need to use a utf8 js library. Make sure your page's character encoding is utf8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

When you POST the data, you can encode it with encodeURIComponent before sending with a XHR request. I'm not sure whether the jQuery flavor of $.ajax already does the encoding.

Schien
  • 3,855
  • 1
  • 16
  • 29
0

This is what I used to make your code works:

<?php

$db = mysqli_connect("localhost", "root", "", "demo");
$db->set_charset("utf8");

// Check connection
if ($db === false) {
    die("ERROR: Could not connect. " . mysqli_connect_error());
}

$newTitle = trim($_POST['newTitle']);

$db->query("
        INSERT INTO t1(project_id, label) 
        VALUES ('5', '" . $newTitle . "')");

add this tag to your html head:

<meta charset="utf-8">

I tested with latin1_bin and utf8_bin and it worked in both cases.

PHP Version 7.3.9

MySQLi 5.0.12-dev

Community
  • 1
  • 1
Kalimah
  • 11,217
  • 11
  • 43
  • 80
  • **Warning:** You are wide open to [SQL Injections](https://php.net/manual/en/security.database.sql-injection.php) and should use parameterized **prepared statements** instead of manually building your queries. They are provided by [PDO](https://php.net/manual/pdo.prepared-statements.php) or by [MySQLi](https://php.net/manual/mysqli.quickstart.prepared-statements.php). Never trust any kind of input! Even when your queries are executed only by trusted users, [you are still in risk of corrupting your data](http://bobby-tables.com/). [Escaping is not enough!](https://stackoverflow.com/q/5741187) – Dharman Nov 23 '19 at 11:42
  • `utf8` is not recommended. Use `utf8mb4` charset! – Dharman Nov 23 '19 at 11:42
  • Please read: [Should we ever check for mysqli_connect() errors manually?](https://stackoverflow.com/q/58808332/1839439) – Dharman Nov 23 '19 at 11:43
  • Obviously I didn't provide a complete example on how to connect to database. My main aim was to show how to insert data and display it correctly when viewing database. I am aware of SQL injections, PDO and MySQLi. The code was for demonstration purposes only. – Kalimah Nov 23 '19 at 12:34
  • Exactly, you demontrated a piece of code that should never be used. – Dharman Nov 23 '19 at 12:35
  • Questions on Stack Overflow are meant to ask a specific problem and they need a specific solution to that problem. The question was not about security, best practice or any other PHP/MySQL feature. It was mainly about encoding and more specifically for encoding Arabic language. – Kalimah Nov 23 '19 at 12:48
  • utf8mb4 is not available in the antique 5.0; it became available in 5.5. – Rick James Nov 24 '19 at 01:15
  • If you are going to recommend a technique that you know is not best practice, don't add it to your script and/or state very clearly the vulnerability of your script so that unknowing researchers don't blindly trust your code. – mickmackusa Nov 27 '19 at 20:27
0

I ran into a very similar problem about a year ago with a system I was working with that ran MySQL 5.7. It sounds to me like some of your database settings are set to utf8 when they should be set to utf8mb4. This allows the database to handle multi-byte characters properly.

NOTE: The utf8mb4 encoding was added to MySQL in version 5.5.

The following queries can be used to change your encoding. Make sure to change the database, table, and column names, as well as the column datatype to fit your situation:

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name DATATYPE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Further explanation of this can be found here.

Documentation for this on your MySQL version (5.6)

Current, MySQL 8 documentation

derek.wolfe
  • 1,086
  • 6
  • 11
  • See also https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored . Beware of ALTER -- If you use the wrong one; you can make things worse. See http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases – Rick James Nov 24 '19 at 01:14
0

The utf8 topic is a little complicated.

When using UTF8 in MySql, it's important to understand, that MySql's UTF8 only supports 3bytes of data, even though the standard specification allows up to 4bytes. In unicode, you have a lot of characters that actually make use of this 4th byte, such as emojis like this: . With utf8mb4 you can actually fully support and save those in the db without a single problem. UTF8 alone will let you down though.

Just follow those rules, and you should be fine:

  • Make sure all your source files are UTF8 encoded.
  • Make sure you have utf8 as default charset in php.ini:

    default_charset = "utf-8"
    
  • Make sure to use utf-8 charset in the headers of html:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    
  • Make sure to set header charset to UTF8:

    header("Content-type: application/json; charset=utf-8");
    
  • Make sure to set utf8mb4 for MySql connection in PDO:

    $dsn='mysql:host=example.com;dbname=testdb;port=3306;charset=utf8mb4';
    
  • Make sure to create databases in utf8mb4 or convert db if you must:

    CREATE DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    
  • Make sure to create tables in utf8mb4 or convert them if you must:

    CREATE TABLE my_table ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
    
  • VERY IMPORTANT: Make sure to use mb_ string functions in PHP, as normal string functions will only assume single byte data. This means instead of using strlen, which will count each byte, you should use mb_strlen. Also, simple mistakes like accessing a string as array will break your code, since $string[0] will only access the first byte of your string, even though your first character might have 4. Use mb_substr in this case!

For the last one you will need the mbstring extension for php. Also, be aware that some extension require mbstring to have been loaded first, so the order how you load your extension can matter, in case you need to install it.

Also as a side note: please do use PDO and prepared statements. You will find lots and lots of tutorials online. SQL injection is still the biggest vulnarability on the web, and prepared statements is the most effective method to prevent SQL injection!

If you follow the list up there, you won't have problems anymore.

Have fun.

Some reference: https://mathiasbynens.be/notes/mysql-utf8mb4

Daidon
  • 581
  • 1
  • 3
  • 14