0

I am having an issue where I send some text from a java program to a PHP page, which uploads the text to a MySQL database, and I later retrieve the text in the java program from the MySQL database (through another PHP page). The issue is that I think the MySQL database isn't using the same charset (utf-8) as the java program and the PHP page.

When I send the string "This is a super—long dash" to the PHP page, it returns from the MySQL database as "This is a super—long dash".

I know the first PHP page is receiving the correct string, because I use "echo $_POST[xxx];" to make sure.

I also use the line:

"mysql:host=$servername;dbname=$dbname;charset=utf8"

When creating the PDO to connect to the MySQL database.

I have checked the MySQL database and the string there is "This is a superùlong dash" which leads me to believe it must be some problem with the MySQL, not the PHP or Java.

I tried to correct this by using the fixUTF8 function in the solution here: Detect encoding and make everything UTF-8 by Sebastián Grignoli. However, as a result I get "This is a super?long dash".

EDIT:

Here is my PHP code for uploading to MySQL:

<?php

$servername = "**********";
$username = "**********";
$password = "**********";
$dbname = "**********";

try {
    $conn = new PDO("mysql:host=$servername;dbname=$dbname;charset=utf8", $username, $password);
    $conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $stmt = $conn->prepare("INSERT INTO authors (first_name, middle_name, last_name, blurb)
VALUES (?, ?, ?, ?)");

    $stmt->execute(array("$_POST[firstName]", "$_POST[middleName]", "$_POST[lastName]", "$_POST[blurb]"));

    echo "New author created successfully";
}
catch(PDOException $e) {
    echo "Error: " . $e->getMessage();
}

$conn = null;
?>

EDIT2:

Here is what SHOW CREATE TABLE authors provides:

enter image description here

Community
  • 1
  • 1
Infamous911
  • 1,441
  • 2
  • 26
  • 41
  • your database have utf_general_ci encoding? column with string too? You created record from java application or what? Need to be created in utf8 too – Daimos Jul 30 '15 at 07:29

1 Answers1

0

When super—long comes out super—long, you have Mojibake.

This is the classic case of

  • The bytes you have in the client are correctly encoded in utf8 (good).
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)

The fix for the data is a "2-step ALTER".

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.

Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. You can (should) MODIFY all the necessary columns to VARBINARY for a single table in a pair of ALTERs.

The fix for the code is to establish utf8 as the connection; this depends on the api used in PHP. The ALTERs will change the column definition.

How to avoid Junk/garbage characters while reading data from multiple languages? http://mysql.rjweb.org/doc.php/charcoll#how_mangling_happens_on_insert

Community
  • 1
  • 1
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • When I use the ALTER statements on my MySQL database, I check the database again right after (I only have 4 columns) and the weird characters are still there. Also, when I try to add text with the super—long dash from the java through the PHP, it still shows up as the weird characters. Also, I already establish utf8 as the connection in the PHP by adding "charset=utf8", see my original post. – Infamous911 Jul 31 '15 at 12:03
  • Please provide `SELECT col, HEX(col) FROM ... WHERE ...`. The em-dash _should_ be hex `E28094` if it is correctly encoded in utf8. If it is "double encoded" (bad), it will be `C3A2E282ACE2809D`. – Rick James Jul 31 '15 at 19:59
  • It shows the hex as: 546869732069732061207375706572E280946C6F6E672064617368. After using the alter statements it still shows as the same hex. What does this mean? – Infamous911 Aug 01 '15 at 13:39
  • Nevermind, that was with a full sentence. This time I put as an entry just the em-dash. When I use the `SELECT col, HEX(col) FROM ... WHERE ...` command, it shows: `ù | E28094`. So it is the wrong character but the correct hex. What does this mean? – Infamous911 Aug 01 '15 at 13:47
  • Interesting... I do not see how that hex could display as `ù`. But it pretty clearly says that the data was stored correctly, then not fetched and/or displayed correctly. (Usually the problem is on the `INSERT` side.) More after I scratch my head. – Rick James Aug 01 '15 at 14:39
  • Check the issues in [Best Practice](http://mysql.rjweb.org/doc.php/charcoll#best_practice), especially the `meta` tag and other "output" issues. – Rick James Aug 01 '15 at 14:43
  • A _guess_ is that something other than utf8 or latin1 is somehow being specified during the output, and that leads to the `ù`. I've checked the obvious mistakes using those character sets. – Rick James Aug 01 '15 at 14:47
  • Hmm, the only thing that the page you linked seems to suggest is to use SET NAMES. I tried using this to no avail. Anyway, I believe using "charset=utf8" when creating the PDO should do the same thing in recent versions of PHP, as I am using. See my edited post for my full PHP script for adding to the MySQL database. I don't see what I could change or what I am doing wrong. – Infamous911 Aug 01 '15 at 22:17
  • Do you have another example? Maybe with more than one test case, I can find a pattern. – Rick James Aug 01 '15 at 23:36
  • Alright so I went through a couple utf8 characters from this site: http://www.fileformat.info/info/charset/UTF-8/list.htm. And I found that most things seem to turn into a ? in the MySQL database. These include the characters that have hex: C482, C483, C7AF, CBAD, E1B481, E1B881, E1BA9E, E1BFBC, E28096. Two characters I found that didn't appear as ? were hex: E280A0 (appeared as: å) and E280A1 (appeared as: ç). In all cases the hex was correct in the database but the character shown was either ? or a character that doesn't correspond to the uft8 hex. – Infamous911 Aug 02 '15 at 13:56
  • hmmm... I tried the double-dagger (`E280A1`) with 36 charsets with `CONVERT(UNHEX('E280A1') USING %s)`; none came out with `ç`. Nor do any conversions _start with_ `ç`, which could imply a truncation. I tried some other things, too; again with no good clues. Working the other way, I found no way to turn `ç` into `‡`. – Rick James Aug 02 '15 at 18:03
  • Could you edit the question to add some code for the "output" side, such as the `SELECT` and the ``. – Rick James Aug 02 '15 at 18:03
  • And the table definition (`SHOW CREATE TABLE`). (Yes, I am fishing for straws.) – Rick James Aug 02 '15 at 18:06
  • This problem is so weird! I'm not even using the output side PHP code anymore. I am just inputting the data and then viewing the data in mysql through command prompt with `SELECT * FROM ...`. I used `SHOW CREATE TABLE ...` and saw that the default charset was latin1, so I changed it to be utf8 and then tried the tests again and still had the same results (probably because I had done the alter statement earlier to change the appropriate column to utf8). I edited the OP to add a screenshot of the `SHOW CREATE TABLE ...` statement now. – Infamous911 Aug 03 '15 at 14:38
  • Nothing strange in the `CREATE TABLE`. To help with debugging (from PHP), use `$hex = unpack('H*', $text); echo implode('', $hex);` to display hex. – Rick James Aug 03 '15 at 17:06
  • Using this, the php echos the correct hexes for the various characters. It must be something with the database but I can't figure out what. – Infamous911 Aug 06 '15 at 22:33