For a few days now I've been looking for a way to display UTF-8 on my webpage. The character currently causing trouble is į (Unicode: \u012f, decimal: 303). However, there are over 10,000 records in my database and I cannot guarantee that all the others display correctly, so I'm looking for a solution that covers all characters.

The į is displaying as a ? in the HTML.

My setup is an HTML page that uses AJAX to send a request to a PHP file. The PHP queries a MySQL database to find a specific entry, takes a Lithuanian word from that entry, and echoes it as the response to the AJAX request. Back in the JavaScript, the response is set as the innerHTML of an HTML element. This setup does not use jQuery.

Below is my progress on attempting to fix the issue.

First, I verified that all the files I was working with are encoded as UTF-8 without a BOM.

Then I opened the MySQL database in phpMyAdmin to view the entries. Seeing characters replaced with ? in the entries, I did some research and found the database had the wrong collation. Changing the collation to utf8_general_ci for the database and table changed nothing, so I looked into it further and found that changing it for the individual columns of the table was another option. This worked, and phpMyAdmin now displays the characters correctly.

Next, the character š (Unicode: \u0161, decimal: 353) would not display on my webpage. I fixed this with the following PHP code, which I found on Stack Overflow.

function encode_string($string){ 
    $encoded = ""; 
    for ($n=0;$n<strlen($string);$n++){ 
        $check = htmlentities($string[$n],ENT_QUOTES); 
       $string[$n] == $check ? $encoded .= "&#".ord($string[$n]).";" : $encoded .= $check; 
    } 
    return $encoded; 
} 

I can't say I completely understand this code, but it caused the character š to display correctly when it reached my HTML. However, it did not work for the character į.

I have also tried $conn->set_charset('utf8'); to make the connection use UTF-8; however, this resulted in į being displayed as Ä¯ instead. I got the same result with $conn->query("SET NAMES UTF8;");

I have found that hardcoding the į into the JavaScript or PHP allows it to be sent back and displayed correctly; for example, echo "į"; works. So I believe the issue is in the database or in the PHP before the echo. However, I don't have the knowledge to identify the problem.

Here is my PHP code:

<?php
header('Content-Type: text/html charset=utf-8');
//Connection to database is made. Referred to as $conn

$sql = "SELECT * FROM Words";
$result = $conn->query($sql);

if ($result->num_rows > 0) {

    //Loop through the results to find a word with the status of 1
    while($row = $result->fetch_assoc()) {

        $status = $row["status"];

        if($status == 1){
            //respond to AJAX with the word

            $ltword = trim($row["lt"]);


            echo utf8_encode(encode_string($ltword));
            //Has also been tested as 
            //echo encode_string($ltword);
            //with no noticeable difference.


            break;
        }
    }

}


function encode_string($string){ 
    $encoded = ""; 
    for ($n=0;$n<strlen($string);$n++){ 
        $check = htmlentities($string[$n],ENT_QUOTES); 
       $string[$n] == $check ? $encoded .= "&#".ord($string[$n]).";" : $encoded .= $check; 
    } 
    return $encoded; 
}

?>

At its core, my question is: given my current setup, how do I get a UTF-8 encoded character from my database to display correctly on my webpage?

EDIT: PHP's mb_check_encoding() function confirms that the data received from the database is valid UTF-8.

php.ini is using utf8 as its default charset.

Using $conn->character_set_name(); returns latin1. Using $conn->set_charset("utf8"); causes it to return utf8; however, į is then displayed as Ä¯, which is still incorrect.

  • To the best of my knowledge, I believe I've used the methods described in answers to that question, none of which worked for me. – Thomas McSherry Sep 12 '16 at 11:49
  • Encoding `į` as UTF-8 gives you a 2-byte string, and `Ä¯` is what you get when you display those bytes as windows-1252. That happens when your HTML pages are served with the wrong encoding header. Check the **Output** part of the linked answer. – roeland Sep 12 '16 at 22:06
  • Do not use any encode functions. Do read [_this_](http://stackoverflow.com/a/38363567/1766831) to see what might be going wrong. (You have several possibilities.) – Rick James Sep 12 '16 at 22:15
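
The point in roeland's comment can be checked directly. The following is a minimal sketch, assuming the mbstring extension is available; it reinterprets the two UTF-8 bytes of į as windows-1252 to reproduce the garbled output. The variable names are only illustrative. Note also that the header in the question has no semicolon between `text/html` and `charset=utf-8`; the last lines show the usual form of that header.

```php
<?php
// UTF-8 encoding of "į" (U+012F) is the two bytes 0xC4 0xAF.
$utf8 = "\xC4\xAF";
echo bin2hex($utf8), "\n";

// Treat those same two bytes as windows-1252 text and convert them to UTF-8.
// This mimics a browser that decodes the response with the wrong charset.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo $garbled, "\n";

// Usual form of the header: media type, semicolon, then the charset parameter.
$contentType = 'Content-Type: text/html; charset=utf-8';
// In the real script this would be sent before any output:
// header($contentType);
```

If the second line printed matches what the page shows, the bytes leaving PHP are probably already valid UTF-8 and only the declared charset of the response is wrong.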

3 Answers

If you're using mysqli, you can call set_charset():

$mysqli->set_charset('utf8mb4');       // object oriented style
mysqli_set_charset($link, 'utf8mb4');  // procedural style
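
As a rough illustration of why the connection charset matters: with a latin1 connection (MySQL's latin1 is effectively windows-1252), the server has to convert each character to that charset before sending it. The sketch below only imitates that conversion with mbstring, which is an assumption about what happens on the wire rather than a trace of MySQL itself. It shows that š has a windows-1252 byte while į does not and falls back to a question mark.

```php
<?php
// Requires the mbstring extension.
$sCaron  = "\xC5\xA1"; // "š" (U+0161) in UTF-8
$iOgonek = "\xC4\xAF"; // "į" (U+012F) in UTF-8

// Convert each character to windows-1252, as a latin1 connection would need to.
$sInLatin1 = mb_convert_encoding($sCaron, 'Windows-1252', 'UTF-8');
$iInLatin1 = mb_convert_encoding($iOgonek, 'Windows-1252', 'UTF-8');

echo bin2hex($sInLatin1), "\n"; // single byte: š exists in windows-1252
echo $iInLatin1, "\n";          // substitute character: į has no windows-1252 byte
```

This would be consistent with į arriving as ? while the connection reports latin1, and it is why setting the connection charset comes before any other fix.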
Yogesh Singasane

In your case the problem was the collation, which you changed later. As good practice, set the table collation and the column collation to the same value, e.g. utf8_unicode_ci (utf8_general_ci is faster, but utf8_unicode_ci is much better for sorting/display).

Coming back to the problem: it lies with the data that had already been added, which was stored incorrectly because of the wrong collation. You will need to inspect and repair that data, as you can't be sure it was stored properly.

iSensical
  • Thanks for the response, I tested this just now using a new database and table, using the utf8_unicode_ci collation and then again with utf8_general_ci collation. However it did not make a difference. – Thomas McSherry Sep 12 '16 at 11:46
If you have UTF8 end to end (db > connection > php) you should not have to echo utf8_encode. Just echo the variable and it should display correctly.

Most likely, the character is messed up in the database because it's still in the original encoding. Try updating the contents of the database with native UTF-8 characters now that the collation has been fixed, and it should work.

So most likely you will need the $conn->set_charset('utf8') too.
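
A minimal sketch of that end-to-end setup, under a few assumptions: `$conn` is an already open mysqli connection, the table and column names (`Words`, `status`, `lt`) are taken from the question, and `find_active_word()` and `html_escape_utf8()` are hypothetical helper names. It uses `utf8mb4` as in the other answer, and filters in SQL instead of looping in PHP, which is a simplification of the original code.

```php
<?php
/**
 * Hypothetical helper: returns the first word with status = 1, or null.
 * Assumes $conn is an open mysqli connection to the database from the question.
 */
function find_active_word(mysqli $conn): ?string
{
    // Make the connection use UTF-8 instead of the latin1 default.
    if (!$conn->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }

    $result = $conn->query('SELECT lt FROM Words WHERE status = 1 LIMIT 1');
    if ($result === false) {
        throw new RuntimeException('Query failed: ' . $conn->error);
    }

    $row = $result->fetch_assoc();
    return $row ? trim((string) $row['lt']) : null;
}

/** Escapes HTML metacharacters only; non-ASCII characters pass through as-is. */
function html_escape_utf8(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Usage in the AJAX endpoint (assuming $conn was created earlier):
// header('Content-Type: text/html; charset=utf-8');
// echo html_escape_utf8(find_active_word($conn) ?? '');
```

No utf8_encode() or per-byte entity encoding is involved here; the word is echoed as the UTF-8 bytes the connection returns, and only characters such as `&` and `<` are escaped.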

Michael
  • I tested this just now using a new database and table, using the utf8_unicode_ci collation and then again with utf8_general_ci collation. However neither made a difference. I have already attempted `$conn->set_charset('utf8');` as I mentioned in my question, it returned a different character from the ? however, still not the correct character. – Thomas McSherry Sep 12 '16 at 11:52
  • Did you use new, clean native UTF 8 for the input into the new database? – Michael Sep 13 '16 at 18:50