1

I am having an issue trying to update my database, I believe related to a charset / collation issue. I've searched all the other related issues, updated collations and charsets, tried everything, and nothing seems to work.

What I'm doing is getting data from scraping some HTML (with permission from the site owner), manipulating it a bit, and then doing an UPDATE to save the manipulated data in my table.

I have a field, reference, that is taken from the HTML, and the update looks for that field, and updates my table if the field matches. If there are no special (non-english) characters, it works fine:

UPDATE database.table SET points = 100 WHERE reference = 'Real Madrid'

If there are any non-english characters in the reference, then the update doesn't work IF I do it from my PHP / HTML site - if I put the query below directly into phpmyadmin, it works fine:

UPDATE database.table SET points = 100 WHERE reference = 'Atlético Madrid'

This happens with every non-english character I've tried, not just é, so that seems to be the root issue.

The HTML I ingest is initially UTF-8, but at some point, it seems the encoding of my text is getting changed from straight UTF-8 to ASCII. Is ASCII not a subset of UTF-8? Not entirely sure if that's the problem, but the encoding is different, which is odd.

Below is my code, with encoding pointed out at different times:

$html = file_get_html('http://url.to.scrape');

// At this point, `mb_detect_encoding($html)` is UTF-8.

    $i = 1;

    while($i <= 20){
    foreach($html->find('tr') as $tableRow) {
      // At this point, `mb_detect_encoding($tableRow) is `ASCII`
      $rowData['team'] = $tableRow->find('td', 0)->plaintext;
      // At this point, `mb_detect_encoding($rowData['team']) is `ASCII`
      $rowData['points'] = $tableRow->find('td', 1)->plaintext;

      $points = $rowData['points'] * doSomeManipulationHere();

      $update_query = "UPDATE database.table SET points = $points WHERE reference = '". $rowData['team'] ."'";
      print_r($update_query);
    }
  }

As mentioned, if $rowData['team'] does not contain non-english characters, it works. If it does contain any, it doesn't.

Again, as mentioned, If I print_r($update_query), and I copy / paste the output directly into phpmyadmin in the SQL tab, it works as expected, even with the é character, so that makes me believe that MySQL charset / collation is set up correctly, and it's somewhere in the PHP / HTML / MySQL connection that's causing the issue.

I guess I need to figure out why my data is suddenly ASCII when is started out as UTF-8.

My setup:
MySQL Server connection collation: utf8mb4_unicode_ci
MySQL Table collation: utf8mb4_unicode_ci
MySQL Field collation: utf8mb4_unicode_ci
PHP default Charset: UTF-8
HTML: <meta charset="utf-8"> .htaccess / charset.conf: AddDefaultCharset UTF-8 (edit: added after originally posted, thanks for the suggestion @asiri)

I've tried sending header('Content-Type: text/html; charset=utf-8');, which didn't help.

I am also seeing the dreaded black question mark when I view those characters on the site, so it's gotta be the encoding somewhere, I just don't know where.

Erik
  • 227
  • 2
  • 8
  • 19

1 Answers1

2

You can try setting up encoding in your .htaccess file.

AddDefaultCharset utf-8

Add this line into the .htaccess file inside the root directory where you have placed your code.

or

try this. The below code will set the encoding into your database connection.

$link = mysqli_connect('localhost', 'user', 'password', 'database');
mysql_set_charset('utf8',$link); 
Asiri Hewage
  • 577
  • 3
  • 16
  • 1
    I didn't mention that in my original post, but I have already done that, not in `.htaccess`, but in my `apache2/conf-enabled/charset.conf` config. I had tried it in the individual `.htaccess` as well, didn't work, figured I'd put it on the server overall to see if that worked. – Erik Jul 08 '20 at 20:29
  • My next suggestion is to set the charset in your database connection. I will edit the answer accordingly. – Asiri Hewage Jul 09 '20 at 04:40
  • 1
    Thanks Asiri, unfortunately this also did not work. I feel like I've tried every suggestion in every single one of the `UTF-8` question threads, and nothing seems to do the trick. It seems like everything is UTF-8 except for when it's time to do the actual query in my PHP code. – Erik Jul 09 '20 at 07:01