I am having an issue trying to update my database, I believe related to a charset / collation issue. I've searched all the other related issues, updated collations and charsets, tried everything, and nothing seems to work.
What I'm doing is getting data from scraping some HTML (with permission from the site owner), manipulating it a bit, and then doing an UPDATE
to save the manipulated data in my table.
I have a field, reference
, that is taken from the HTML, and the update looks for that field, and updates my table if the field matches. If there are no special (non-english) characters, it works fine:
UPDATE database.table SET points = 100 WHERE reference = 'Real Madrid'
If there are any non-english characters in the reference
, then the update doesn't work IF I do it from my PHP / HTML site - if I put the query below directly into phpmyadmin
, it works fine:
UPDATE database.table SET points = 100 WHERE reference = 'Atlético Madrid'
This happens with every non-english character I've tried, not just é
, so that seems to be the root issue.
The HTML I ingest is initially UTF-8
, but at some point, it seems the encoding of my text is getting changed from straight UTF-8
to ASCII
. Is ASCII
not a subset of UTF-8
? Not entirely sure if that's the problem, but the encoding is different, which is odd.
Below is my code, with encoding pointed out at different times:
$html = file_get_html('http://url.to.scrape');
// At this point, `mb_detect_encoding($html)` is UTF-8.
$i = 1;
while($i <= 20){
foreach($html->find('tr') as $tableRow) {
// At this point, `mb_detect_encoding($tableRow) is `ASCII`
$rowData['team'] = $tableRow->find('td', 0)->plaintext;
// At this point, `mb_detect_encoding($rowData['team']) is `ASCII`
$rowData['points'] = $tableRow->find('td', 1)->plaintext;
$points = $rowData['points'] * doSomeManipulationHere();
$update_query = "UPDATE database.table SET points = $points WHERE reference = '". $rowData['team'] ."'";
print_r($update_query);
}
}
As mentioned, if $rowData['team']
does not contain non-english characters, it works. If it does contain any, it doesn't.
Again, as mentioned, If I print_r($update_query)
, and I copy / paste the output directly into phpmyadmin
in the SQL
tab, it works as expected, even with the é character, so that makes me believe that MySQL charset / collation is set up correctly, and it's somewhere in the PHP / HTML / MySQL connection that's causing the issue.
I guess I need to figure out why my data is suddenly ASCII
when is started out as UTF-8
.
My setup:
MySQL Server connection collation: utf8mb4_unicode_ci
MySQL Table collation: utf8mb4_unicode_ci
MySQL Field collation: utf8mb4_unicode_ci
PHP default Charset: UTF-8
HTML: <meta charset="utf-8">
.htaccess
/ charset.conf
: AddDefaultCharset UTF-8
(edit: added after originally posted, thanks for the suggestion @asiri)
I've tried sending header('Content-Type: text/html; charset=utf-8');
, which didn't help.
I am also seeing the dreaded black question mark �
when I view those characters on the site, so it's gotta be the encoding somewhere, I just don't know where.