2

I need to save the user agent of a visiting user.

This is my snippet:

// User Agent
$ua = $_SERVER['HTTP_USER_AGENT']??'';
$ua_md5 = md5($ua);
// Search if the UA already exists in the user_agents table
$ua_id = $db->query("SELECT id FROM user_agents WHERE md5='".$ua_md5."';")->fetchColumn();

if(!$ua_id) {
    // If it doesn't exist, insert it and get its id
    $db->query("INSERT INTO user_agents (md5, user_agent) VALUES ('$ua_md5', ".$db->quote($ua).")");
    $ua_id = $db->lastInsertId();
}

I use PDO:quote instead of prepared statements only for performance reasons (it's faster and this script runs thousands of times per second).

It happens that some users has this user agent:

Mozilla/5.0 (Linux; Android 5.0; \xD6wn Smart Build/LRX21M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile Safari/537.36

And the insert fails for this error:

"PHP message: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xD6wn Sm...' for column 'user_agent' at row 1" while reading response header from upstream

What is the reason and how can it be fixed?


Edit: more debugging found out the $ua value is:

Mozilla/5.0 (Linux; Android 5.0; ÖWN S1 Build/LRX21M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile Safari/537.36

Sofia Grillo
  • 405
  • 2
  • 13

1 Answers1

3

First of all, when it comes to encoding issues it's always necessary to inspect raw bytes and never rely of the output of some process that renders or interpret the input. In the case of PHP, var_dump() is always a good starting point but you also need to dump to hex or even resort to an hexadecimal editor:

<?php
var_dump(bin2hex($_SERVER['HTTP_USER_AGENT']));

My educated guess so far (and I don't believe it's too far from truth) is:

  1. Some Android browser is sending an HTTP header that includes Öwn Smart Build encoded in ISO-8859-1, where Ö aka 'LATIN CAPITAL LETTER O WITH DIAERESIS' (U+00D6) is encoded as D6.

  2. Your app stack is configured for UTF-8 (the sensible choice), where Ö would be encoded as C396.

  3. PHP doesn't know/care, because PHP strings aren't encoding-aware (they're just byte streams).

  4. MySQL is handled D6, being told it's UTF-8 (but it isn't).

  5. Had D6 been a valid UTF-8 (different) character, or part of a (different) multi-byte sequence, insertion would have completed with minor data loss (the original Öwn text would have been lost and replaced by something else). For good or bad, it isn't valid UTF-8, thus MySQL aborts the insertion with the error message you describe.

Why can't MySQL cope with it? Let's check the UTF-8 definition:

Nr of
Bytes    Byte 1   Byte 2       Byte 3     Byte 4
1      0xxxxxxx       
2      110xxxxx   10xxxxxx     
3      1110xxxx   10xxxxxx   10xxxxxx   
4      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Original Latin-1 Öw text is encoded as D6 77 which, translated to binary, is:

11010110 01001101
^^^      ^^

In UTF-8, 110… means "Start of 2-byte character". Second byte should then start with 10… but we have 01… instead. Oops!

How can you solve this? It's trickier than it seems. If you could know for sure that input is ISO-8859-1, it's just a straightforward conversion:

<?php
$input = "\xD6wn";
$output = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($input), bin2hex($output));
string(6) "d6776e"
string(8) "c396776e"

(Online demo)

But, how can you know? I'm not sure if the User-Agent header allows MIME Encoded-Words and, even there, the browser may just send invalid data anyway. Perhaps you can catch the error (MySQL error code 1366 aka ER_TRUNCATED_WRONG_VALUE_FOR_FIELD looks reasonably precise) and try again assuming ISO-8859-1. And it's probably also a good idea to verify that input is valid UTF-8, though it can make processing kind of bothersome:

<?php
$latin1 = "\xD6wn";
$utf8 = "\xc3\x96wn";
var_dump(mb_check_encoding($latin1, 'UTF-8'));
var_dump(mb_check_encoding($utf8, 'UTF-8'));
bool(false)
bool(true)

(Online demo)

Álvaro González
  • 142,137
  • 41
  • 261
  • 360
  • Thanks, this is the output of bin2hex: 4d6f7a696c6c612f352e3020284c696e75783b20416e64726f696420352e303b20d6574e205331204275696c642f4c525832314d29204170706c655765624b69742f3533372e333620284b48544d4c2c206c696b65204765636b6f292056657273696f6e2f342e30204368726f6d652f33372e302e302e30204d6f62696c65205361666172692f3533372e3336 – Sofia Grillo Dec 06 '17 at 14:55
  • From what I'm reading the HTTP specification says the headers default character set is ISO-8859-1. Do you think I should change that field to latin1? – Sofia Grillo Dec 06 '17 at 15:17
  • ...I can't because I have a lot of user agents in utf8 like `Mozilla/5.0 (Linux; U; Android 4.1.2; ar-ye; SPH-D710 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 الجوال Safari/534.30` – Sofia Grillo Dec 06 '17 at 15:27
  • If specs say so (again, I'm not particularly familiar with them) then it *should* be safe to convert from ISO-8859-1 to UTF-8 (the `mb_convert_encoding()` part of my examples) because UTF-8 can store the complete Unicode catalogue. But, of course, it's useless if you're effectively getting mixed encodings in the wild. Perhaps you're doomed to checking for valid UTF-8 with `mb_check_encoding()` and assuming Latin-1 otherwise. BTW, I can't decode your hex string with `hex2bin()`, it has several non-printable characters scattered here and there :-? – Álvaro González Dec 06 '17 at 16:12
  • using tools like this https://www.rapidtables.com/convert/number/hex-to-ascii.html show you were right, it's `ÖWN` – Sofia Grillo Dec 07 '17 at 10:36
  • 1
    That page claims to convert to ASCII but ASCII is a 7-bit encoding that cannot store characters like `Ö`. Unfortunately, there's a lot of misunderstanding out there about how computers store texts. – Álvaro González Dec 07 '17 at 10:49
  • `d6` is the 2-digit hex, then it does in javascript `String.fromCharCode(parseInt('d6', 16))` = `Ö` – Sofia Grillo Dec 07 '17 at 12:13
  • 1
    That's like that because [String.fromCharCode](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode) has a badly designed but clear spec: it expects UCS-2, where `Ö` is presented as `D6 00` or `00 D6` (depending on the [endianness](https://en.wikipedia.org/wiki/Endianness)) and, by pure chance, the missing byte is a zero that gets implicitly assumed. – Álvaro González Dec 07 '17 at 13:12