General error: 1366 Incorrect string value: '\xD6wn Sm...' for column 'user_agent'

Question

I need to save the user agent of a visiting user.

This is my snippet:

// User Agent
$ua = $_SERVER['HTTP_USER_AGENT']??'';
$ua_md5 = md5($ua);
// Search if the UA already exists in the user_agents table
$ua_id = $db->query("SELECT id FROM user_agents WHERE md5='".$ua_md5."';")->fetchColumn();

if(!$ua_id) {
    // If it doesn't exist, insert it and get its id
    $db->query("INSERT INTO user_agents (md5, user_agent) VALUES ('$ua_md5', ".$db->quote($ua).")");
    $ua_id = $db->lastInsertId();
}

I use PDO:quote instead of prepared statements only for performance reasons (it's faster and this script runs thousands of times per second).

It happens that some users has this user agent:

Mozilla/5.0 (Linux; Android 5.0; \xD6wn Smart Build/LRX21M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile Safari/537.36

And the insert fails for this error:

"PHP message: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xD6wn Sm...' for column 'user_agent' at row 1" while reading response header from upstream

What is the reason and how can it be fixed?

Edit: more debugging found out the $ua value is:

Mozilla/5.0 (Linux; Android 5.0; ÖWN S1 Build/LRX21M) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/37.0.0.0 Mobile Safari/537.36

Your table is in the wrong character encoding it sounds like. Also using parameterized queries is a better approach. — chris85, Dec 05 '17 at 16:03
I think it's because you didn't wrap the string in quotes. Try wrapping it with single quotes like you did with $ua_md5 — Cave Johnson, Dec 05 '17 at 16:05
`echo "INSERT INTO ... VALUES ('$ua_md5', ".$db->quote($ua).")"` – what do you get? — deceze, Dec 05 '17 at 16:05
You're already using PDO, why not make the extra step and move to prepared statements? — Alon Eitan, Dec 05 '17 at 16:05
@Kodos `PDO::quote` *adds quotes* if required. It's in the name. — deceze, Dec 05 '17 at 16:06
That is not the poing, `\xD6wn` should be escaped into `\\xD6wn`, the encoding is correct — Sofia Grillo, Dec 05 '17 at 16:07
@Sofia Have you really confirmed that it is *not* escaped? Relying on the database's error message is viewing it through too many filters. — deceze, Dec 05 '17 at 16:08
\xD6wn is what is just shown to you in the error message. there are no slashes in your data your problem is encoding — Your Common Sense, Dec 05 '17 at 16:09
Lol who closed the question doesn't understand what the problem is... This string is using only plain latin1 characters, but the function wrongly thinks I'm trying to insert a multibyte character when I'm not — Sofia Grillo, Dec 05 '17 at 16:11
In PHP, "\xD6" is a escape sequence so you don't have a literal backslash anywhere. You have a D6 byte that, in UTF-8, is not a valid character—by pure chance, MySQL converts it to hexadecimal in order to display it.. — Álvaro González, Dec 05 '17 at 16:12
`\xD6wn` is not a multibyte character, it's a string composed of 6 different characters — Sofia Grillo, Dec 05 '17 at 16:13
Not if you type it into a PHP string literal in double quotes: `$str = "\xD6";` → single byte `D6`. — deceze, Dec 05 '17 at 16:13
I don't type it, I take it from `$ua = $_SERVER['HTTP_USER_AGENT']??'';`. How can it be fixed? Prepared statements will do? — Sofia Grillo, Dec 05 '17 at 16:15
OK, that's different than your sample in the question then. Then let's see some debugging, preferably a `var_dump` of the query you're trying to execute with the supposedly quoted value. — deceze, Dec 05 '17 at 16:18
The sooner you will face the truth and check the duplicated question, the sooner you will solve your encoding problem. There is nothing to discuss here — Your Common Sense, Dec 05 '17 at 16:29
The sooner you cough up a `var_dump` the sooner we can help you. — deceze, Dec 05 '17 at 16:29
BTW, if you still want us to help you, [How to create a Minimal, Complete, and Verifiable example for a PDO related problem.](https://phpdelusions.net/pdo/mcve) — Your Common Sense, Dec 05 '17 at 16:32
@deceze for the encoding related problem in the article above I recommend to use `bin2hex()` instead of `var_dump()` as it will reveal all the hidden and mimicking characters — Your Common Sense, Dec 05 '17 at 16:38
If I use the chrome dev tools to set that custom user agent, it works, but in the nginx access log it's replaced to " \x5CxD6WN "... — Sofia Grillo, Dec 05 '17 at 16:43
probably that user is sending a malformed user agent value? There are many of them, from different ips... weird — Sofia Grillo, Dec 05 '17 at 16:46
I couldn't var_dump since i'd not seen the output, but I could `error_log($ua)`, and it shows this: https://i.imgur.com/JIojnDe.png — Sofia Grillo, Dec 05 '17 at 16:54
Malformed or not malformed - why not to just store it tn the database? — Your Common Sense, Dec 05 '17 at 17:11
I don't know, everything is utf8mb4, the connection, the table, the field... This thing can't be stored in unicode I guess — Sofia Grillo, Dec 05 '17 at 17:22
Ah wait. May be it's indeed a malformed utf. Well you can try to fix the string using iconv — Your Common Sense, Dec 05 '17 at 17:35

score 3 · Accepted Answer · answered Dec 06 '17 at 10:32

First of all, when it comes to encoding issues it's always necessary to inspect raw bytes and never rely of the output of some process that renders or interpret the input. In the case of PHP, var_dump() is always a good starting point but you also need to dump to hex or even resort to an hexadecimal editor:

<?php
var_dump(bin2hex($_SERVER['HTTP_USER_AGENT']));

My educated guess so far (and I don't believe it's too far from truth) is:

Some Android browser is sending an HTTP header that includes Öwn Smart Build encoded in ISO-8859-1, where Ö aka 'LATIN CAPITAL LETTER O WITH DIAERESIS' (U+00D6) is encoded as D6.
Your app stack is configured for UTF-8 (the sensible choice), where Ö would be encoded as C396.
PHP doesn't know/care, because PHP strings aren't encoding-aware (they're just byte streams).
MySQL is handled D6, being told it's UTF-8 (but it isn't).
Had D6 been a valid UTF-8 (different) character, or part of a (different) multi-byte sequence, insertion would have completed with minor data loss (the original Öwn text would have been lost and replaced by something else). For good or bad, it isn't valid UTF-8, thus MySQL aborts the insertion with the error message you describe.

Why can't MySQL cope with it? Let's check the UTF-8 definition:

Nr of
Bytes    Byte 1   Byte 2       Byte 3     Byte 4
1      0xxxxxxx       
2      110xxxxx   10xxxxxx     
3      1110xxxx   10xxxxxx   10xxxxxx   
4      11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

Original Latin-1 Öw text is encoded as D6 77 which, translated to binary, is:

11010110 01001101
^^^      ^^

In UTF-8, 110… means "Start of 2-byte character". Second byte should then start with 10… but we have 01… instead. Oops!

How can you solve this? It's trickier than it seems. If you could know for sure that input is ISO-8859-1, it's just a straightforward conversion:

<?php
$input = "\xD6wn";
$output = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($input), bin2hex($output));

string(6) "d6776e"
string(8) "c396776e"

(Online demo)

But, how can you know? I'm not sure if the User-Agent header allows MIME Encoded-Words and, even there, the browser may just send invalid data anyway. Perhaps you can catch the error (MySQL error code 1366 aka ER_TRUNCATED_WRONG_VALUE_FOR_FIELD looks reasonably precise) and try again assuming ISO-8859-1. And it's probably also a good idea to verify that input is valid UTF-8, though it can make processing kind of bothersome:

<?php
$latin1 = "\xD6wn";
$utf8 = "\xc3\x96wn";
var_dump(mb_check_encoding($latin1, 'UTF-8'));
var_dump(mb_check_encoding($utf8, 'UTF-8'));

bool(false)
bool(true)

(Online demo)

Thanks, this is the output of bin2hex: 4d6f7a696c6c612f352e3020284c696e75783b20416e64726f696420352e303b20d6574e205331204275696c642f4c525832314d29204170706c655765624b69742f3533372e333620284b48544d4c2c206c696b65204765636b6f292056657273696f6e2f342e30204368726f6d652f33372e302e302e30204d6f62696c65205361666172692f3533372e3336 — Sofia Grillo, Dec 06 '17 at 14:55
From what I'm reading the HTTP specification says the headers default character set is ISO-8859-1. Do you think I should change that field to latin1? — Sofia Grillo, Dec 06 '17 at 15:17
...I can't because I have a lot of user agents in utf8 like `Mozilla/5.0 (Linux; U; Android 4.1.2; ar-ye; SPH-D710 Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 الجوال Safari/534.30` — Sofia Grillo, Dec 06 '17 at 15:27
If specs say so (again, I'm not particularly familiar with them) then it *should* be safe to convert from ISO-8859-1 to UTF-8 (the `mb_convert_encoding()` part of my examples) because UTF-8 can store the complete Unicode catalogue. But, of course, it's useless if you're effectively getting mixed encodings in the wild. Perhaps you're doomed to checking for valid UTF-8 with `mb_check_encoding()` and assuming Latin-1 otherwise. BTW, I can't decode your hex string with `hex2bin()`, it has several non-printable characters scattered here and there :-? — Álvaro González, Dec 06 '17 at 16:12
using tools like this https://www.rapidtables.com/convert/number/hex-to-ascii.html show you were right, it's `ÖWN` — Sofia Grillo, Dec 07 '17 at 10:36
That page claims to convert to ASCII but ASCII is a 7-bit encoding that cannot store characters like `Ö`. Unfortunately, there's a lot of misunderstanding out there about how computers store texts. — Álvaro González, Dec 07 '17 at 10:49
`d6` is the 2-digit hex, then it does in javascript `String.fromCharCode(parseInt('d6', 16))` = `Ö` — Sofia Grillo, Dec 07 '17 at 12:13
That's like that because [String.fromCharCode](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode) has a badly designed but clear spec: it expects UCS-2, where `Ö` is presented as `D6 00` or `00 D6` (depending on the [endianness](https://en.wikipedia.org/wiki/Endianness)) and, by pure chance, the missing byte is a zero that gets implicitly assumed. — Álvaro González, Dec 07 '17 at 13:12

General error: 1366 Incorrect string value: '\xD6wn Sm...' for column 'user_agent'

1 Answers1