1

my app is handling delivery addresses of people's orders in a webshop / connected marketplace like ebay. I already accounted for UTF-8 encoding meaning it handles kyrillic, chinese etc characters correctly. However, from time to time I have entries with an unknown character � which already appears for example in the delivery address as viewed at ebay. So there's nothing going wrong along the way - the string is delivered like that.

Now at some point I am performing an address check against an official (german) address DB like so:

$query = "SELECT DISTINCT * FROM adrCheck WHERE zip='".$zip."' AND street='".$street." AND city='".$city."'";

In case there is at least one result, I know the address must be correct. Anyhow, when those incorrect characters appear I get a SQL error MYSQLi Error (#1267): Illegal mix of collations (cp850_general_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' which I can react to. BUT I want to be able to check beforehand and include only those parameters into the query which are correctly encoded.

I have tried

print_r(mb_detect_encoding("K�ln")); // gives me UTF-8
print_r(mb_check_encoding("K�ln", "UTF-8")); // gives me 1 / true

and the preg_match method which also tells me that it's valid UTF-8.

What am I overlooking? Any suggestions on how to handle this occasional snafu user input?

Community
  • 1
  • 1
Engle
  • 45
  • 4
  • what charset does the database use for tables and do you set the meta charset in html? it would nice if you used prepared statements than simple `.$var.` – Gntem Apr 17 '17 at 11:33
  • The DB is configured with utf8_general_ci. And I also use `` in my . Additionally I set `header('Content-Type: text/html; charset=utf-8');` at the very beginning of my index.php. And last but not least I query `SET NAMES 'utf8'` right after I initialize my connection.. Might be overkill?! – Engle Apr 17 '17 at 11:38
  • youre right about the prepared statements. Though I went with this simpler method because the inputs have been validated through several DBs before (ebay, Amazon, the shop software...) and I don't write to the DB in this instance. – Engle Apr 17 '17 at 11:42
  • take a look here http://stackoverflow.com/questions/1008287/illegal-mix-of-collations-mysql-error – Gntem Apr 17 '17 at 11:42
  • makes no difference... I expected as much because DB collation, charset etc have all been set before. As I have mentioned, most special chars (ÄÖÜß, kyrillic, chinese...) are being handled correctly – Engle Apr 17 '17 at 11:55

3 Answers3

0

Your problem occurrs because you are receiving a latin-1 encoded string (most likely, because you mentioned something about German), and try to use those as a UTF-8 string. This works fine most of the time, because latin-1 builds on top of ASCII, and all caracters of ASCII are the same in UTF-8 (so you db does not care).

But the German Umlaute are encoded differently in latin-1 and in UTF-8, if you try to interpret an ä in latin-1 as UTF-8 it falls back to the � symbol you've showed above.

Your test print_r(mb_detect_encoding("K�ln")); tells you it is UTF-8, because the �-symbol itself is part of UTF-8. By copying the error string it is probably copying the �-symbol rather than the invalid caracter that used to be in its place

Try to convert your input string to UTF-8 with http://php.net/manual/de/function.mb-convert-encoding.php

Florian Moser
  • 2,583
  • 1
  • 30
  • 40
  • Your thought seems close, but in 99% of all my Umlaut-cases, things are correct. The specific order in front of me right now for example has invoiceAdr "Köln" (correct) and the deliveryAdr "K�ln" (incorrect). – Engle Apr 17 '17 at 11:45
  • Where do you get this data from? Is it imported from another source, or is it from your database? – Florian Moser Apr 17 '17 at 11:46
  • Yes, comes from delivery addresses originally from ebay/Amazon/webshop orders. The cases I notice like the above show this incompatible character ALREADY for example on ebay itself when you view the order details there. I have no idea what people enter sometimes. – Engle Apr 17 '17 at 11:57
  • Oh, and I think my input string already is UTF-8. Are you saying i should just check for that character �? :) – Engle Apr 17 '17 at 11:58
  • If your string are valid UTF-8 this should work. Else you might need some additional checks: http://stackoverflow.com/questions/6723562/how-to-detect-malformed-utf-8-string-in-php – Florian Moser Apr 17 '17 at 12:06
0

It seems in my case the � character is being imported into my DB as is - meaning as a valid UTF-8 character like @Florian Moser mentioned. I will go with simply checking for this character and see where it leaves me in the future.

Engle
  • 45
  • 4
0

SELECT HEX(col) -- what do you get? (Spaces added for clarity.)

4B EFBFBD 6C 6E  -- The input had the black diamond
4B F6     6C 6E  -- you stored latin1, not utf8
4B C3B6   6C 6E  -- correctly stored utf8 (or utf8mb4)

You mentioned Chinese -- You really need to be using utf8mb4, not just utf8. (Köln works the same in both.)

Since there are multiple cases, I recommend you study "Black Diamonds" in Trouble with utf8 characters; what I see is not what I stored

Community
  • 1
  • 1
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Yes, result was '4BEFBFBD6C6E' so this confirms that the replacement character was already part of the incoming data. Nothing I can do from my end than check for it and react accordingly. About chinese.. I didn't have any trouble so far (I'm working on a new version of something already running for a bunch of years) but I will look into it. Thank you for your tips. – Engle Apr 17 '17 at 17:17
  • Only some Chinese characters need 4-byte UTF-8 encodings (utf8mb4), so the Chinese you have seen so far may not have been a problem. – Rick James Apr 17 '17 at 18:52