While updating some old projects, I'm working through some old ANSI/ASCII files and encodings. I want everything to run on UTF-8 so that I can support all kinds of languages.

I have a service that sends out SMS messages through a microservice. The endpoint, /sms.php, accepts some parameters from $_GET, which are then used in the application. I have some test files that make requests to check that everything is OK. My problem is that even though all files are UTF-8 encoded (I've checked multiple times), the URL I build is detected as ASCII rather than UTF-8.

My code looks like this:

$text = "message with æøå to make it utf8";
$params = urlencode($text);
$url = "http://localhost/sms.php?text=".$params;
echo mb_detect_encoding($text, "auto"); // prints UTF-8
echo mb_detect_encoding($url, "auto"); // prints ASCII
$res = file_get_contents($url);

And this is also what I see in my receiving endpoint.

At first I thought it had something to do with file_get_contents, but since the string is already converted after the urlencode call, I thought that might be the cause. I'm not sure how to get around this problem. The other problem is that a lot of my clients are using this old 2012 code (from before I started using UTF-8 as standard), so I can't change the endpoint without forcing them to change their current setups.

In a comment it was suggested that I check whether the string is UTF-8 using bin2hex:

bin2hex($_GET['text']); // 6d657373616765207769746820c3a6c3b8c3a520746f206d616b652069742075746638 which is inserted into the database: message with æøå to make it utf8
bin2hex(utf8_decode($_GET['text'])); // 6d657373616765207769746820e6f8e520746f206d616b652069742075746638 which is inserted into the database: message with æøå to make it utf8

I hope someone out there can point me in the right direction. I've looked into multiple Stack Overflow entries, for example "get utf8 urlencoded characters in another page using php" and "What's the correct encoding of HTTP get request strings?",

but I'm not sure whether what I'm looking for is even possible. I was just hoping to be able to rewrite the entire project to be UTF-8 ready.

Thanks /Wel

Wel Rachid

1 Answer

mb_detect_encoding gives you the first encoding in which the tested string is valid. If left to its own devices, it tests for ASCII before UTF-8. Since a URL-encoded string consists solely of a subset of ASCII characters, it is valid ASCII and mb_detect_encoding will tell you so. Whereas a string containing non-ASCII characters is not valid ASCII, so it will continue testing other encodings and eventually arrive at UTF-8.

UTF-8 is a superset of ASCII, so any string that is valid ASCII is also valid UTF-8. A string can be valid in multiple encodings at once; mb_detect_encoding telling you it's valid ASCII does not mean that it's not also valid UTF-8, or Latin-1, or numerous other encodings for that matter. That's how Mojibake is born.
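To make that concrete, here is a minimal sketch using the sample text from the question. It assumes the mbstring extension is loaded and that the source file is saved as UTF-8. It shows that the URL-encoded string passes validation in several encodings at once, and that decoding it gives back the original UTF-8 bytes unchanged.

```php
<?php
// Same sample text as in the question; this file is assumed to be saved as UTF-8.
$text = "message with æøå to make it utf8";

// urlencode() percent-encodes every non-ASCII byte, so the result is pure ASCII.
$encoded = urlencode($text);
echo $encoded, PHP_EOL; // message+with+%C3%A6%C3%B8%C3%A5+to+make+it+utf8

// The encoded string is valid in all of these encodings at the same time.
foreach (['ASCII', 'UTF-8', 'ISO-8859-1'] as $encoding) {
    var_dump(mb_check_encoding($encoded, $encoding)); // bool(true) each time
}

// Decoding restores the original UTF-8 bytes, so nothing was lost in transit.
$decoded = urldecode($encoded);
var_dump($decoded === $text);                   // bool(true)
var_dump(mb_check_encoding($decoded, 'UTF-8')); // bool(true)
```

PHP already applies this decoding step when it populates `$_GET`, so the "ASCII" result for the URL says nothing about what the receiving script ends up with.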

Detecting encodings is largely vague nonsense anyway and you should never do that. If you expect a string to be in UTF-8, simply test whether it is valid UTF-8 or not:

mb_check_encoding($url, 'UTF-8')

If it's not valid in the expected encoding, discard it, since you have no clue what it really is then.
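As a rough sketch of that validate-or-discard approach: `requireUtf8` below is a hypothetical helper name, not part of PHP or of the code in the question, and the nullable type hints assume PHP 7.1 or later with mbstring loaded.

```php
<?php
/**
 * Hypothetical helper: return the value if it is valid UTF-8, otherwise null.
 */
function requireUtf8(?string $value): ?string
{
    if ($value === null || !mb_check_encoding($value, 'UTF-8')) {
        return null; // unknown encoding: discard rather than guess
    }
    return $value;
}

$valid   = "message with \xC3\xA6\xC3\xB8\xC3\xA5"; // æøå as UTF-8 bytes
$invalid = "message with \xE6\xF8\xE5";             // æøå as Latin-1 bytes

var_dump(requireUtf8($valid) === $valid); // bool(true)
var_dump(requireUtf8($invalid));          // NULL
```

In an endpoint like the one in the question, this could be called as `requireUtf8($_GET['text'] ?? null)`, with the request rejected when the result is null.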

deceze
  • Hi, thanks for the reply. I'm sorry I didn't mention that I'm well aware that detection can be ordered so that it starts with UTF-8 instead of ASCII, but that doesn't explain why, on the receiving end, I need to call utf8_decode to be able to read it as UTF-8 and get it inserted into the DB correctly: `$_GET['text'] = utf8_decode($_GET['text']);` – Wel Rachid Jul 20 '18 at 13:08
  • That is a completely different cattle of worms. Likely `$_GET['text']` is perfectly fine UTF-8, and you're simply not treating the encoding settings correctly when inserting into the database. Use `echo bin2hex($_GET['text'])` to see the *bytes* your string consists of and check whether they represent the correct encoding for the characters you expect. – deceze Jul 20 '18 at 13:11
  • I tried that and yes, the string does still look like UTF-8, but then inserting it into the DB makes it all go away again? I understand this is outside the scope of this question, but perhaps I can change it a little to fit – Wel Rachid Jul 20 '18 at 13:41
  • Likely you need to follow https://stackoverflow.com/a/279279/476 closely. – deceze Jul 20 '18 at 13:43
  • That was exactly it. Using set_charset I was able to make it work in the database as well. Thanks! – Wel Rachid Jul 20 '18 at 15:51
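
The thread resolved the database issue by calling set_charset on the connection. The sketch below does not talk to a database at all; it only simulates, under the assumption that the connection was treating incoming bytes as Latin-1, what happens to the UTF-8 bytes from the question, and why utf8_decode appeared to help. It assumes mbstring is loaded.

```php
<?php
$utf8 = "\xC3\xA6\xC3\xB8\xC3\xA5"; // "æøå" as UTF-8 bytes, as seen in $_GET['text']

// Simulated misreading: each UTF-8 byte is taken for a Latin-1 character
// and converted to UTF-8 again, which doubles up the encoding.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The workaround from the comments: convert to Latin-1 first, so that a
// layer expecting Latin-1 input happens to read the right characters.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), PHP_EOL;     // c3a6c3b8c3a5
echo bin2hex($mojibake), PHP_EOL; // c383c2a6c383c2b8c383c2a5
echo bin2hex($latin1), PHP_EOL;   // e6f8e5
```

Declaring the connection charset so that it matches the bytes actually sent, as the linked answer describes, removes the need for either conversion.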