
I am trying to process data that I fetched using curl, but I have encoding issues and can't find the right way to handle them.

This is the text I got (in hex): '6B 64 6F 20 6D C3 A1'. It should evaluate to the string 'kdo má', but instead it evaluates to 'kdo m??' (actually, the last two chars aren't question marks but U+00C3, http://www.fileformat.info/info/unicode/char/c3/index.htm, and U+00A1, http://www.fileformat.info/info/unicode/char/a1/index.htm).

I don't understand why some chars are 8bit and diacritic chars are 16 bit and how should PHP know which one is which, but anyway, how should I decode it?

user10099
  • _“I don't understand why some chars are 8bit and diacritic chars are 16 bit”_ – because that’s how a [variable-width encoding](http://en.wikipedia.org/wiki/Variable-width_encoding) works … – CBroe Sep 16 '13 at 21:53
  • You're probably getting UTF-8 text, which uses "high" ASCII bytes for the extended code sequences (the lower 7 bits of UTF-8 correspond 1:1 with US-ASCII). But you're probably dumping that UTF-8 text into a different charset's environment, where the UTF-8 high-bit sequences have no meaning, e.g. iso-8859. – Marc B Sep 16 '13 at 21:54

1 Answer


> I don't understand why some chars are 8bit and diacritic chars are 16 bit

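Because it's UTF-8, which is a variable-width encoding: ASCII characters take one byte each, while 'á' (U+00E1) is encoded as the two bytes C3 A1. PHP's plain string functions work on bytes, so by default PHP assumes one character == one byte.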
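To make the byte/character difference concrete, here is a minimal sketch using the bytes from the question. It assumes the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// The bytes from the question: 6B 64 6F 20 6D C3 A1
$raw = hex2bin('6b646f206dc3a1');

// strlen() counts bytes, not characters.
$byteCount = strlen($raw);              // 7

// mb_strlen() counts characters once it is told the encoding.
$charCount = mb_strlen($raw, 'UTF-8');  // 6

// The last character is a single code point (U+00E1) stored as two bytes.
$last    = mb_substr($raw, -1, 1, 'UTF-8');
$lastHex = strtoupper(bin2hex($last));  // "C3A1"

echo $byteCount, ' bytes, ', $charCount, ' characters, last char = ', $lastHex, "\n";
```

If the two bytes are instead read one at a time as a single-byte charset, they show up as the two separate characters U+00C3 and U+00A1 that the question describes.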

> and how should PHP know which one is which, but anyway, how should I decode it?

It can't know; you have to tell it. Check mbstring: http://php.net/manual/de/book.mbstring.php or recode: http://php.net/manual/en/book.recode.php
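As a rough sketch of what "telling it" can look like with mbstring: the data is first checked to be valid UTF-8, then either kept as UTF-8 (and declared as such to whatever displays it) or converted to the charset the rest of the application uses. ISO-8859-1 as the target is only an assumption for illustration; the question does not say which charset the environment expects.

```php
<?php
// The bytes from the question, as they would arrive from curl.
$raw = hex2bin('6b646f206dc3a1');

// Confirm the bytes are well-formed UTF-8 before doing anything else.
$isUtf8 = mb_check_encoding($raw, 'UTF-8');   // true

// Option 1: keep the data as UTF-8 and declare that to the consumer,
// e.g. in a web page: header('Content-Type: text/html; charset=utf-8');

// Option 2: convert to the charset the environment expects
// (ISO-8859-1 here is an assumed target, not something the question states).
$latin1    = mb_convert_encoding($raw, 'ISO-8859-1', 'UTF-8');
$latin1Hex = bin2hex($latin1);                // "6b646f206de1"

echo $isUtf8 ? "valid UTF-8\n" : "not UTF-8\n";
echo $latin1Hex, "\n";
```

After the conversion 'á' is the single byte E1, so the string is 6 bytes long and displays correctly in an ISO-8859-1 environment.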

Marcin Orlowski