
Let's say (for simplicity's sake) that I have a multibyte, UTF-8 encoded string variable with 3 letters (consisting of 4 bytes):

$original = 'Fön';

Since it's UTF-8, the bytes' hex values are (excluding the BOM):

46 C3 B6 6E

As the $original variable is user-defined, I will need to handle two things:

  1. Get the exact number of bytes (not UTF-8 characters) used in the string, and
  2. A way to access each individual byte (not UTF-8 character).

I would tend to use strlen() to handle "1.", and access the $original variable's bytes with a simple `$original[$byteposition]`, like this:

<?php
header('Content-Type: text/html; charset=UTF-8');

$original = 'Fön';
$totalbytes = strlen($original);
for($byteposition = 0; $byteposition < $totalbytes; $byteposition++)
{
    $currentbyte = $original[$byteposition];

    /*
        Doesn't work since var_dump shows 3 bytes.
    */
    var_dump($currentbyte);

    /*
        Fails too since "ord" only works on ASCII chars.
        It returns "46 F6 6E"
    */
    printf("%02X", ord($currentbyte));
    echo('<br>');
}

exit();
?>

This proves my initial idea is not working:

  1. var_dump shows 3 bytes
  2. printf fails too since "ord" only works on ASCII chars

How can I get the single bytes from a multibyte PHP string variable in a binary-safe way?

What I am looking for is a binary-safe way to convert UTF-8 string(s) into byte-array(s).

e-sushi
  • If strlen returns a character count rather than a byte count, then check php.ini for the value of [mbstring.func_overload](http://php.net/manual/en/mbstring.overload.php); but are you sure your `ö` is a UTF-8 character and not simply [extended ASCII](http://www.ascii-code.com/)? F6 is the hex code for `ö` in extended ascii – Mark Baker Aug 01 '13 at 11:46
  • just an idea: `$a = utf8_encode('Fön'); $b = unpack('C*', $a); var_dump($b);` the result is an array with 4 int values, i utf8_encoded because i had an iso-file. – steven Aug 01 '13 at 11:49
  • and you can find an uniord function in the comments here: http://us.php.net/manual/en/function.ord.php (search for "uniord") – steven Aug 01 '13 at 11:51
  • @MarkBaker Yes, I am sure it's UTF-8 as a memory-dump and a file-dump both show `ö` is correctly represented as `C3 B6`, which fits UTF-8 and not extended ASCII (which would be represented by 1 byte). – e-sushi Aug 01 '13 at 12:10

2 Answers


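You can get a byte array by unpacking the string. I pass the literal through utf8_encode() here because my source file was saved as ISO-8859-1; if your string is already UTF-8 (as in the question), skip that call, otherwise the `ö` gets encoded twice: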

$a = utf8_encode('Fön');
$b = unpack('C*', $a); 
var_dump($b);

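The format C* reads every byte as an "unsigned char", so $b holds one integer (0-255) per byte. Note that unpack() numbers the result from 1, not 0.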
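
For a string that is already UTF-8, the same idea can be sketched without utf8_encode(). This is a minimal example, not a definitive implementation: it assumes mbstring.func_overload is not enabled (see the comment under the question), and it writes the `ö` as explicit bytes so the result does not depend on how the file was saved. The variable names are only illustrative.

```php
<?php
// "Fön" with the ö spelled out as its two UTF-8 bytes (C3 B6).
$original = "F\xC3\xB6n";

// strlen() counts bytes, not characters (assuming no mbstring overloading).
$byteCount = strlen($original);

// unpack() returns keys starting at 1; array_values() re-indexes from 0.
$bytes = array_values(unpack('C*', $original));

// bin2hex() is binary-safe; split into pairs for a readable dump.
$hex = strtoupper(implode(' ', str_split(bin2hex($original), 2)));

var_dump($byteCount); // int(4)
var_dump($bytes);     // 70, 195, 182, 110
var_dump($hex);       // "46 C3 B6 6E"
```

Plain `$original[$i]` with ord() gives the same numbers, since ord() simply returns the value (0-255) of the first byte it is given.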

RiggsFolly
steven

I actually wrote my own class for this problem.
I was trying to recreate JavaScript's new TextEncoder("utf-8").encode(...) in PHP.
This is what I came up with. It uses the PHP
ord() function to get the bytes
and the chr() function to build the UTF-8 string back.

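class Uint8Array {
    public $val = array();  // byte values, one integer (0-255) per byte
    public $length = 0;     // number of bytes in $val

    // Load bytes from a raw string ("utf8") or from a string of hex digits ("hex").
    public function from($string, $mode = "utf8") {
        $arr = [];
        if ($mode == "utf8") {
            // str_split() cuts the string into single bytes, not characters
            foreach (str_split($string) as $chr) {
                $arr[] = ord($chr);
            }
        } elseif ($mode == "hex") {
            // two hex digits per byte; a trailing odd digit is ignored
            for ($i = 0; $i + 1 < strlen($string); $i += 2) {
                $arr[] = hexdec($string[$i] . $string[$i + 1]);
            }
        } else {
            return null; // unknown mode: leave the object unchanged
        }
        $this->val = $arr;
        $this->length = count($arr);
        return $arr;
    }

    // Return the bytes as a raw string ("utf8") or as lowercase hex digits ("hex").
    public function toString($enc = "utf8") {
        $str = "";
        if ($enc == "utf8") {
            foreach ($this->val as $byte) {
                $str .= chr($byte);
            }
        } elseif ($enc == "hex") {
            foreach ($this->val as $byte) {
                $str .= str_pad(dechex($byte), 2, "0", STR_PAD_LEFT);
            }
        } else {
            return null; // unknown encoding
        }
        return $str;
    }
}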

Use it like this.

Create an instance:

$handle = new Uint8Array;

Feed it input with ->from(string, encoding), where the encoding is either "utf8" (a raw string) or "hex" (hex bytes without spaces):

$handle->from("Fön","utf8");
//or with hex bytes
$handle->from("46c3b66e","hex");

Get the output with ->toString(encoding), as "utf8" or "hex":

$to_utf8 = $handle->toString("utf8");
//Fön
$to_hex = $handle->toString("hex");
//46c3b66e

The byte array itself is available as ->val, and its size as ->length:

$bytearray = $handle->val;
//[70, 195, 182, 110]
$arrayleng = $handle->length;
//4

That is all, feel free to use this!

You can learn more about used functions here:
chr() ord()
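
If a class is more than you need, the same conversions can be sketched with PHP's built-in pack(), unpack(), bin2hex() and hex2bin(). The helper names below (bytes_from_string, string_from_bytes) are hypothetical and not part of the class above; the sketch assumes PHP 7.0 or later for the type declarations.

```php
<?php
// Hypothetical helper: raw string -> list of byte values (0-255).
function bytes_from_string(string $s): array {
    // unpack() keys start at 1; re-index from 0. "?: []" covers an empty result.
    return array_values(unpack('C*', $s) ?: []);
}

// Hypothetical helper: list of byte values -> raw string.
function string_from_bytes(array $bytes): string {
    // The spread operator passes each byte as its own argument to pack().
    return pack('C*', ...$bytes);
}

$bytes = bytes_from_string("F\xC3\xB6n");   // "Fön" as explicit UTF-8 bytes
$text  = string_from_bytes($bytes);          // back to the original string
$hex   = bin2hex($text);                     // lowercase hex, no spaces
$again = bytes_from_string(hex2bin($hex));   // hex -> string -> bytes

var_dump($bytes); // 70, 195, 182, 110
var_dump($hex);   // "46c3b66e"
```

All four functions work on bytes, so the round trip is binary-safe regardless of the string's character encoding.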