202

This question looks embarrassingly simple, but I haven't been able to find an answer.

What is the PHP equivalent to the following C# line of code?

string str = "\u1000";

This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).

That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?

HoldOffHunger
Telaclavo
  • read : http://php.net/manual/en/regexp.reference.unicode.php – xkeshav May 19 '11 at 12:20
  • 6
    @diEcho: that's only for matching Unicode characters, but the OP wants to create those characters. – Stefan Gehrig May 19 '11 at 12:21
  • this may help: http://randomchaos.com/documents/?source=php_and_unicode – xkeshav May 19 '11 at 12:24
  • 1
    possible duplicate of [How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?](http://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-cha) – Ariel May 21 '13 at 09:03
  • 2
    This question is 10 years old. The accepted answer is painfully outdated. – HoldOffHunger Mar 11 '21 at 17:08

8 Answers

267

PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.

It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.

$unicodeChar = "\u{1000}";
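
As a minimal sketch of how this escape behaves (assuming PHP 7.0 or later; the variable names are only illustrative), the snippet below compares it with the equivalent raw UTF-8 bytes and with a single-quoted string, where the escape is not interpreted:

```php
<?php
// Requires PHP 7.0+. The escape is interpreted only in double-quoted and heredoc strings.
$fromEscape = "\u{1000}";      // U+1000, emitted as UTF-8
$fromBytes  = "\xE1\x80\x80";  // the same character written as raw UTF-8 bytes
$literal    = '\u{1000}';      // single quotes: the 8 characters stay as typed

var_dump($fromEscape === $fromBytes); // same byte sequence
var_dump(strlen($fromEscape));        // length in bytes, not characters
var_dump(strlen($literal));

// Code points above U+FFFF use the same syntax, just with more hex digits.
$astral = "\u{1F600}";
var_dump(strlen($astral));
```

The escape always produces UTF-8, so the result is only displayed correctly if the output is also served as UTF-8.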
Blackhole
  • 1
    This can be used like so: `wordwrap($longLongText, 20, "\u{200B}", true);` ([zero-width space](http://www.fileformat.info/info/unicode/char/200B/index.htm) it is) – sanmai Feb 12 '18 at 01:02
  • 20
    I believe the OP wanted this answer, not the accepted answer. At any rate, when I searched for "Unicode in PHP", it was because I wanted this answer, not the accepted answer. Maybe "\u{abcd}" didn't exist when this question was first asked. If so, the accepted answer should now be moved. – Adam Chalcraft May 29 '19 at 07:07
  • The OP is obviously frustrated with the answers provided so suggests his own answer in a comment on the accepted answer, which may be why that is the accepted answer and this isn't. As Adam suggests, this answer is what he was looking for and given PHP version 7.1.33 was out when he asked, I suspect this would have been the accepted answer if it wasn't posted 2 years too late. – Professor of programming Nov 18 '21 at 19:12
  • I agree with Adam, this should be the correct accepted answer now. – David Dec 01 '22 at 17:55
  • This works for date format strings, too. `date("l, F\u{00A0}j, Y")` – mbomb007 Mar 28 '23 at 18:41
191

Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:

$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');

Another option would be to use mb_convert_encoding() with a numeric HTML entity:

echo mb_convert_encoding('&#x1000;', 'UTF-8', 'HTML-ENTITIES');

or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:

echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
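
A rough sketch to check that the approaches agree, assuming the mbstring extension is loaded. The `mb_chr()` call is not part of the answer above; it is an addition that assumes PHP 7.2 or later, and the variable names are illustrative:

```php
<?php
$codepoint = 0x1000;

// JSON route: build the "\u1000" escape as text, then let json_decode() parse it.
// A single \uXXXX escape built this way only covers code points up to U+FFFF.
$viaJson = json_decode('"\u' . sprintf('%04x', $codepoint) . '"');

// UTF-16BE route: for U+1000 the two bytes are simply 0x10 0x00.
$viaUtf16 = mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');

// Assumption: PHP 7.2+, where mbstring provides mb_chr().
$viaMbChr = mb_chr($codepoint, 'UTF-8');

var_dump($viaJson === $viaUtf16, $viaUtf16 === $viaMbChr);
```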
Stefan Gehrig
  • 4
    @Gumbo: I know that but it doesn't make any difference in here. Javascript as well as JSON support the `\uxxxx` Unicode syntax so you can use `json_decode` to work on an artificially created JSON string representation. I changed the wording though to have that clarified. – Stefan Gehrig May 19 '11 at 12:48
  • 4
    Ok, so the strict formulation of one answer to my question is: $str=json_decode('"\u1000"'); Thank you. – Telaclavo May 19 '11 at 15:48
  • I tried `echo json_decode('\u201B');`, which refers to a [*single reverted quote*](http://en.wikipedia.org/wiki/Quotation_mark_glyphs). However it isn't working, meaning no output (even if piped to `hd`) – hek2mgl Jul 23 '14 at 12:52
  • 4
    You need `echo json_decode('"\u201B"');`. Double quotes around the unicode symbol are mandatory. – Stefan Gehrig Jul 23 '14 at 14:04
  • Are there some PHP consts to use instead of the plain string `'HTML-ENTITIES'` and `'UTF-8'`? – Xenos Nov 05 '18 at 08:56
  • `echo json_decode('"\u201B"');` doesn't work for me, you need double slash here as well as doublequotes: `echo json_decode('"\\u201B"');` – Ivan Shatsky Oct 30 '20 at 17:16
  • @IvanShatsky This shouldn't really matter: https://3v4l.org/8o0c4 – Stefan Gehrig Oct 31 '20 at 06:43
24

I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:

\x[0-9A-Fa-f]{1,2}

The sequence of characters matching the regular expression is a character in hexadecimal notation.

ASCII example:

<?php
    echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>

Hello World!

So for a code point such as U+30A2, all you need to do is $str = "\x30\xA2"; (for the U+1000 from the question it would be "\x10\x00"). But these are bytes, not characters. For code points up to U+FFFF, the byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:

<?php
    header('content-type:text/html;charset=utf-16be');
    echo("\x30\xA2");
?>

If you are using a different encoding, you'll need to alter the bytes accordingly (mostly done with a library, though possible by hand too).

UTF-16 little endian example:

<?php
    header('content-type:text/html;charset=utf-16le');
    echo("\xA2\x30");
?>

UTF-8 example:

<?php
    header('content-type:text/html;charset=utf-8');
    echo("\xE3\x82\xA2");
?>

There is also the pack function, but you can expect it to be slow.
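
For illustration, here is one way pack could be combined with mb_convert_encoding(), assuming the mbstring extension is available. This is a sketch rather than a benchmarked recommendation, and the variable names are made up for the example:

```php
<?php
// 'n' packs an unsigned 16-bit big-endian integer, i.e. one UTF-16BE code unit.
// That is enough for code points up to U+FFFF, excluding surrogates.
$bmp = mb_convert_encoding(pack('n', 0x30A2), 'UTF-8', 'UTF-16BE');

// 'N' packs an unsigned 32-bit big-endian integer, which matches UTF-32BE
// and therefore also covers code points above U+FFFF.
$astral = mb_convert_encoding(pack('N', 0x1F600), 'UTF-8', 'UTF-32BE');

var_dump(bin2hex($bmp), bin2hex($astral));
```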

Pacerier
  • 1
    Perfect for when copy/pasting a bullet character (\xE2\x80\xA2) could result in a UTF-8 encoding error in the source document. Thank you. – jimp Feb 05 '16 at 00:32
23

Before version 7.0, PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:

function unicodeString($str, $encoding=null) {
    if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}

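Or with an anonymous function expression instead of create_function (the only option on PHP 8.0 and later, where create_function no longer exists):

function unicodeString($str, $encoding=null) {
    // Fall back to mbstring's internal encoding (the mbstring.internal_encoding ini setting is deprecated)
    if (is_null($encoding)) $encoding = mb_internal_encoding();
    // Each \uXXXX escape holds one UTF-16 code unit: turn the hex digits into bytes, then convert
    return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
        return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
    }, $str);
}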

Its usage:

$str = unicodeString("\u1000");
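
One limitation worth noting: converting each \uXXXX escape on its own cannot handle surrogate pairs such as \ud83d\ude00, because a lone surrogate is not valid UTF-16. A possible variation, sketched here under the assumption that the input is UTF-8 and mbstring is loaded, converts each run of consecutive escapes in one go. The name `decodeUnicodeEscapes` is hypothetical and not part of the answer above:

```php
<?php
// Hypothetical helper: decodes \uXXXX escapes to UTF-8, treating consecutive
// escapes as a single UTF-16BE run so that surrogate pairs are kept together.
function decodeUnicodeEscapes(string $str): string
{
    return preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $match): string {
            // Drop every "\u" prefix, leaving only the hex digits of the run.
            $hex = str_replace('\u', '', $match[0]);
            // Hex digits -> raw UTF-16BE bytes -> UTF-8.
            return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
        },
        $str
    ) ?? $str; // keep the input unchanged if the regex engine reports an error
}

echo bin2hex(decodeUnicodeEscapes('\ud83d\ude00')), "\n";
```

Single-quoted strings are used for the input so that the backslashes reach the function unchanged on every PHP version.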
Gumbo
11
html_entity_decode('&#x30a8;', 0, 'UTF-8');

This works too. However the json_decode() solution is a lot faster (around 50 times).
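
When the code point is only known as a number at run time, the entity can be assembled first. A small sketch, with illustrative variable names, that relies on decimal numeric entities of the form `&#NNNN;`:

```php
<?php
$codepoint = 0x30A8; // same character as in the line above

// Build a decimal numeric entity such as "&#12456;" and decode it to UTF-8.
$char = html_entity_decode('&#' . $codepoint . ';', ENT_NOQUOTES, 'UTF-8');

var_dump(bin2hex($char));
```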

flori
6

Try Portable UTF-8:

$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );

All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.

Hamid Sarfraz
4

As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.

As also mentioned by others, before PHP 7 the only way to obtain a string value from a sensible Unicode character description was to convert it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.

However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.

This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.

First, a proof example:

// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = "&#8202;";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)

Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.

The next question is, how do you get from U+200A to \xE2\x80\x8A?

Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.

function str_encode_utf8binary($str) {
    /** @author Krinkle 2018 */
    $output = '';
    // str_split() without a length argument walks the string byte by byte
    foreach (str_split($str) as $octet) {
        // Format each byte as two hex digits (zero-padded) for PHP \x syntax
        $output .= sprintf('\x%02x', ord($octet));
    }
    return $output;
}

function str_convert_html_to_utf8binary($str) {
    return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
    return str_encode_utf8binary(json_decode($str));
}

// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e

// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary('&#8202;') . "\n";
// \xe2\x80\x8a

// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
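
To go from a numeric code point to the UTF-8 bytes without any intermediate JSON or HTML step, the UTF-8 bit layout can also be applied by hand. The following is a sketch of that idea; the function name `codepoint_to_utf8` is hypothetical and not part of the script above:

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($cp < 0x80) {        // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x200A)), "\n"; // HAIR SPACE
```

Feeding the result to a byte-to-escape formatter such as the one above then yields the `\x` sequence for pasting into source code.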
Timo Tijhof
0
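// Decode a string of hex digits that spell out UTF-16BE code units into UTF-8 text
function unicode_to_textstring($str){

    // Turn the hex digits into raw bytes
    $rawstr = pack('H*', $str);

    // Reinterpret those bytes as UTF-16BE and convert them to UTF-8
    $newstr =  iconv('UTF-16BE', 'UTF-8', $rawstr);
    return $newstr;
}

$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';

echo unicode_to_textstring($msg);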

chings228