
I want to convert a set of Unicode code points in string format to actual characters and/or HTML entities (either result is fine).

For example, if I have the following string assignment:

$str = '\u304a\u306f\u3088\u3046';

I want to use the preg_replace function to convert those Unicode code points to actual characters and/or HTML entities.

As per other Stack Overflow posts I saw for similar issues, I first attempted the following:

$str = '\u304a\u306f\u3088\u3046';
$str2 = preg_replace('/\u[0-9a-f]+/', '&#x$1;', $str);

However, whenever I attempt to do this, I get the following PHP error:

Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u

I tried all sorts of things like adding the u flag to the regex or changing /\u[0-9a-f]+/ to /\x{[0-9a-f]+}/, but nothing seems to work.

Also, I've looked at all sorts of other relevant pages/posts I could find on the web related to converting Unicode code points to actual characters in PHP, but either I'm missing something crucial, or something is wrong because I can't fix the issue I'm having.

Can someone please offer me a concrete solution on how to convert a string of Unicode code points to actual characters and/or a string of HTML entities?

Giacomo1968
HartleySan

2 Answers


From the PHP manual:

Single and double quoted PHP strings have special meaning of backslash. Thus if \ has to be matched with a regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.

First of all, in your regular expression, you're only using one backslash (\). As explained in the PHP manual, you need to use \\\\ to match a literal backslash (with some exceptions).

Second, you are missing the capturing group in your original expression. preg_replace() searches the given string for matches to the supplied pattern and returns the string with each whole match replaced by the replacement string. Inside the replacement, $1 refers to the text matched by the first capturing group, so without a group there is nothing for $1 to insert.

The updated regular expression with proper escaping and correct capturing groups would look like:

$str2 = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);

Output (the resulting entities &#x304a;&#x306f;&#x3088;&#x3046;, as rendered by a browser):

おはよう
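
To make the backslash layers concrete, here is a minimal sketch that prints the pattern text PHP actually passes to PCRE. The variable names are only illustrative. It also shows why a three-backslash literal happens to produce the same pattern as the four-backslash one.

```php
<?php
// Both literals below are single-quoted PHP strings.
$four  = '/\\\\u([0-9a-f]+)/i'; // each \\ pair collapses to one backslash
$three = '/\\\u([0-9a-f]+)/i';  // \\ collapses to one backslash, \u is left as-is

// PHP hands the same pattern text to PCRE in both cases.
var_dump($four === $three);     // bool(true)
echo $four, PHP_EOL;            // /\\u([0-9a-f]+)/i

// The literal from the question keeps a single backslash, so PCRE sees \u,
// an escape it does not support, and compilation fails.
$broken = '/\u[0-9a-f]+/';
echo $broken, PHP_EOL;          // /\u[0-9a-f]+/
```

Writing four backslashes is the safer habit, because it does not depend on the character after the backslashes being one that PHP leaves untouched.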

Expression: \\\\u([0-9a-f]+)

  • \\\\ - matches a literal backslash
  • u - matches the literal u character
  • ( - beginning of the capturing group
    • [0-9a-f]+ - character class with the + quantifier -- matches a digit (0 - 9) or a letter (from a - f), one or more times
  • ) - end of capturing group
  • i modifier - used for case-insensitive matching

Replacement: &#x$1;

  • & - literal ampersand character (&)
  • # - literal pound character (#)
  • x - literal character x
  • $1 - contents of the first capturing group -- in this case, the strings of the form 304a etc.
  • ; - literal semicolon that terminates the entity
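
Putting the expression and the replacement together, a minimal end-to-end sketch might look like the following. The second step is an addition not in the answer above: it assumes you want real UTF-8 characters rather than entities and uses html_entity_decode() for that.

```php
<?php
$str = '\u304a\u306f\u3088\u3046';

// Step 1: turn each \uXXXX escape into a hexadecimal HTML entity.
$entities = preg_replace('/\\\\u([0-9a-f]+)/i', '&#x$1;', $str);
echo $entities, PHP_EOL; // &#x304a;&#x306f;&#x3088;&#x3046;

// Step 2 (optional): decode the entities into real UTF-8 characters.
$chars = html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $chars, PHP_EOL;    // おはよう
```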

RegExr Demo.

Amal Murali
  • I said the regex I posted, not mine I personally coded. Take care. – Giacomo1968 Jan 05 '14 at 07:31
  • Amal, thanks for your answer. I knew I had made a stupid mistake, and as it turns out, it was simply a matter of escaping the backslash properly. I had forgotten that four backslashes are required because apparently the string parser turns four into two, and then the regex parser turns two into one. Thank you. As a quick question: What are the "improvements" added to the regex? I noticed that you changed `+` to `{4}`, which I'm not sure is an improvement, because I read somewhere that some Unicode code points can have five hex values. Also... – HartleySan Jan 05 '14 at 07:34
  • How is wrapping `[0-9a-f]{4}` in a capturing group and then adding a `+` to that group an improvement? I don't mean to criticize, I simply want to know. Thanks. – HartleySan Jan 05 '14 at 07:35
  • @JakeGould: FYI: I did not *grab* your regex. As you can see, my regex is **different** from what you've posted. And as I already said, I got the idea from the answer I linked above. Re: "I said the regex I posted, not mine I personally coded." -- you edited your comment. – Amal Murali Jan 05 '14 at 07:39
  • Amal, even though I gave you credit for the right answer (mainly because I liked your explanation of the backslashes), I still don't see where the improvements in the regex are. – HartleySan Jan 05 '14 at 07:45
  • @HartleySan the reason for the parentheses is to create a capture group which is used in the second parameter of preg_replace (i.e. the `$1`). The `{4}` is to prevent it from matching something with more or less than 4 characters after the `\u` because it will always be exactly 4 characters. – Mike Jan 05 '14 at 08:00
  • Amal, I get the parentheses, but the combination of `{4}` and `+` seems weird. – HartleySan Jan 05 '14 at 08:14
  • @HartleySan: Wikipedia says it could be 4 or 5, so I've updated my answer (also added some explanation for the capturing groups etc). – Amal Murali Jan 05 '14 at 08:44
  • You could also do `\\\\u([0-9a-f]{4,5})` – Mike Jan 06 '14 at 02:02
  • @Mike: Right, that's another option. If the OP wants to do it, sure. But I'm not entirely sure if 5 is even possible, so I left OP's original expression as it is (without additional changes) :P – Amal Murali Jan 06 '14 at 02:04

This page, titled "Escaping Unicode Characters to HTML Entities in PHP", seems to tackle it with this nice function:

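function unicode_escape_sequences($str){
  // json_encode() escapes non-ASCII characters as \uXXXX sequences
  $working = json_encode($str);
  // Rewrite each \uXXXX sequence as a hexadecimal HTML entity
  $working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
  // json_decode() strips the JSON quoting and returns a plain string
  return json_decode($working);
}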

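That function uses json_encode to turn raw UTF-8 text into \uXXXX escape sequences, rewrites those sequences as HTML entities, and then uses json_decode to strip the JSON quoting. Very nice technique. But since your string already contains the escape sequences, the preg_replace call alone is enough for your example: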

$str = '\u304a\u306f\u3088\u3046';
echo preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $str);

The output is:

&#x304a;&#x306f;&#x3088;&#x3046;

Which is:

おはよう

Which translates to:

Good morning
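
If the goal is actual characters rather than entities, and the input really is in JSON escape form (an assumption, though json_encode output fits it), json_decode can do the whole conversion by itself. JSON \u escapes always carry exactly four hex digits, and characters beyond U+FFFF are written as a surrogate pair, which json_decode also handles. The helper name decode_json_escapes below is hypothetical, and this is only a sketch:

```php
<?php
// Hypothetical helper: decodes JSON-style \uXXXX escapes into UTF-8 text.
// Assumes $str contains no raw double quotes, control characters, or
// backslash sequences that are invalid inside a JSON string.
function decode_json_escapes(string $str): ?string
{
    $decoded = json_decode('"' . $str . '"');
    return is_string($decoded) ? $decoded : null; // null when the input is not valid JSON
}

echo decode_json_escapes('\u304a\u306f\u3088\u3046'), PHP_EOL; // おはよう

// U+1F600 does not fit in four hex digits, so JSON writes it as a surrogate pair.
echo bin2hex(decode_json_escapes('\ud83d\ude00')), PHP_EOL; // f09f9880
```

A plain regex replacement into entities would turn each half of a surrogate pair into its own entity, which is not a valid character reference, so this route is worth considering when such characters can occur.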

Giacomo1968
  • Yes, `json_encode` is precisely the reason why I was looking to do this. Thank you very much. Your code works great, and as I mentioned in Amal's answer, the key was properly escaping the backslash. I noticed that you used three backslashes while Amal used four, but both solutions work. Why is that? – HartleySan Jan 05 '14 at 07:36
  • Thanks! But look at my answer and in the first sentence I provide a link to the source of this regex. I did not create it, I am simply sharing what I found. – Giacomo1968 Jan 05 '14 at 07:37
  • Noted. I was just hoping you could offer some insight. I really get confused about escaping backslashes sometimes, and I was hoping to better understand what PHP is doing behind the scenes with all those backslashes. – HartleySan Jan 05 '14 at 07:39
  • @HartleySan: [This answer](http://stackoverflow.com/a/20819109/1438393) might give you an idea about the differences between three backslashes and four backslashes. – Amal Murali Jan 05 '14 at 07:40
  • The reason it works with 3 or 4 backslashes is because in PHP \u has no special meaning and therefore remains a literal \u, whereas \\ in a string will be converted to a single \. So \\\\ will be converted to a \\ and \\\u will be converted to a \ and the \u will remain the same, meaning both of them will be \\u in the resultant string. – Mike Jan 05 '14 at 07:41
  • Thanks a lot for that explanation, Amal. It makes perfect sense. I ended up giving you credit for the answer because both you and Jake provided very good answers, but I liked your explanation of the backslash thing as well. Thank you. Also, sorry for not giving you credit for the right answer, Jake. It was a close call. – HartleySan Jan 05 '14 at 07:43
  • @Mike: For writing a single backslash, you need >> ` ` \ ` ` << (without spaces). Rendered as: ``\``. – Amal Murali Jan 05 '14 at 07:46
  • @JakeGould It's a pretty simple regex. I wouldn't say it's necessarily conclusive that it was copied from someone else. From the PHP manual (linked in his answer), they recommend four backslashes, and it is probably not that uncommon to know that unicode characters encoded like `\u304a` will always be 4 characters long after the `u`. – Mike Jan 05 '14 at 07:57