Generating all Unicode characters not in ASCII scheme in PHP?

Question

This regular expression is supposed to match all non-ASCII characters, i.e. everything outside code points 0-127:

 /[^\x00-\x7F]/i

Imagine I want to test (just out of curiosity) this regular expression with all Unicode characters, 0-1114111 code points.

Generating this range may be simple with range(0, 1114111). Then I should convert each decimal number to hexadecimal with the dechex() function.

After that, how can I convert the hexadecimal number to the actual character? And how can I exclude the characters that are already in ASCII?

@JvdBerg Did I say UTF-8? I'm trying, for example, to generate a random printable string... — gremo, Sep 20 '12 at 17:35
@Gremo Unicode is a standard, while UTF-8, UTF-16, and others are character encodings: implementations of Unicode. I think most people will assume you're working with UTF-8, but that may not be the case. — Izkata, Sep 20 '12 at 17:57

score 0 · Accepted Answer · edited May 23 '17 at 11:43

It depends on how you are going to do the matching and whether you are going to put the PCRE regex engine into UTF-8 mode with the /u modifier.

If you do use the /u modifier, then first of all you must use UTF-8 encoding for both the regular expression and the subject; the regex engine will then automatically interpret each legal UTF-8 byte sequence as a single character. In this mode the regular expression [^\x00-\x7F] will match all characters outside the ASCII range (the Basic Latin block), including those with code points greater than 255. You will also need to generate the UTF-8 representation of each character (given its code point) manually.
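As a rough illustration of the /u case, here is a minimal sketch that encodes a code point as UTF-8 by hand and checks the regex against every code point. It assumes PHP 7 or later (for the type declarations); codepoint_to_utf8() and count_mismatches() are hypothetical helper names made up for this example, not built-in functions. Surrogates (U+D800 to U+DFFF) are skipped because they cannot be encoded as valid UTF-8.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not an encodable code point: $cp");
    }
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: count code points in [$from, $to] where the regex
// result disagrees with the plain numeric test "code point > 0x7F".
function count_mismatches(int $from, int $to): int
{
    $mismatches = 0;
    for ($cp = $from; $cp <= $to; $cp++) {
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue; // surrogates have no valid UTF-8 form
        }
        $matched = preg_match('/[^\x00-\x7F]/u', codepoint_to_utf8($cp)) === 1;
        if ($matched !== ($cp > 0x7F)) {
            $mismatches++;
        }
    }
    return $mismatches;
}

echo count_mismatches(0, 0x10FFFF), " mismatching code points\n";
```

The pattern is written in single quotes so that PHP passes the \x00 and \x7F escapes through to PCRE untouched.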

If you do not use the /u modifier then the regex engine will be dumb: it will consider each byte as a separate character, which means that you have to work at byte rather than character level. On the other hand, you will now be able to work with any encoding you prefer. However, you will have to ditch the [^\x00-\x7F] regex (because it's only going to be matching random bytes in the string) and work with a regular expression that embodies the rules of your chosen encoding (example for UTF-8). To generate the encoded forms of random characters you will again need to use custom code that depends on the specific encoding.
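To make the byte-level behaviour concrete, here is a small sketch comparing the two modes on the same UTF-8 string. It assumes PHP 7 or later (for the "\u{...}" string escape). The $nonAsciiUtf8 pattern is only one possible way to spell out the UTF-8 rules, following the well-formed byte sequence ranges from the Unicode standard; it is not necessarily the pattern the linked example uses.

```php
<?php
// "aéb€" written with \u{...} escapes, which always produce UTF-8 bytes.
$subject = "a\u{E9}b\u{20AC}";

// Without /u every byte >= 0x80 counts separately: é is 2 bytes, € is 3.
$bytewise = preg_match_all('/[^\x00-\x7F]/', $subject);

// With /u the same class matches whole characters.
$charwise = preg_match_all('/[^\x00-\x7F]/u', $subject);

// Byte-level pattern (no /u) that matches one well-formed, non-ASCII
// UTF-8 sequence; the /x flag only allows the layout whitespace.
$nonAsciiUtf8 = '/(?:
      [\xC2-\xDF][\x80-\xBF]              # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]          # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3-byte, general case
    | \xED[\x80-\x9F][\x80-\xBF]          # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4-byte, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}           # 4-byte, general case
    | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4-byte, up to U+10FFFF
)/x';
$encodedChars = preg_match_all($nonAsciiUtf8, $subject);

printf("bytes: %d, /u: %d, byte-level pattern: %d\n",
       $bytewise, $charwise, $encodedChars);
// prints "bytes: 5, /u: 2, byte-level pattern: 2"
```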

score 0 · Answer 2 · answered Sep 20 '12 at 18:08

I think the hex2bin(string) function will convert a hex string into a binary string. To exclude ASCII character code points, just begin from the \x80 hex code point (skipping \x00 to \x7F).
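One caveat worth sketching: hex2bin() only decodes pairs of hex digits into raw bytes, so its output is not automatically a valid encoded character. The example below contrasts it with mb_chr(), assuming PHP 7.2 or later with the mbstring extension and UTF-8 as the target encoding. non_ascii_chars() is a hypothetical generator name made up for this example; it starts at 0x80 as suggested above and skips surrogates, with no dechex() step needed.

```php
<?php
// hex2bin() yields the single raw byte 0xE9, which is not valid UTF-8 by itself.
$rawByte = hex2bin('e9');

// mb_chr() encodes the code point U+00E9 in the requested encoding (2 bytes in UTF-8).
$encoded = mb_chr(0xE9, 'UTF-8');

// Hypothetical helper: lazily yield code point => UTF-8 string for the
// non-ASCII range, so the whole list never has to sit in memory.
function non_ascii_chars(int $from = 0x80, int $to = 0x10FFFF): Generator
{
    for ($cp = $from; $cp <= $to; $cp++) {
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue; // surrogates are not encodable characters
        }
        yield $cp => mb_chr($cp, 'UTF-8');
    }
}

// Usage: print the first few non-ASCII characters with their code points.
foreach (non_ascii_chars(0xA1, 0xA5) as $cp => $char) {
    printf("U+%04X %s\n", $cp, $char);
}
```

A generator is used here instead of range(0, 1114111) simply to avoid building a million-element array up front.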

But it does sort of sound like you're trying to unit test the regex library, which seems unnecessary unless you are developing the regex library, or you need to be extremely paranoid.
