UnicodeCategory.Otherletter block range for regex

Question

I need to restrict a text field's length to a variable number of characters. I say variable because CJK ideographs need to count as 2 characters. For example, if I were restricting the length to 10, I could have 10 Latin characters but only 5 ideographs, or 4 Latin and 3 CJK ideographs (4 + (3*2)).

I had this implemented well enough in C# by using:

if (char.GetUnicodeCategory(str, i) == UnicodeCategory.OtherLetter)

The thing is, this was being checked on a form post; what I really want is a JavaScript implementation that checks as the user is typing. I could use a regex to check each char, but I cannot find out which Unicode block ranges UnicodeCategory.OtherLetter uses.

This site seems really helpful for putting together the regex, but I just need to know what I'm looking for to match the C# implementation's behaviour.

Perl has a Unicode property called 'OtherLetter' - `\p{Lo}` but I don't know if JS supports Unicode, or if it does, supports otherletter. — , Oct 29 '13 at 15:35

score 4 · Accepted Answer · edited May 23 '17 at 12:06

C#

Firstly, if your goal is to count only the CJK ideographs as 2 characters, then the current C# code you have isn't quite right. The Unicode General Category OtherLetter is more or less intended for scripts that have no concept of letter case. This means that not only would CJK characters match, but so would Arabic, Hebrew, Khmer, Georgian, etc. In the Unicode data, the CJK characters are called the Han script.

Unfortunately, I could not find an easy solution within the .NET Framework to check for the script of a character. You can, however, use .NET Regex to match Unicode Blocks. Just match the necessary CJK blocks in addition to the general category. Unfortunately, although Unicode tries to keep the blocks homogeneous, it makes no guarantee that errant characters from other scripts will not end up in the "wrong" blocks. I imagine this is unlikely with the CJK blocks though.

Also, a minor issue is that you might want to consider using System.Globalization.CharUnicodeInfo.GetUnicodeCategory(str, i) instead of char.GetUnicodeCategory(str, i). The CharUnicodeInfo version is meant to be up to date with the current version of Unicode, while the other may not be, for backwards compatibility reasons.

JavaScript

Unfortunately, JavaScript's Unicode support is not that good, especially when it comes to regexes. It has actually already been asked whether there is a way to get the general category in JavaScript. It appears that there is not, but the answers there mention the XRegExp library, which can check for a character's general category, in addition to its script.

Mathias Bynens has a great article detailing JavaScript's current shortcomings with Unicode and improvements expected in the upcoming ECMAScript 6. He also provides links to polyfills for these improvements.

While ECMAScript 6 provides much better support for astral characters, a quick glance at the current draft (Oct. 28, 2013, rev. 20) shows no sign of including support to match Unicode General Categories, blocks or scripts.
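For anyone reading this later: editions of the language after ES6 did add this. ES2018 introduced Unicode property escapes, which are only available on regexes that carry the "u" flag. The following is a minimal sketch, assuming an ES2018-capable engine (an assumption that does not hold for the browsers this answer was written against); the variable names are just for illustration.

```javascript
// Assumes an ES2018+ engine: \p{...} only works together with the "u" flag.
const hanIdeograph = /\p{Script=Han}/u; // any code point whose script is Han
const otherLetter = /\p{Lo}/u;          // General_Category = Other_Letter

const samples = {
  han: "\u4E2D",             // CJK ideograph in the BMP
  hebrew: "\u05D0",          // Hebrew alef: category Lo, but not Han
  latin: "a",                // category Ll
  astralHan: "\uD840\uDC00", // U+20000, Plane 2
};

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, hanIdeograph.test(ch), otherLetter.test(ch));
}
```

Note that the Hebrew letter matches \p{Lo} but not \p{Script=Han}, which is the same over-matching problem described for the C# code above.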

Astral Characters

Astral characters are those which are found in planes beyond the Basic Multilingual Plane (BMP, Plane 0), that is, characters with values greater than 0xFFFF. Both C# and JavaScript use UTF-16 as their string encoding. This means that an astral character is actually formed with 2 code units (a surrogate pair) instead of the 1 used for a BMP character. My answer to a previous Unicode question goes into a little more detail about the encoding, but suffice to say, this can wreak havoc. In particular, the string length of a single astral character is 2, and regex engines have a hard time dealing with them.

Neither the C# blocks approach nor the XRegExp solution properly deals with astral characters, and many of the rarer CJK characters are located in the Supplementary Ideographic Plane (SIP, Plane 2). That said, "character" is an overloaded term, and has been used to mean "code unit", "code point", and "user-perceived character". For this answer, I've been using it to mean code point, but I can't tell which one you mean, so the best I can do is to make you aware of the issues of astral characters.
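To make the length issue concrete, here is a small sketch comparing a BMP ideograph with one from Plane 2. It assumes an ES6-capable engine, since codePointAt and Array.from are ES6 additions; the describe helper is hypothetical and only exists for this illustration.

```javascript
// U+4E2D sits in the BMP; U+20000 sits in Plane 2 and needs a surrogate pair.
const bmp = "\u4E2D";
const astral = "\uD840\uDC00";

// Hypothetical helper: reports how one string looks at each level.
function describe(s) {
  return {
    codeUnits: s.length,               // length counts UTF-16 code units
    codePoints: Array.from(s).length,  // the string iterator walks code points
    first: s.codePointAt(0).toString(16).toUpperCase(),
  };
}

console.log(describe(bmp));    // codeUnits 1, codePoints 1, first "4E2D"
console.log(describe(astral)); // codeUnits 2, codePoints 1, first "20000"
```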

Note that though it hasn't yet been released, XRegExp's GitHub repository indicates that they have already implemented support for astral characters in the upcoming version 3.

Manually Matching

Given all the difficulties, it might just be best to use a regex to manually match all appropriate code points. The downside of this, of course, is that it would have to be updated when new CJK characters are added to the standard. The code points for the CJK ideographs can be found in the Unicode script data by searching for the "Han" script and then taking the ranges indicated by Lo (Letter, other). The corresponding regex, which should work (though not tested) in C# and JavaScript, would be:

[\u3400-\u4DB5\u4E00-\u9FCC\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|[\uD86A-\uD86C][\uDC00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D]|\uD87E[\uDC00-\uDE1D]

Depending on your definition, the code points U+3005, U+3007, U+3021-U+3029, U+3038-U+303A and U+303B may or may not be considered ideographs. They have the categories Lm and Nl for "Letter, modifier" and "Number, letter" rather than Lo, so the regex above does not match them.
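Putting it together for the length check in the question, a rough sketch could look like the following. The function names weightedLength and fitsLimit are made up for this example, and it assumes that every non-Han UTF-16 code unit counts as 1, which may or may not be what you want for astral characters outside the Han ranges.

```javascript
// Han ideographs (category Lo), using the ranges from the regex above.
// The "g" flag makes match() return every occurrence.
const HAN = /[\u3400-\u4DB5\u4E00-\u9FCC\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|[\uD86A-\uD86C][\uDC00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D]|\uD87E[\uDC00-\uDE1D]/g;

// Hypothetical helper: each Han ideograph counts as 2, everything else as 1.
function weightedLength(str) {
  const matches = str.match(HAN) || [];
  // Code units consumed by ideographs (2 per astral one, 1 per BMP one).
  let hanUnits = 0;
  for (const m of matches) hanUnits += m.length;
  // Assumption: the remaining text is counted per UTF-16 code unit.
  return (str.length - hanUnits) + 2 * matches.length;
}

// Hypothetical helper for the input handler: is the text within the limit?
function fitsLimit(str, limit) {
  return weightedLength(str) <= limit;
}

console.log(weightedLength("abcd\u4E2D\u6587\u5B57")); // 4 + (3*2) = 10
console.log(fitsLimit("abcd\u4E2D\u6587\u5B57x", 10)); // false, weight is 11
```

Because an astral ideograph is matched as a whole surrogate pair, it counts as 2 here, the same as a BMP ideograph, rather than 4.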

this is an example of an answer that should get hundreds of upvotes and this not happening is often a criticism of stackoverflow - maybe should put it in meta somewhere and suggest a bonus system or something? — Cel, Apr 17 '15 at 09:44
