Working with Unicode Blocks in Regex

Question

I am trying to add a feature that works with certain groups of Unicode characters in a string. I found this question, which suggests the following solution; it removes every character outside the stated range:

s = Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);

This works fine.

In my research, though, I came across the use of Unicode blocks, which I find far more readable.

InBasic_Latin = U+0000–U+007F

More often, though, I saw recommendations to use the code points themselves (\u0000-\u007F) rather than the block names (InBasic_Latin). I can see the benefit of declaring an explicit range when you need a subset of a block or a single code point, but when you want the entire block, the block name seems friendlier to both readability and programmability.

So, generally, my question is: why would \u0000-\u007F be considered better syntax than InBasic_Latin?

In what language? Many languages, libraries, and programs have regex capabilities, and they differ in their syntax and capabilities. – John Bollinger, Feb 17 '15 at 18:18
For my case, I am using C#, but I was wondering more generally. The fact that some engines support blocks and some don't actually gets at the heart of my question. Thanks – getglad, Feb 17 '15 at 18:21
I'm not sure you've fully grasped my comment. My point is that your question doesn't make sense as posed, because regex is not a single thing that we can talk about in any detail, but rather a family of distinct things that differ not only in the area you want to discuss but in other areas, too. Some of them are not suitable for general Unicode text *at all*, so for them the answer is "neither approach is any good". There is no general answer. – John Bollinger, Feb 17 '15 at 18:57

score 1 · Accepted Answer · answered Feb 17 '15 at 18:18

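It depends on your regex engine. Some (like .NET, Java, and Perl) do support Unicode blocks, though the spelling varies: .NET uses the Is prefix shown here, while Java and Perl write \p{InBasicLatin}: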

if (Regex.IsMatch(subjectString, @"\p{IsBasicLatin}")) {
    // Successful match
}

Others (like JavaScript, PCRE, Python, Ruby, R, and most other engines) don't, so you need to spell out the code points manually or use an extension such as Steve Levithan's XRegExp library for JavaScript.
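To tie this back to the Regex.Replace call in the question, here is a minimal C# sketch that puts the two spellings side by side. It assumes .NET's block name IsBasicLatin, with \P (capital P) as the negated form; the class and method names (BasicLatinFilter, StripWithBlock, StripWithRange) are hypothetical and only there for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class BasicLatinFilter
{
    // Named block: \P{IsBasicLatin} matches any character outside
    // the Basic Latin block (U+0000 to U+007F) in .NET.
    private static readonly Regex OutsideBlock = new Regex(@"\P{IsBasicLatin}");

    // Explicit range: the spelling that also works in engines
    // without named blocks.
    private static readonly Regex OutsideRange = new Regex(@"[^\u0000-\u007F]");

    // Removes every character outside Basic Latin, using the block name.
    public static string StripWithBlock(string s)
    {
        return OutsideBlock.Replace(s, string.Empty);
    }

    // Removes every character outside Basic Latin, using the explicit range.
    public static string StripWithRange(string s)
    {
        return OutsideRange.Replace(s, string.Empty);
    }
}
```

Both methods should return the same result for any input; for example, passing "na\u00EFve caf\u00E9" to either one yields "nave caf". In .NET the choice between the two forms is therefore mostly about readability, while the explicit range is the one that ports to engines without block support.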

Tim Pietzcker

Note that in Ruby there is the [`unicode-block`](https://github.com/lpm11/unicode-block) library that allows you to use Unicode Blocks as constants, e.g. `"test" =~ UnicodeBlock::BASIC_LATIN_REGEXP`. Further, Ruby has support for different classes based on Unicode General Category as [described here](http://ruby-doc.org//core-2.2.0/Regexp.html#class-Regexp-label-Character+Properties). – Phrogz Feb 17 '15 at 18:37
