Background

I am writing a DFA-based regex parser. For performance reasons, I need to use a dictionary [Unicode.Scalar : State] to map each input scalar to the next state. Now I need a handful of special Unicode values to represent special character expressions such as ., \w, and \d.

My Question

Which Unicode values are safe to use for this purpose?

I was using U+0000 for ., but I need more now. I checked the Unicode documentation, and the noncharacters seem promising, but in Swift those are considered invalid Unicode. For example, the following code gives me the compiler error "Invalid unicode scalar".

let c = "\u{FDD0}"
dawnstar
  • Is it a regex parser or does it convert regular expressions to DFAs? Representing `\w` as a codepoint is silly anyway – what happens when you get to `[a-zA-Z0-9_]`? Couldn’t represent every possible character class as a distinct codepoint even if you wanted to (but you wouldn’t want to because it’d be bad design). – Ry- Aug 03 '17 at 00:41
  • @Ryan It is a DFA based regex parser, which means it converts regex syntax into DFA, and use that to do matching work. Range expression will be enumerated, i.e. `[0-9]` -> `[0123456789]`. My original implementation was to pre-process `\w` into `[a-zA-Z0-9_-]`, but that means I will have to generate 64 states in my DFA for `\w`. Which leads me to the idea of using a codepoint (only 1 state) instead. – dawnstar Aug 03 '17 at 04:58

1 Answer

If you insist on using Unicode.Scalar, none of them. Unicode.Scalar is designed to represent every valid Unicode character, including code points that are not yet assigned. It therefore cannot represent noncharacters or dangling surrogates ("\u{DC00}" also causes an error).

In Swift, a String can contain any valid Unicode.Scalar, including U+0000. For example, "\u{0000}" (== "\0") is a valid String, and its length (count) is 1. If you use U+0000 as a meta-character, your code will not work correctly with valid Swift Strings that contain it.
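
A quick way to check both points, as a minimal sketch: the failable `Unicode.Scalar(UInt32)` initializer returns nil for a surrogate value and for anything above 0x10FFFF, while U+0000 behaves like any other scalar inside a String. The variable names are only illustrative.

```swift
// Surrogate code points and values above 0x10FFFF are not Unicode scalar
// values, so the failable initializer returns nil for both.
let surrogate = Unicode.Scalar(0xDC00 as UInt32)
let tooLarge = Unicode.Scalar(0x110000 as UInt32)

// U+0000 is an ordinary scalar and can appear in real input.
let text = "a\u{0000}b"
let characterCount = text.count
let containsNull = text.unicodeScalars.contains { $0.value == 0 }
```

Because `containsNull` can be true for legitimate input, a DFA that reserves U+0000 for `.` cannot tell a literal NUL from the wildcard.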


Consider using [UInt32: State] instead of [Unicode.Scalar: State]. Unicode only uses the range 0x0000...0x10FFFF (noncharacters included), so any value greater than 0x10FFFF is safe in the sense you mean: it can never collide with a scalar from real input.

Reading the value property of a Unicode.Scalar is very cheap, and its cost can be ignored in optimized code. I'm not sure a Dictionary is really the best way to handle your requirements, but [UInt32: State] should be as efficient as [Unicode.Scalar: State].
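
Here is a rough sketch of how the [UInt32: State] idea could look. The `State` class, the `Meta` constants, the ASCII-only class checks, and the lookup order (literal key first, then `\d`, `\w`, `.`) are all assumptions made for illustration; the question does not show the real types.

```swift
// Hypothetical state type; the real one is not shown in the question.
final class State {
    var transitions: [UInt32: State] = [:]
    var isAccepting = false

    // Look up the literal scalar first, then fall back to class sentinels.
    func next(for scalar: Unicode.Scalar) -> State? {
        if let state = transitions[scalar.value] { return state }
        if isASCIIDigit(scalar), let state = transitions[Meta.digit] { return state }
        if isASCIIWord(scalar), let state = transitions[Meta.wordCharacter] { return state }
        return transitions[Meta.anyCharacter]
    }
}

// Sentinel keys above 0x10FFFF, so they cannot equal any scalar's value.
// The specific numbers are arbitrary assumptions.
enum Meta {
    static let anyCharacter: UInt32 = 0x110000   // .
    static let wordCharacter: UInt32 = 0x110001  // \w
    static let digit: UInt32 = 0x110002          // \d
}

// ASCII-only definitions of \d and \w, assumed here for simplicity.
func isASCIIDigit(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.value >= 0x30 && scalar.value <= 0x39
}

func isASCIIWord(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x30...0x39, 0x41...0x5A, 0x61...0x7A, 0x5F: return true
    default: return false
    }
}

// Walk the automaton one scalar at a time.
func matches(_ input: String, from start: State) -> Bool {
    var current = start
    for scalar in input.unicodeScalars {
        guard let next = current.next(for: scalar) else { return false }
        current = next
    }
    return current.isAccepting
}

// Usage: a tiny automaton for the pattern \d\w.
let start = State()
let middle = State()
let end = State()
end.isAccepting = true
start.transitions[Meta.digit] = middle
middle.transitions[Meta.wordCharacter] = end

let accepted = matches("7_", from: start)
let rejected = matches("x7", from: start)
```

With this layout `\w` costs a single dictionary entry per state rather than one entry per member character, and a literal U+0000 in the input stays an ordinary key.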

OOPer
  • This is exactly what I needed to know. Thanks. BTW, what do you think is a better solution than `Dictionary` for my scenario? – dawnstar Aug 03 '17 at 05:01
  • @dawnstar, that depends on your implementation details. _Not sure_ does not mean _I have a better idea_. Hope you can find the best way. (It may be the Dictionary you described.) Good luck. – OOPer Aug 03 '17 at 06:06