I am writing my own grammar for the Atom editor. It uses regular expressions to identify code that should be highlighted.

I want to highlight hex numbers in two formats:

$affe     // hex number with dollar sign
0xeffa    // hex number with 0x

So I came up with this regex:

(\$|0x)[A-Fa-f0-9]+

This is straightforward and works fine. The problem is that it also highlights something like 0x0 in t0x0t. So I changed my regex to

\b(\$|0x)[A-Fa-f0-9]+\b

Now this regex matches only 0xeffa, not $affe or any other number prefixed with $. Why is that? I found this answer, which seems to describe a similar problem. I assume this happens because $ is a non-word character. Is there a way to modify the regex so that it matches both 0xeffa and $affe, but not 0x0 in t0x0t?

One solution I tried is to write two separate regular expressions, one for each case. It works, but it seems to defeat the purpose of regular expressions.

koalag
  • Put `\b` inside `(\$|\b0x)[A-Fa-f0-9]+\b` – revo May 24 '18 at 18:49
  • `\b` marks the transition between a word character `[A-Za-z0-9_]` and a non-word character. But `$` is a non-word character, while `0` is a word character. So putting it in front of `(\$|0x)` means the `$` only matches if it is preceded by a word character. – LukStorms May 24 '18 at 19:01

1 Answer

Move \b so that it applies only to the 0x alternative:

(\$|\b0x)[A-Fa-f0-9]+\b

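Otherwise, with \b preceding $, the engine requires a word character from the set [a-zA-Z0-9_] immediately before the $, as in a$af00. Since $ is itself a non-word character, \b in front of it can only succeed when a word character comes right before it.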
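To see the difference between the two placements of \b, here is a minimal sketch in Python. Atom grammars use a different regex engine, so using Python's `re` module as a stand-in is an assumption; \b should behave the same way for these ASCII inputs. The names `boundary_outside`, `boundary_inside`, `matches` and `sample` are made up for this illustration.

```python
import re

HEX_DIGITS = r"[A-Fa-f0-9]+"

# Pattern from the question: \b sits in front of the whole group.
boundary_outside = re.compile(r"\b(\$|0x)" + HEX_DIGITS + r"\b")

# Pattern from this answer: \b only guards the 0x alternative.
boundary_inside = re.compile(r"(\$|\b0x)" + HEX_DIGITS + r"\b")


def matches(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical sample line covering all cases discussed above.
sample = "$affe 0xeffa t0x0t a$af00"

print(matches(boundary_outside, sample))  # ['0xeffa', '$af00']
print(matches(boundary_inside, sample))   # ['$affe', '0xeffa', '$af00']
```

With \b outside the group, $affe at the start of the line is skipped while $af00 in a$af00 is found, because only there does a word character precede the $. With \b inside, both $ numbers match and 0x0 in t0x0t is still rejected. Note that this pattern also accepts $af00 when it directly follows a word character, as in a$af00.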

revo