Regex: Confusion about \b behavior

Question

I am writing my own grammar for the Atom editor. It uses regular expressions to identify code that should be highlighted.

I want to highlight hex numbers in two formats:

$affe     // hex number with dollar sign
0xeffa    // hex number with 0x

So I came up with this regex:

(\$|0x)[A-Fa-f0-9]+

Pretty straightforward, this works fine. The problem is that this will also highlight something like 0x0 in t0x0t. So I modified my regex to

\b(\$|0x)[A-Fa-f0-9]+\b

Now, this regex will only match 0xeffa but not $affe, or any other number prefixed with $ - why is that? I found this answer that seems to be a similar problem. I assume this is because $ is a non-word character. Is there a way to modify this regex so it matches both 0xeffa and $affe but not 0x0 in t0x0t?

One solution that I tried is to simply write two separate regular expressions, one for each case. It works, but it somehow seems to defeat the purpose of regular expressions.

`\b` marks the transition between a word character `[A-Za-z0-9_]` and a non-word character (or the start or end of the string). `$` is a non-word character, while `0` is a word character. So putting `\b` in front of `(\$|0x)` lets the `$` alternative match only if the `$` is preceded by a word character. - LukStorms, May 24 '18 at 19:01
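
The behavior described in this comment can be checked with a small sketch. It uses Python's `re` module as a stand-in, which is an assumption: Atom's grammar engine is a different regex implementation, although `\b` should behave the same way on these ASCII inputs. The helper name `first_match` is made up for illustration.

```python
import re

# Pattern from the question: \b sits in front of BOTH alternatives.
pattern = re.compile(r"\b(\$|0x)[A-Fa-f0-9]+\b")

def first_match(text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

# "0" is a word character, so \b matches at the start of the string.
print(first_match("0xeffa"))  # 0xeffa

# "$" is a non-word character, so there is no boundary in front of it.
print(first_match("$affe"))   # None

# A boundary only exists before "$" when a word character precedes it.
print(first_match("a$affe"))  # $affe

# No boundary between "t" and "0", so the embedded 0x0 is rejected.
print(first_match("t0x0t"))   # None
```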

score 1 · Accepted Answer · answered May 24 '18 at 19:11

You should change where \b applies, so that it only guards the 0x alternative:

(\$|\b0x)[A-Fa-f0-9]+\b

Otherwise, with \b preceding $, the engine expects a word character from the set [a-zA-Z0-9_] to appear right before the $, e.g. a$af00.
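
A quick way to try the corrected pattern is the sketch below. It again uses Python's `re` module, which is an assumption (Atom's grammar engine is a different regex implementation), and `find_hex` plus the sample strings are hypothetical names and inputs chosen for illustration.

```python
import re

# Corrected pattern: \b now guards only the 0x alternative.
fixed = re.compile(r"(\$|\b0x)[A-Fa-f0-9]+\b")

def find_hex(text):
    """Return every hex literal the corrected pattern finds in text."""
    return [m.group(0) for m in fixed.finditer(text)]

print(find_hex("lda $affe"))   # ['$affe']
print(find_hex("mov 0xeffa"))  # ['0xeffa']
print(find_hex("t0x0t"))       # []

# Note: "$" no longer requires anything in front of it, so a "$" glued
# to a preceding word character is matched as well.
print(find_hex("a$af00"))      # ['$af00']
```

Whether the last case should be highlighted depends on the language the grammar targets; the pattern itself does not exclude it.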

revo