Regular Expression that matches attribute units in attribute names including special characters

Question

I am fairly new to regular expressions and I am stuck on a problem that I am trying to solve. I have trouble understanding what's going on and I hope that someone can point me in the right direction.

What I am trying to achieve:

To avoid duplicates in the view, I want to check if an attribute name contains the respective attribute unit. For example, if $attribute['name'] = "Cutting speed (in m/Min.)" and $attribute['unit'] = "m/min", the attribute unit should not be displayed, as it is already mentioned in the name.

How I am trying to achieve this:

I am checking for the attribute unit by using the following regular expression: '~\b' . $attribute['unit'] . '\b~i'. This works well for the example above, but not if the unit is a special character, such as % or ".

The Problems

While testing for the special character issue I came across the following phenomenon:

If I use the regex /\b%\b/, it does not behave as I expected: it matches the % in bla%bla, but not a % that is preceded or followed by a space: https://regex101.com/r/56iYEI/3

It seems like the % turns the behavior of the regex to its opposite. I tested with other "special characters" as well (" and &), and they seem to have the same effect.

I was directed to this question (Regular Expression Word Boundary and Special Characters) before and read the answers. I now understand that \b checks for word boundaries. But it is still unclear to me why it behaves the way it does as soon as a % or " turns up.

The questions

Why does a % reverse the word-boundary check done by \b?
How can I achieve my goal to match for alphanumeric units as well as for special character units, like % or "?

Looking forward to any hints. Thanks in advance!

A word boundary is a place where a word character is next to a non-word character. Since space and `%` are both non-word characters, there's no word boundary between them. — Barmar, Feb 10 '20 at 10:47
@hoseininjast But that won't work for the earlier case. He's looking for a general solution for any type of unit. — Barmar, Feb 10 '20 at 10:49
@Barmar jeeez, I can't believe I didn't see this before... "Since space and % are both non-word characters, there's no word boundary between them." This little sentence just clicked so much xD thanks!!! — asti.v, Feb 10 '20 at 10:50
I'm not sure this can be solved easily with a regexp. It's closing in on natural language processing. — Barmar, Feb 10 '20 at 10:51
@hoseininjast thanks for your suggestion, I had this option in mind as well, but how would I express this in a pattern? Something like `if special character (% or " or &), do \B%\B else do \bunit\b` — asti.v, Feb 10 '20 at 10:57

Arnold Daniels · Accepted Answer · 2020-02-11T03:41:29.530

A word break (\b) is a position between a word character and a non-word character, or between a word character and the start or end of the string. The non-word character doesn't have to be a space.

 foo"@#bar {}qux

In this string, the word breaks are before and after foo, bar, and qux.

The expression /\b"@#\b/ will match the characters between foo and bar. However, /\b"@\b/ will not, because there is no word character (and thus no word break) after the @.
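
The behavior described above can be checked with a few preg_match() calls. This is only a small sketch: the first two patterns and the string come from this answer, the /\b%\b/ pattern and bla%bla come from the question, and the 'bla % bla' string is made up here to stand in for "a % surrounded by spaces".

```php
<?php
// Demonstration string from the answer above.
$subject = 'foo"@#bar {}qux';

// Word breaks exist right after "foo" and right before "bar",
// so \b is satisfied on both sides of "@#.
$both = preg_match('/\b"@#\b/', $subject);    // 1

// "@" and "#" are both non-word characters, so there is no
// word break after the "@" and the second \b fails.
$partial = preg_match('/\b"@\b/', $subject);  // 0

// The case from the question: "%" between letters has a word
// break on each side; "%" between spaces has none.
$inside = preg_match('/\b%\b/', 'bla%bla');   // 1
$spaced = preg_match('/\b%\b/', 'bla % bla'); // 0 (illustrative string)

var_dump($both, $partial, $inside, $spaced);
```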

To solve this, accept either a word break or a non-word character on each side of the unit. The following expression handles both cases: /(^|\W|\b)"@($|\W|\b)/.

'~(^|\W|\b)' . $attribute['unit'] . '($|\W|\b)~i'

P.S. If $attribute['unit'] can contain arbitrary characters, be sure to escape it with preg_quote() before using it in the regex.
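
Putting the pattern and the preg_quote() advice together, a minimal sketch could look like the following. The function name nameContainsUnit() is hypothetical, and all sample strings except the "Cutting speed" one from the question are invented for illustration. Passing '~' as the second argument to preg_quote() also escapes the delimiter used here, in case a unit ever contains it.

```php
<?php
/**
 * Hypothetical helper (the name is illustrative, not from the question):
 * returns true if $unit already appears in $name as a standalone token.
 */
function nameContainsUnit(string $name, string $unit): bool
{
    if ($unit === '') {
        return false; // an empty unit would otherwise match everywhere
    }

    // Escape regex metacharacters and the "~" delimiter in the unit.
    $quoted = preg_quote($unit, '~');

    // Accept a word break or a non-word character on either side.
    $pattern = '~(^|\W|\b)' . $quoted . '($|\W|\b)~i';

    return preg_match($pattern, $name) === 1;
}

var_dump(nameContainsUnit('Cutting speed (in m/Min.)', 'm/min')); // true
var_dump(nameContainsUnit('Fill level in %', '%'));               // true
var_dump(nameContainsUnit('Screen size (")', '"'));               // true
var_dump(nameContainsUnit('Speed in km/min', 'm/min'));           // false
```

The last call shows that alphanumeric units still respect word breaks: m/min is not reported inside km/min, because there is neither a word break nor a non-word character between the k and the m.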

Thanks so much for the detailed explanation as well as the solution. It did the trick in my case! — asti.v, Feb 11 '20 at 09:16
