
I am fairly new to regular expressions and I am stuck on a problem. I have trouble understanding what is going on, and I hope someone can point me in the right direction.

What I am trying to achieve:

To avoid duplicates in the view, I want to check whether an attribute name already contains the respective attribute unit. For example, if $attribute['name'] = "Cutting speed (in m/Min.)" and $attribute['unit'] = "m/min", the attribute unit should not be displayed, as it is already mentioned in the name.

How I am trying to achieve this:

I am checking for the attribute unit using the following regular expression: '~\b' . $attribute['unit'] . '\b~i'. This works well for the example above, but not if the unit is a special character, such as % or ".

The Problems

While testing the special-character issue I came across the following phenomenon:

If I use the regex /\b%\b/, it does not behave as I expected: it matches the % in bla%bla, but not a % that is preceded or followed by a space: https://regex101.com/r/56iYEI/3

It seems like the % turns the behavior of the regex to its opposite. I tested with other "special characters" as well (" and &), and they seem to have the same effect.

I was directed to this question (Regular Expression Word Boundary and Special Characters) before and read the answers. I now understand that \b checks for word boundaries. But it is still unclear to me why it behaves the way it does as soon as a % or " turns up.

The questions

  1. Why does a % seem to invert the word-boundary check done by \b?
  2. How can I match alphanumeric units as well as special-character units, like % or "?

Looking forward to any hints. Thanks in advance!

asti.v
  • A word boundary is a place where a word character is next to a non-word character. Since space and `%` are both non-word characters, there's no word boundary between them. – Barmar Feb 10 '20 at 10:47
  • To find a standalone %, use `\B%\B` – hosein in jast Feb 10 '20 at 10:48
  • @hoseininjast But that won't work for the earlier case. He's looking for a general solution for any type of unit. – Barmar Feb 10 '20 at 10:49
  • @Barmar jeeez, I can't believe I didn't see this before... "Since space and % are both non-word characters, there's no word boundary between them." This little sentence just clicked so much xD thanks!!! – asti.v Feb 10 '20 at 10:50
  • I'm not sure this can be solved easily with a regexp. It's closing in on natural language processing. – Barmar Feb 10 '20 at 10:51
  • @hoseininjast thanks for your suggestion, I had this option in mind as well, but how would I express this in a pattern? Something like `if special character (% or " or &), do \B%\B else do \bunit\b` – asti.v Feb 10 '20 at 10:57
  • @Barmar ah okay, I see... – asti.v Feb 10 '20 at 10:59

1 Answer

A word boundary (\b) is a position between a word character and a non-word character, or between a word character and the start or end of the string. The non-word character doesn't have to be a space.

 foo"@#bar {}qux

In this string the word boundaries are before and after foo, bar, and qux.

The expression /\b"@#\b/ will match the characters between foo and bar. However, /\b"@\b/ will not, because the @ is followed by #, another non-word character, so there is no word boundary after the @.
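The behavior described above can be checked with a small script. This is only a sketch: `hasMatch()` is a hypothetical helper introduced for illustration, and the subject strings are made-up test inputs.

```php
<?php
// Hypothetical helper: true when $pattern matches somewhere in $subject.
function hasMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// [pattern, subject] pairs; single quotes keep the backslashes intact.
$cases = [
    ['/\b%\b/', 'bla%bla'],            // word chars on both sides of %
    ['/\b%\b/', 'bla % bla'],          // spaces on both sides of %
    ['/\B%\B/', 'bla % bla'],          // \B is the opposite of \b
    ['/\b"@#\b/', 'foo"@#bar {}qux'],  // boundary after foo and before bar
    ['/\b"@\b/', 'foo"@#bar {}qux'],   // @ is followed by #, no boundary
];

foreach ($cases as [$pattern, $subject]) {
    $result = hasMatch($pattern, $subject) ? 'match' : 'no match';
    printf("%-12s %-18s %s\n", $pattern, $subject, $result);
}
```

The second and fifth cases report no match, because a non-word character next to another non-word character never forms a word boundary.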


To solve this, accept either a word boundary or a non-word character (or the start or end of the string) on each side. The following expression covers both cases: /(^|\W|\b)"@($|\W|\b)/.

'~(^|\W|\b)' . $attribute['unit'] . '($|\W|\b)~i'

P.S. If $attribute['unit'] can contain arbitrary characters, be sure to escape it with preg_quote() before using it in the regex, passing the delimiter ~ as the second argument.
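Putting the pattern and the preg_quote() advice together could look roughly like this. It is a minimal sketch, not a definitive implementation: `nameContainsUnit()` is a hypothetical function name, the sample attributes are invented, and the empty-unit guard is an added assumption.

```php
<?php
// Hypothetical helper: true when $unit already appears in $name as a
// standalone token (case-insensitive).
function nameContainsUnit(string $name, string $unit): bool
{
    if ($unit === '') {
        return false; // assumption: an empty unit is never "contained"
    }
    // Escape regex metacharacters and the "~" delimiter inside the unit.
    $quoted = preg_quote($unit, '~');
    $pattern = '~(^|\W|\b)' . $quoted . '($|\W|\b)~i';

    return preg_match($pattern, $name) === 1;
}

// Made-up sample attributes for illustration.
$attributes = [
    ['name' => 'Cutting speed (in m/Min.)', 'unit' => 'm/min'],
    ['name' => 'Humidity in %', 'unit' => '%'],
    ['name' => 'Minimum length', 'unit' => 'min'],
];

foreach ($attributes as $attribute) {
    $show = !nameContainsUnit($attribute['name'], $attribute['unit']);
    echo $attribute['name'], ': ', $show ? 'show unit' : 'hide unit', "\n";
}
```

In this sketch the first two units are hidden, while "min" is still shown for "Minimum length", because the "Min" at the start of "Minimum" is followed by another word character.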

Arnold Daniels
  • Thanks so much for the detailed explanation as well as the solution. It did the trick in my case! – asti.v Feb 11 '20 at 09:16