Regexp word boundaries in non-ASCII situations

Question

I have a regular expression in my PHP script like this:

/(\b$term|$term\b)(?!([^<]+)?>)/iu

This matches the word contained in $term, as long as there's a word boundary before or after it and it's not inside an HTML tag.

However, this doesn't work in non-ASCII cases, for example with Russian text. Is there a way to make it work?

I can get almost as good a result with

/(\s$term|$term\s)(?!([^<]+)?>)/iu

but this is obviously more limited, and since this regexp is for highlighting search terms, it has the problem of including the space in the highlight.

I've read this StackOverflow question about the problem, but it doesn't help: the solution there doesn't work correctly for my case. In that example the captures are the other way around (it captures the text outside the search term, whereas I need to capture the search term itself).

Any way to make this work? Thanks!

score 2 · Accepted Answer · answered Apr 14 '11 at 17:40

You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters?
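
A minimal PHP sketch of that idea, assuming the text and the term are valid UTF-8. The function name `highlight_term` and the `<mark>` wrapper are hypothetical, and "word character" is taken here to mean a Unicode letter, digit, or underscore. It keeps the question's "boundary before or after" rule and its skip-inside-tags lookahead:

```php
<?php
// Hypothetical helper (not from the original post): wraps each hit in <mark>.
function highlight_term(string $text, string $term): string
{
    // Escape regex metacharacters in the search term, including the delimiter.
    $t = preg_quote($term, '/');

    // Assumption: a "word character" is a Unicode letter, digit, or underscore.
    $w = '[\p{L}\p{N}_]';

    // Either no word character directly before the term, or none directly after.
    // (?![^<]*>) plays the role of the question's (?!([^<]+)?>):
    // it rejects hits that sit inside an HTML tag.
    $pattern = '/((?<!' . $w . ')' . $t . '|' . $t . '(?!' . $w . '))(?![^<]*>)/iu';

    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($pattern, '<mark>$1</mark>', $text) ?? $text;
}
```

Because the boundary test is spelled out with `\p{...}` properties, it should not depend on whether `\b` itself is Unicode-aware in a given PHP build. To require a boundary on both sides instead, the two assertions can be placed around a single copy of the term.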

answered Apr 14 '11 at 17:40

Amber

Thanks. I ended up with this: `/(?<=[\s.,;:])($term)(?!([^<]+)?>)/iu`. It works pretty well. But how do I do the other side? I tried `(?<=[\s.,;:])($term|$term)(?=[\s.,;:])(?!([^<]+)?>)/iu`, but that doesn't work. Well, this isn't that critical - I'm not even sure it's a good idea to match terms that appear at the end of words. – Mikko Saari Apr 15 '11 at 03:54

score 0 · Answer 2 · answered Apr 14 '11 at 18:46

The \b is certainly defined to work perfectly well on Unicode, as is required by UTS#18. What are you saying it is not doing? What are the exact text strings involved?

answered Apr 14 '11 at 18:46

tchrist

The code is from a WordPress search plugin. It'll go through post content and should pick up all the occurrences of the search term where the search term is not completely inside another word (and not inside an HTML tag). – Mikko Saari Apr 15 '11 at 03:36
From this text, searching for "програ" does not match correctly with "\bпрогра|програ\b". програ Во предлагаю электронной там. Стал лучше платформу мы там, руки принять нью по, работе мешают дни за.програ Спольски программы безусловно их без. Три может обычно бы, больше разные вы где. Две то буду чёртов фактически, работать преодолеть по кто. Том внешних закончить безответственный ты. Кремнияпрогра электпрограронной не всю, том до дурак команды. Об тд ваших программировать, но нас интервью процессорах. – Mikko Saari Apr 16 '11 at 05:49
@Mikko, I believe I know what’s going on. I just tried that using Perl, which should be the same as your preg match in PHP. Here’s the deal: **if and only if** you store the string and pattern as UTF-8, it correctly matches, but if those literals are considered bytes instead of characters, that same pattern fails to match. In Perl, you just have to say `use utf8;` at the top of your program, and then all the string ops, including matching, work fine with those UTF-8 literals. (We don’t have two flavors of ops.) But if you *don’t* do that, it “mysteriously” fails. Could that be your problem? – tchrist Apr 16 '11 at 06:35
@Mikko: it looks like you have to use the `/u` in PHP to make it realize that it is dealing with Unicode. Ouch. – tchrist Apr 16 '11 at 23:10
Yes, and as you can see from my original regexp, I'm using /u - but it doesn't help. "Even in UTF-8 mode, standard class shorthands like \w and \b are not Unicode-aware.", says Alan Moore in [this StackOverflow question](http://stackoverflow.com/questions/2432868/php-regex-word-boundary-matching-in-utf-8/2449017#2449017). – Mikko Saari Apr 17 '11 at 03:39
@Mikko: Well, that’s utterly broken then. It works perfectly well in Perl. This is by explicit design, and because doing so is mandatory for standards compliance. Somehow somebody screwed up the Perl => PCRE => PHP technology transfer, and in a bad way, since now it no longer meets the formal requirements of [UTS#18 Unicode Regular Expressions](http://unicode.org/reports/tr18/#Compatibility_Properties). What is it with all these myopic monoglots and all their archaic ASCII provincialisms when it comes to Unicode and regexes? So many languages screw this up. This **REALLY** needs to be fixed. – tchrist Apr 17 '11 at 05:25
This is now working for me with the /u modifier. Maybe PHP has changed or fixed something since 2011 when this was asked? – joachim Sep 18 '15 at 16:09
@joachim, yes, there was a bug in older versions of PHP when using \b with Unicode characters. The bug was fixed a few years ago. – kojow7 Jun 29 '17 at 18:22
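
Since the comments above disagree depending on the PHP version, a quick runtime check may help. This is only a sketch: the function name `unicode_word_boundary_works` is hypothetical, and the probe strings are taken from the sample text quoted earlier in the thread.

```php
<?php
// Hypothetical probe (not from the thread): reports whether \b treats
// Cyrillic letters as word characters when the /u modifier is used.
function unicode_word_boundary_works(): bool
{
    // "." followed by a letter should be a word boundary.
    $atBoundary = preg_match('/\bпрогра/u', 'за.програ') === 1;

    // Two adjacent letters should not be a word boundary.
    $insideWord = preg_match('/\bпрогра/u', 'Кремнияпрогра') === 0;

    return $atBoundary && $insideWord;
}

var_dump(unicode_word_boundary_works());
```

If this prints `bool(false)`, `\b` is presumably still ASCII-only in that build, and explicit lookarounds on `\p{L}` are the safer route.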