How Can I Use Look-Ahead and Look-Behind to Create a Custom Boundary Matcher?

Question

I want to split a String at the word boundaries using Scanner. Normally, this would be done like this:

Scanner scanner = new Scanner(...).useDelimiter("\\b");

The problem is that my definition of a "word" character differs slightly from the standard [a-zA-Z_0-9]: I want to include a few more characters and exclude the underscore, which gives [a-zA-Z0-9#/]. Therefore, I can't use the \b pattern.

So I tried to do the same thing using look-ahead and look-behind, but what I came up with didn't work:

(<?=[A-Za-z0-9#/])(?![A-Za-z0-9#/])|(<?![A-Za-z0-9#/])(?=[A-Za-z0-9#/])

The scanner doesn't split anywhere using this.

Is it possible to do this using look-ahead and look-behind, and if so, how?

Just a minor point, but your "standard" definition of `\b` is also wrong. — Mark Byers, Oct 22 '12 at 14:07
I didn't give one, but I assume it is something like `(?<=\w)(?!\w)|(?<!\w)(?=\w)`. — rolve, Oct 22 '12 at 14:26
That's how it's *supposed* to be defined, and if you use Java 7 and its new [UNICODE_CHARACTER_CLASS](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS) mode, it is. But Java's legacy `\b` is a bit more...creative. See [this question](http://stackoverflow.com/q/4304928/20938) for details, especially @tchrist's answer. — Alan Moore, Oct 22 '12 at 15:36

score 3 · Accepted Answer · answered Oct 22 '12 at 14:11

There's an error in your syntax. In a look-behind, the ? comes first, before the <: write (?<= and (?<! rather than (<?= and (<?!.

(?<=[A-Za-z0-9#/])(?![A-Za-z0-9#/])|(?<![A-Za-z0-9#/])(?=[A-Za-z0-9#/])
 ^^                                  ^^
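
As a minimal sketch of the corrected pattern in use, the following feeds it to Scanner's useDelimiter() and collects the tokens. The class name `CustomBoundaryDemo`, the `tokens` helper, and the sample input are illustrative assumptions, not part of the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CustomBoundaryDemo {
    // Word characters as defined in the question: letters, digits, '#' and '/'.
    static final String WORD = "[A-Za-z0-9#/]";

    // Custom counterpart of \b: a zero-width position that has a word
    // character on exactly one side.
    static final String BOUNDARY =
        "(?<=" + WORD + ")(?!" + WORD + ")|(?<!" + WORD + ")(?=" + WORD + ")";

    // Hypothetical helper: splits the input at every custom boundary.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(BOUNDARY)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // '#' and '/' stay inside words, '_' becomes a token of its own.
        for (String token : tokens("foo#bar a_b/c")) {
            System.out.println("\"" + token + "\"");
        }
    }
}
```

For the sample input this should yield the tokens foo#bar, a single space, a, _ and b/c, so the separators between words are kept as tokens rather than swallowed.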

answered Oct 22 '12 at 14:11

Andrew Cheong

score 1 · Answer 2 · answered Oct 22 '12 at 14:17

new Scanner(...).useDelimiter(
  "(?<=[a-zA-Z0-9#/])(?=[^a-zA-Z0-9#/])|(?<=[^a-zA-Z0-9#/])(?=[a-zA-Z0-9#/])");

answered Oct 22 '12 at 14:17

Ωmega

No, that requires a character in front and a character behind, so it won't match a word boundary at the beginning or end of the string. The OP has the right formula, he just made a slight error with the syntax. – Alan Moore Oct 22 '12 at 14:33
He's trying to create an equivalent for `\b` that conforms to his definition of word characters. His corrected regex works exactly the same as yours when they're used with Scanner's `useDelimiter()` method--which, I admit, I hadn't realized when I wrote my comment. But I think my point is still valid: your answer may solve his problem, but it doesn't answer his question. – Alan Moore Oct 22 '12 at 15:16
@AlanMoore - Read his question again - it says: `I want to split a string...` – Ωmega Oct 22 '12 at 15:25
Okay, the question he *should* have been asking. :P The problem with his own solution was in the syntax, not the semantics. – Alan Moore Oct 22 '12 at 15:41
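
The difference discussed in the comments above can be checked directly with Pattern and Matcher. This is only a sketch: the class name `BoundaryPositions`, the constant names, and the sample input "ab cd" are assumptions made for illustration. It lists the index of every zero-width match for both patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Corrected pattern from the question: uses negative look-arounds,
    // which also succeed when there is no character at all on that side.
    static final String NEGATIVE_LOOKAROUND =
        "(?<=[A-Za-z0-9#/])(?![A-Za-z0-9#/])|(?<![A-Za-z0-9#/])(?=[A-Za-z0-9#/])";

    // Pattern from Answer 2: uses negated character classes, which need
    // an actual character on both sides of the position.
    static final String NEGATED_CLASS =
        "(?<=[a-zA-Z0-9#/])(?=[^a-zA-Z0-9#/])|(?<=[^a-zA-Z0-9#/])(?=[a-zA-Z0-9#/])";

    // Returns the start index of every match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            result.add(matcher.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions(NEGATIVE_LOOKAROUND, "ab cd")); // [0, 2, 3, 5]
        System.out.println(positions(NEGATED_CLASS, "ab cd"));       // [2, 3]
    }
}
```

The first pattern matches at the start and end of the string as well as around the space, while the second matches only around the space. For splitting with Scanner that difference does not change the tokens, which is the point conceded in the comments.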

score 0 · Answer 3 · answered Oct 22 '12 at 14:10

What is wrong with:

[^A-Za-z0-9#/]+

In other words, any run of one or more characters that are not in your word set.

Or, if you need the spaces:

[^A-Za-z0-9#/ ]+

Then strip the spaces out for special processing after the scanner (if needed).
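
A rough sketch of how these two delimiters behave with Scanner follows. The class name `NegatedSetDelimiterDemo`, the `tokens` helper, and the sample input are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NegatedSetDelimiterDemo {
    // Collects every token the Scanner produces for the given delimiter regex.
    static List<String> tokens(String input, String delimiter) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(delimiter)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "foo#bar a_b/c"; // hypothetical sample input

        // Every non-word run is a delimiter, so the space and '_' are discarded.
        System.out.println(tokens(input, "[^A-Za-z0-9#/]+"));

        // The space is excluded from the delimiter set, so it stays inside
        // the token and has to be separated afterwards.
        System.out.println(tokens(input, "[^A-Za-z0-9#/ ]+"));
    }
}
```

With the first delimiter the tokens should be foo#bar, a and b/c. With the second, the space survives but only as part of the token "foo#bar a", so it still needs post-processing to become a token of its own.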

answered Oct 22 '12 at 14:10

Stephen Connolly

I need the spaces between the words as well. The scanner would swallow them using your regex. – rolve Oct 22 '12 at 14:12
I think OP wants spaces as separate "words"/tokens or whatever we will call it :) – Pshemo Oct 22 '12 at 14:16
