I want to split a String at the word boundaries using Scanner. Normally, this would be done like this:

Scanner scanner = new Scanner(...).useDelimiter("\\b");

The problem is that my definition of "word" character is a tiny bit different from the standard [a-zA-Z_0-9] as I want to include some more characters and exclude the _: [a-zA-Z0-9#/]. Therefore, I can't use the \b pattern.

So I tried to do the same thing using look-ahead and look-behind, but what I came up with didn't work:

(<?=[A-Za-z0-9#/])(?![A-Za-z0-9#/])|(<?![A-Za-z0-9#/])(?=[A-Za-z0-9#/])

The scanner doesn't split anywhere using this.

Is it possible to do this using look-ahead and look-behind, and if so, how?

rolve
  • Just a minor point, but your "standard" definition of `\b` is also wrong. – Mark Byers Oct 22 '12 at 14:07
  • I didn't give one, but I assume it is something like `(?<=\w)(?!\w)|(?<!\w)(?=\w)`. – rolve Oct 22 '12 at 14:26
  • That's how it's *supposed* to be defined, and if you use Java 7 and its new [UNICODE_CHARACTER_CLASS](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS) mode, it is. But Java's legacy `\b` is a bit more...creative. See [this question](http://stackoverflow.com/q/4304928/20938) for details, especially @tchrist's answer. – Alan Moore Oct 22 '12 at 15:36

3 Answers

3

There's an error in your look-behind syntax. The ? comes first, before the <, so it is (?<= and (?<! rather than (<?= and (<?!:

(?<=[A-Za-z0-9#/])(?![A-Za-z0-9#/])|(?<![A-Za-z0-9#/])(?=[A-Za-z0-9#/])
 ^^                                  ^^
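
A minimal sketch of how the corrected pattern could be plugged into Scanner. The class name, the `tokens` helper and the sample input are made up for illustration and are not part of the original question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class BoundaryScannerDemo {
    // Custom "word" characters from the question: letters, digits, '#' and '/'.
    static final String WORD = "[A-Za-z0-9#/]";

    // Zero-width boundary: a word character behind and none ahead,
    // or no word character behind and one ahead.
    static final String BOUNDARY =
        "(?<=" + WORD + ")(?!" + WORD + ")|(?<!" + WORD + ")(?=" + WORD + ")";

    // Hypothetical helper: collects every token the Scanner produces.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(BOUNDARY)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Brackets make whitespace-only tokens visible.
        for (String token : tokens("a#b, c_d")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Because the delimiter is zero-width, nothing is consumed: the runs of non-word characters come back as tokens of their own, just as they would with `\b`.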
Andrew Cheong
1
new Scanner(...).useDelimiter(
  "(?<=[a-zA-Z0-9#/])(?=[^a-zA-Z0-9#/])|(?<=[^a-zA-Z0-9#/])(?=[a-zA-Z0-9#/])");
Ωmega
  • No, that requires a character in front and a character behind, so it won't match a word boundary at the beginning or end of the string. The OP has the right formula, he just made a slight error with the syntax. – Alan Moore Oct 22 '12 at 14:33
  • He's trying to create an equivalent for `\b` that conforms to his definition of word characters. His corrected regex works exactly the same as yours when they're used with Scanner's `useDelimiter()` method--which, I admit, I hadn't realized when I wrote my comment. But I think my point is still valid: your answer may solve his problem, but it doesn't answer his question. – Alan Moore Oct 22 '12 at 15:16
  • @AlanMoore - Read his question again - it says: `I want to split a string...` – Ωmega Oct 22 '12 at 15:25
  • Okay, the question he *should* have been asking. :P The problem with his own solution was in the syntax, not the semantics. – Alan Moore Oct 22 '12 at 15:41
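
The difference discussed in the comments above can be checked with a small sketch. It assumes a sample input of `"ab cd"`; the class and helper names are invented for illustration. Counting raw regex matches shows that only the negative-lookaround version matches at the start and end of the input, while the Scanner tokens come out the same for both:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryComparisonDemo {
    // Corrected pattern from the question: also matches at the very start and end.
    static final String NEGATIVE_LOOKAROUND =
        "(?<=[A-Za-z0-9#/])(?![A-Za-z0-9#/])|(?<![A-Za-z0-9#/])(?=[A-Za-z0-9#/])";

    // Pattern from this answer: needs an actual character on both sides.
    static final String POSITIVE_LOOKAROUND =
        "(?<=[a-zA-Z0-9#/])(?=[^a-zA-Z0-9#/])|(?<=[^a-zA-Z0-9#/])(?=[a-zA-Z0-9#/])";

    // Counts how many (zero-width) positions the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Collects the tokens a Scanner yields with the given delimiter.
    static List<String> tokens(String regex, String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(regex)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "ab cd";
        System.out.println(countMatches(NEGATIVE_LOOKAROUND, input));
        System.out.println(countMatches(POSITIVE_LOOKAROUND, input));
        System.out.println(tokens(NEGATIVE_LOOKAROUND, input)
            .equals(tokens(POSITIVE_LOOKAROUND, input)));
    }
}
```

Scanner skips a leading delimiter and returns the remaining input as the last token, which is why the missing matches at the two ends make no difference for splitting.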
0

What is wrong with using this as the delimiter:

[^A-Za-z0-9#/]+

In other words, the delimiter is any run of one or more characters that are not in your word set.

Or, if you need to keep the spaces:

[^A-Za-z0-9#/ ]+

You can then strip the spaces out, or handle them specially, after the scanner has produced the tokens (if needed).
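
A rough sketch of both variants, assuming a sample input of `"a#b, c_d"`. The class and helper names are hypothetical. Note that, unlike a zero-width boundary, this delimiter consumes the non-word characters, so they do not show up as tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NonWordDelimiterDemo {
    // Delimiter: any run of characters outside the custom word set.
    static final String NON_WORD = "[^A-Za-z0-9#/]+";

    // Same, but spaces are not treated as delimiters, so they stay in the tokens.
    static final String NON_WORD_KEEP_SPACES = "[^A-Za-z0-9#/ ]+";

    // Hypothetical helper: collects the tokens a Scanner yields with the given delimiter.
    static List<String> tokens(String delimiter, String input) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter(delimiter)) {
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "a#b, c_d";
        // Brackets make any leading space inside a token visible.
        for (String token : tokens(NON_WORD, input)) {
            System.out.println("[" + token + "]");
        }
        for (String token : tokens(NON_WORD_KEEP_SPACES, input)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With the second delimiter the space after the comma stays attached to the following token, so a `trim()` or similar step would be needed afterwards.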

Stephen Connolly