1

I need to split a string (in Java) into individual words ... but I need to preserve spaces.

An example of the text I need to split is something like this:
ABC . . . . DEF . . . . GHI

I need to see "ABC", " . . . .", "DEF", ". . . .", and "GHI".

Obviously, splitting on whitespace (\s) isn't going to work, as the spaces are consumed as delimiters and dropped from the result.

Any suggestions?

Thanks

David G
  • Why do you want ". . . . " and not ". ", ". ", ". ", ". "? You only want to split on space sometimes? What are the rules exactly? – Mark Byers Jun 02 '10 at 19:55
  • Actually, that would have been fine also ... I just needed the spaces preserved. – David G Jun 03 '10 at 13:24

2 Answers

5

Looks like you can just split on \b in this case ("\\b" as a string literal).

Generally, to keep every character of the input, you want to split on zero-width matching constructs. \b is one of them, and lookarounds can be used as well.
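As a rough sketch of the \b suggestion applied to the string from the question (the class and method names here are made up for illustration, and the output assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string):

```java
import java.util.Arrays;

public class WordBoundarySplit {
    // Hypothetical helper: splits at every word boundary.
    // \b matches between characters, so nothing is consumed and all spaces survive.
    static String[] splitOnWordBoundaries(String text) {
        return text.split("\\b");
    }

    public static void main(String[] args) {
        String[] parts = splitOnWordBoundaries("ABC . . . . DEF . . . . GHI");
        System.out.println(Arrays.toString(parts));
        // On Java 8+ prints "[ABC,  . . . . , DEF,  . . . . , GHI]"
    }
}
```

Note that the dots and spaces stay together as one token, because there is no word boundary between two non-word characters.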

Splitting based on a custom word boundary

If \b doesn't fit your definition of a word boundary, you can always define your own boundaries using assertions.

For example, the following regex splits on the boundary between a character class X and its complement:

(?=[X])(?<=[^X])|(?=[^X])(?<=[X])

In the following example, we define X to be \d:

    System.out.println(java.util.Arrays.toString(
        "007james123bond".split(
            "(?=[X])(?<=[^X])|(?=[^X])(?<=[X])".replace("X", "\\d")
        )
    )); // prints "[007, james, 123, bond]"

Here's another example where X is a-z$:

    System.out.println(java.util.Arrays.toString(
        "$dollar . . blah-blah   $more gimme".split(
            "(?=[X])(?<=[^X])|(?=[^X])(?<=[X])".replace("X", "a-z$")
        )
    )); // prints "[$dollar,  . . , blah, -, blah,    , $more,  , gimme]"
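If the goal is to treat anything that isn't whitespace as part of a word (so that a sequence like `&N` stays in one piece), one possible variation is to let X be \s. This is only a sketch under that assumed rule; the class and method names are hypothetical:

```java
import java.util.Arrays;

public class WhitespaceBoundarySplit {
    // Same boundary pattern as above, with X replaced by \s.
    // Assumption: a "word" is any run of non-whitespace characters.
    static final String BOUNDARY =
        "(?=[X])(?<=[^X])|(?=[^X])(?<=[X])".replace("X", "\\s");

    // Splits between runs of whitespace and runs of non-whitespace, dropping nothing.
    static String[] splitKeepingSpaces(String text) {
        return text.split(BOUNDARY);
    }

    public static void main(String[] args) {
        for (String part : splitKeepingSpaces("ABC  &N . DEF")) {
            System.out.println("\"" + part + "\""); // quotes make the spaces visible
        }
    }
}
```

For the input above this yields seven tokens: "ABC", two spaces, "&N", one space, ".", one space, and "DEF".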
polygenelubricants
  • Tried that ... but some of the characters in my string are 'non-word' characters (like `&`), so they get eaten up. I need a character sequence such as `&N` to be considered a word. – David G Jun 02 '10 at 19:57
  • @david: assertions/lookarounds is most definitely the answer here; you need to address Mark Byers' comment. What are the rules exactly? – polygenelubricants Jun 02 '10 at 19:59
  • @david: If you have the time, write up your own answer. I'd like to see what you're trying to do and how you've accomplished it. You may even get {Self Learner}. – polygenelubricants Jun 02 '10 at 20:07
1

Thanks guys, that gave me the lead I needed ... I'm using (?<=[\\s]) and it works exactly the way I want!
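For anyone curious what that lookbehind does, here is a minimal sketch (the class and method names are invented for the example). It splits after every whitespace character, so each piece keeps its trailing space:

```java
import java.util.Arrays;

public class LookbehindSplit {
    // Hypothetical helper: splits at every position preceded by a whitespace character.
    // The lookbehind is zero-width, so the whitespace itself is kept in the preceding piece.
    static String[] splitAfterWhitespace(String text) {
        return text.split("(?<=\\s)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAfterWhitespace("A &N B")));
        // prints "[A , &N , B]"
    }
}
```

With the string from the question, this gives "ABC ", four pieces of ". ", "DEF ", four more pieces of ". ", and "GHI".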

David G