0

Looking for some regex help. I need a way in Java to split input text into words while also keeping the delimiters (whitespace, punctuation). Put another way, each word should get its own array element, and the non-word characters should go into elements of their own.

This input text:

"Hello, this isn't working!"

Should be put into an array like this:

{"Hello", ",", "this", "isn't", "working", "!"}

or

{"Hello", ", ", "this", " ", "isn't", " ", "working", "!"}

I've done basically the same thing in Python using this:

import re
def split_input(string):
    return re.findall(r"[\w']+|[\s.,!?;:-]", string)

But I've yet to find a way to accomplish the same thing in Java. I've tried String.split() with lookahead/lookbehind and I've tried pattern matchers but haven't had much luck.

Any help would be much appreciated!

Brigham
kin3tik

4 Answers

5

String.split is not the Java analog of Python's re.findall. Matcher.find is.

Pattern stuff = Pattern.compile("[\\w']+|[\\s.,!?;:-]");
Matcher matcher = stuff.matcher("Hello, this isn't working!");
List<String> matchList = new ArrayList<String>();
while (matcher.find()) {
    matchList.add(matcher.group(0)); // add match to the list
}
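
For reference, here is the same find loop as a self-contained sketch. The class name Tokenizer, the method tokenize, and the keepWhitespace flag are hypothetical names chosen for illustration, not part of any library. The second pattern is an assumption about how to get the first desired output from the question: it leaves `\s` out of the delimiter class, so whitespace is simply never matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokenizer {
    // A run of word characters/apostrophes, or one delimiter character (whitespace included).
    private static final Pattern WITH_SPACES = Pattern.compile("[\\w']+|[\\s.,!?;:-]");
    // Same pattern without \s in the delimiter class, so whitespace is skipped (assumed variant).
    private static final Pattern WITHOUT_SPACES = Pattern.compile("[\\w']+|[.,!?;:-]");

    public static List<String> tokenize(String input, boolean keepWhitespace) {
        Pattern pattern = keepWhitespace ? WITH_SPACES : WITHOUT_SPACES;
        Matcher matcher = pattern.matcher(input);
        List<String> tokens = new ArrayList<>();
        // find() advances to the next match each time, like iterating over re.findall
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, this isn't working!";
        System.out.println(tokenize(text, false)); // punctuation kept, whitespace dropped
        System.out.println(tokenize(text, true));  // every delimiter character kept
    }
}
```

With keepWhitespace set to false, the sample sentence yields the six elements "Hello", ",", "this", "isn't", "working", "!". As with findall, characters that match neither alternative (for example "@") are silently dropped.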
Brigham
  • Ah, I did have a go using Matcher but didn't get too far. This seems to do the job quite well though, thank you! – kin3tik Apr 08 '13 at 13:09
1

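Try this. It splits after every whitespace character, so each delimiter stays attached to the word before it:

public static void main(String[] args) {
    String str = "Hello, this isn't working!";
    // (?<=\\s) is a zero-width lookbehind: split at each position preceded by whitespace
    String[] s = str.split("(?<=\\s)");
    System.out.println(Arrays.toString(s));
}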

Output:

[Hello, , this , isn't , working!]
Achintya Jha
0

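Putting aside your strange example, here is something that should suit your needs (yet to be tested):

"(?<=[\\s.,!?;:-])|(?=[\\s.,!?;:-])"

For the first version, with each delimiter character as its own element.

"(?<=[\\w'])(?=[\\s.,!?;:-])|(?<=[\\s.,!?;:-])(?=[\\w'])"

To keep several delimiters as a whole.

The idea is that, since you want to split but keep all the characters, the pattern should match positions only. A lookahead on the word class alone would also match between the letters of a word, so the boundaries are described with lookarounds on the delimiter class.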
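
A runnable sketch of the position-only idea, for illustration. It assumes the split points are described with a lookbehind and a lookahead on the delimiter class, since a lookahead on `[\w']+` by itself would also match between the letters of a word. The class name LookaroundSplit, the method splitKeeping, and both constants are hypothetical.

```java
import java.util.Arrays;

class LookaroundSplit {
    // Zero-width position directly after or directly before any single delimiter character.
    static final String EACH_DELIMITER = "(?<=[\\s.,!?;:-])|(?=[\\s.,!?;:-])";
    // Zero-width position only where a word run meets a delimiter run, so runs stay whole.
    static final String DELIMITER_RUNS =
            "(?<=[\\w'])(?=[\\s.,!?;:-])|(?<=[\\s.,!?;:-])(?=[\\w'])";

    public static String[] splitKeeping(String input, String boundaryRegex) {
        // The regex consumes no characters, so every character ends up in some element.
        return input.split(boundaryRegex);
    }

    public static void main(String[] args) {
        String text = "Hello, this isn't working!";
        System.out.println(Arrays.toString(splitKeeping(text, EACH_DELIMITER)));
        System.out.println(Arrays.toString(splitKeeping(text, DELIMITER_RUNS)));
    }
}
```

For the sample sentence, DELIMITER_RUNS gives "Hello", ", ", "this", " ", "isn't", " ", "working", "!", which is the second form asked for in the question. Unlike the find-based approach, characters outside both classes are not dropped; they stay attached to whatever is next to them.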

Loamhoof
0

Maybe not the best way to do this, but you can try:

String marked = string.replaceAll("([\\s.,!?;:-])", "$1\n"); // strings are immutable, so keep the result
String[] parts = marked.split("\n");
zakinster