7

I asked a question about punctuation and regex before, but it was confusing.

Suppose I have this text:

String text = "wor.d1, :word2. wo,rd3? word4!"; 

I'm doing this:

String parts[] = text.split(" ");

And I get this:

wor.d1, | :word2. | wo,rd3? | word4!

What do I need to do to get this instead? (Split off the symbols at the borders of each word, but only the ones I specify, .,!?:, not all of them.)

wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !

UPDATE

I'm getting some good results with this regex, but it produces an empty string before every split on punctuation at the start of a word.

Is there a way to avoid this empty string at the start?

Is this regex good, or is there a simpler way?

public static final String PUNCTUATION_SEPARATOR =
        "("
        + "("
        + "(?=^[\"'!?.,;:(){}\\[\\]]+)"
        + "|"
        + "(?<=^[\"'!?.,;:(){}\\[\\]]+)"
        + ")"
        + "|"
        + "("
        + "(?=[\"'!?.,;:(){}\\[\\]]+($|\n))"
        + "|"
        + "(?<=[\"'!?.,;:(){}\\[\\]]+($|\n))"
        + ")"
        + ")";
Renato Dinhani
  • See this question: http://stackoverflow.com/questions/275768/is-there-a-way-to-split-strings-with-string-split-and-include-the-delimiters – JJ. Aug 19 '11 at 21:05

5 Answers

2

Are you sure you want to use regex? There's a faster implementation for splitting by a single char: StringTokenizer. It can also return the delimiters.

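import java.util.StringTokenizer;

String str = "word1, word2. word3? word4!";
String delim = ",.!?";
// The third argument (true) makes the tokenizer return the delimiters as tokens too.
StringTokenizer st = new StringTokenizer(str, delim, true);
while (st.hasMoreTokens()) {
  String token = st.nextToken();
  System.out.println(token); // token will be: "word1", ",", " word2", ".", etc.
}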
m_vitaly
  • This works, but I need a regex or something more complex, because I only want to split at the borders (start and end) of a word and not in the middle. – Renato Dinhani Aug 19 '11 at 21:30
  • I mean, if the symbol is in the middle of the string (a-b, 20.50), I don't want it split; only at the borders (test, [100, etc.). – Renato Dinhani Aug 19 '11 at 21:58
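The border-only requirement from these comments can also be met without lookarounds. The following is a minimal sketch, not code from any of the answers: it splits on whitespace first and then peels the specified symbols off the start and end of each word. The class name BorderPunctuationSplitter, the method split, and the constant PUNCTUATION are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on whitespace, then separates punctuation
// found at the start or end of each word. Symbols inside a word stay put.
final class BorderPunctuationSplitter {
    // Only these symbols are treated as separable (taken from the question).
    private static final String PUNCTUATION = ".,!?:";

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            int start = 0;
            int end = word.length();
            // Emit each leading punctuation character as its own token.
            while (start < end && PUNCTUATION.indexOf(word.charAt(start)) >= 0) {
                tokens.add(String.valueOf(word.charAt(start)));
                start++;
            }
            // Find where the trailing run of punctuation begins.
            int trailingStart = end;
            while (trailingStart > start
                    && PUNCTUATION.indexOf(word.charAt(trailingStart - 1)) >= 0) {
                trailingStart--;
            }
            // The core of the word, including any punctuation in the middle.
            if (trailingStart > start) {
                tokens.add(word.substring(start, trailingStart));
            }
            // Emit each trailing punctuation character as its own token.
            for (int i = trailingStart; i < end; i++) {
                tokens.add(String.valueOf(word.charAt(i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !
        System.out.println(String.join(" | ", split("wor.d1, :word2. wo,rd3? word4!")));
    }
}
```

Because the word is scanned from both ends, values such as a-b or 20.50 are left whole, and no empty tokens are produced for leading punctuation.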
1

For simple separators I recommend StringTokenizer. But here's a solution using regex and an auxiliary separator:

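import java.util.regex.Pattern;

String s  = "one,two, three   four ,  five";
// Wrap each run of commas/whitespace in '#' markers (assumes '#' never occurs in the input).
s = s.replaceAll("([,\\s]+)", "#$1#");
Pattern p = Pattern.compile("#");
String[] result = p.split(s); // words and separator runs as separate elements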
mradu
1

Here's a regex that I think will work:

/\s|(?=[\.,:?!](\W|$))|(?<=\W[\.:?!])/
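To try this pattern from Java, the surrounding slashes are dropped and the backslashes are doubled inside the string literal. The sketch below shows one way to apply it to the text from the question; the class name RegexBorderSplit and the method split are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern above, written as a Java string literal.
final class RegexBorderSplit {
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s|(?=[\\.,:?!](\\W|$))|(?<=\\W[\\.:?!])");

    static String[] split(String text) {
        // Splits on whitespace and on the zero-width positions next to
        // punctuation that sits at the border of a word.
        return SEPARATOR.split(text);
    }

    public static void main(String[] args) {
        // prints: wor.d1 | , | : | word2 | . | wo,rd3 | ? | word4 | !
        System.out.println(String.join(" | ", split("wor.d1, :word2. wo,rd3? word4!")));
    }
}
```

One limitation to be aware of: the lookbehind requires a non-word character before the symbol, so punctuation at the very start of the input (for example ":word") does not appear to be split off, and the comma is not included in the lookbehind's character class.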
Jacob Eggers
0

In my opinion, this is what you want: first explode the string, then use the implode function.

Yusuf ali
0
public static final String PUNCTUATION_SEPARATOR =
    "("
    + "("
    + "(?=^[\"'!?.,;:(){}\\[\\]-]+)"
    + "|"
    + "(?<=^[\"'!?.,;:(){}\\[\\]-]+)"
    + ")"
    + "|"
    + "("
    + "(?=[\"'!?.,;:(){}\\[\\]-]+($|\n))"
    + "|"
    + "(?<=[\"'!?.,;:(){}\\[\\]-]+($|\n))"
    + ")"
    + ")";
Renato Dinhani