2

Is it possible to build a regexp for use with Java's Pattern.split(..) method that reproduces the StringTokenizer("...", "...", true) behaviour?

That is, the input should be split into an alternating sequence of the predefined token characters and any arbitrary strings running between them.
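
For concreteness, here is a minimal sketch of the behaviour to be reproduced. The class name `ReturnDelimsDemo` and the helper `tokenize` are made up for illustration; only `StringTokenizer` itself is from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ReturnDelimsDemo {
    // Hypothetical helper: collects everything StringTokenizer yields
    // when its third argument (returnDelims) is true.
    static List<String> tokenize(String input, String delimiterChars) {
        StringTokenizer st = new StringTokenizer(input, delimiterChars, true);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each delimiter character comes back as its own one-character token.
        for (String token : tokenize("a b\tc", " \t")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For the input `"a b\tc"` with delimiters space and tab this yields five tokens: `a`, a space, `b`, a tab, and `c`.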

The JRE reference for StringTokenizer states that its use is discouraged in new code and that String.split(..) could be used instead. So it is apparently considered possible there.

The reason I want to use split is that regular expressions are often highly optimized. StringTokenizer, for example, is quite slow on the Android platform's VM, while regex patterns seem to be executed by optimized native code there.

dronus
  • possible duplicate of [Is there a way to split strings with String.split() and include the delimiters?](http://stackoverflow.com/questions/275768/is-there-a-way-to-split-strings-with-string-split-and-include-the-delimiters) – CoolBeans May 08 '11 at 18:56
  • There is an uncommented "Code Challenge" with the same idea, but no answer, it seems. I do not want to include the delimiters, but fetch them as distinct tokens. – dronus May 08 '11 at 19:04
  • Maybe there should be a "I am pedantic, answer question exactly as asked" flag :-) – dronus May 08 '11 at 20:15

3 Answers

1

Considering that the documentation for split doesn't specify this behavior, and that it has only one optional parameter, which limits how large the array can be: no, you can't.

Also, looking at the only other class I can think of that could have this feature, a Scanner, it doesn't either. So I think the easiest would be to continue using the tokenizer, even if it's deprecated. That is better than writing your own class: while that shouldn't be too hard (quite trivial really), I can think of better ways to spend one's time.

Voo
  • But `String.split()` takes an arbitrary regular expression, and it is not obvious to me why it should not be possible with a smart expression? – dronus May 08 '11 at 19:12
  • +1 for recommending to use the proper tool for the job. StringTokenizer is not deprecated and does exactly what you want. Don't force String.split(...) to attempt to do something it wasn't designed for. Even if you can get it to work, nobody will actually understand the regex used. Keep it simple. Did you look at the link provided by CoolBeans above? The code is horrendous to try and do something that is easily done by the StringTokenizer. – camickr May 08 '11 at 19:12
  • Currently I would like to use `Pattern.split(..)` on the Android platform, as the VM is rather slow there and the implementation of `StringTokenizer` is not very efficient. On the other hand, regexes are implemented natively on the platform and are quite fast, and so is `Pattern.split(..)`. – dronus May 08 '11 at 19:17
1

A regex Pattern with a Matcher can help you:

Pattern p = Pattern.compile("(.*?)(\\s+)");
// put the delimiter regex in the second group (where the \\s+ is now);
// it must not match the empty string, otherwise the loop never advances
Matcher m = p.matcher(string);
int endindex = 0;
while (m.find()) {
    // m.group(1) is the text before the delimiter (may be empty)
    // m.group(2) is the delimiter that was matched
    endindex = m.end();
}
// the remainder of the string is string.substring(endindex)
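
The loop above can be fleshed out into a self-contained sketch. The class name `MatcherTokenizer` and the method `tokenize` are hypothetical, and it assumes that the delimiter regex never matches the empty string. Empty text between two adjacent delimiters is skipped, to mirror what `StringTokenizer` does with `returnDelims` set to true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherTokenizer {
    // Assumption: delimiterRegex never matches the empty string.
    // Note that DOTALL applies to the whole pattern, including delimiterRegex.
    static List<String> tokenize(String input, String delimiterRegex) {
        Pattern p = Pattern.compile("(.*?)(" + delimiterRegex + ")", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        int end = 0;
        while (m.find()) {
            if (!m.group(1).isEmpty()) {
                out.add(m.group(1));   // text before the delimiter
            }
            out.add(m.group(2));       // the delimiter itself
            end = m.end();
        }
        if (end < input.length()) {
            out.add(input.substring(end)); // text after the last delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a b\tc", "[ \t]")) {
            System.out.println("[" + token + "]");
        }
    }
}
```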
ratchet freak
1
import java.util.List;
import java.util.LinkedList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Splitter {


public Splitter(String s, String delimiters) {
    this.string = s;
    this.delimiters = delimiters;
    Pattern pattern = Pattern.compile(delimiters);
    this.matcher = pattern.matcher(string);
}

public String[] split() {
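    // limit -1 keeps trailing empty strings, so strs.length == delims.length + 1
    // also holds when the input ends with a delimiter
    String[] strs = string.split(delimiters, -1);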
    String[] delims = delimiters();
    if (strs.length == 0) { return new String[0];}
    assert(strs.length == delims.length + 1);
    List<String> output = new LinkedList<String>();
    int i;
    for(i = 0;i < delims.length;i++) {
        output.add(strs[i]);
        output.add(delims[i]);
    }
    output.add(strs[i]);
    return output.toArray(new String[0]);
}

private String[] delimiters() {
    List<String> delims = new LinkedList<String>();
    while(matcher.find()) {
        delims.add(string.subSequence(matcher.start(), matcher.end()).toString());
    }
    return delims.toArray(new String[0]);
}

public static void main(String[] args) {
    Splitter s = new Splitter("a b\tc", "[ \t]");
    String[] tokensanddelims = s.split();
    assert(tokensanddelims.length == 5);
    System.out.print(tokensanddelims[0].equals("a"));
    System.out.print(tokensanddelims[1].equals(" "));
    System.out.print(tokensanddelims[2].equals("b"));
    System.out.print(tokensanddelims[3].equals("\t"));
    System.out.print(tokensanddelims[4].equals("c"));
}


private Matcher matcher;
private String string;
private String delimiters;
}
dronus
Lyn Headley
  • Well, seems cool. However, it separates tokens from delimiters, which is not needed in my case. I would like to replace the `StringTokenizer`'s behaviour with an alternating delimiter / token sequence output. – dronus May 08 '11 at 20:07
  • I added the missing `import` statement. Works fine. It doesn't replace `StringTokenizer` with something more performant, however. I had hoped that a single RegExp for use with `split` could do the job, as a single RegExp is handled natively and fast on the Android platform. – dronus May 09 '11 at 21:39
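
One candidate for such a single expression, not proposed in the answers above, is a zero-width lookaround pattern: split at every position that is directly before or directly after a delimiter character. The sketch below assumes the delimiters are supplied as a regex character class such as `[ \t]`; the names `LookaroundSplit` and `splitKeepingDelimiters` are made up for illustration. Whether this is actually faster than `StringTokenizer` on Android would have to be measured, and edge cases (empty input, or a leading delimiter on runtimes older than Java 8) may differ from `StringTokenizer`.

```java
import java.util.regex.Pattern;

class LookaroundSplit {
    // Assumption: delimiterClass is a regex character class, e.g. "[ \t]".
    static String[] splitKeepingDelimiters(String input, String delimiterClass) {
        // Zero-width match right after a delimiter OR right before one,
        // so nothing is consumed and delimiters survive as separate items.
        Pattern p = Pattern.compile(
                "(?<=" + delimiterClass + ")|(?=" + delimiterClass + ")");
        return p.split(input);
    }

    public static void main(String[] args) {
        for (String token : splitKeepingDelimiters("a b\tc", "[ \t]")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

In real use the compiled `Pattern` would be kept in a constant rather than rebuilt on every call.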