2

I have seen that the syntax for passing multiple delimiters (e.g. '.', '?', '!') to the StringTokenizer constructor is:

StringTokenizer obj=new StringTokenizer(str,".?!");

What I don't understand is this: I have enclosed all the delimiters together in double quotes, so doesn't that make them a single String rather than individual characters? How does the StringTokenizer class identify them as separate characters? Why is ".?!" not treated as a single delimiter?

Andrew Tobilko

4 Answers

3

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code.

So forget about it.

It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

So use String#split instead.

String[] elements = str.split("\\.\\?!"); // treats ".?!" as a single delimiter
String[] elements2 = str.split("[.?!]"); // three delimiters 
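To see the difference between the two patterns, here is a small runnable sketch. The class name, helper names, and sample string are made up for illustration. Note that, unlike StringTokenizer, split keeps the empty strings found between adjacent delimiters (only trailing empty strings are dropped).

```java
import java.util.Arrays;

class SplitDelimiterDemo {
    // Splits only where the exact sequence ".?!" occurs.
    static String[] splitOnSequence(String s) {
        return s.split("\\.\\?!");
    }

    // Splits on any single '.', '?' or '!' character.
    static String[] splitOnAnyOf(String s) {
        return s.split("[.?!]");
    }

    public static void main(String[] args) {
        String str = "one.?!two.three"; // hypothetical sample input
        System.out.println(Arrays.toString(splitOnSequence(str))); // [one, two.three]
        System.out.println(Arrays.toString(splitOnAnyOf(str)));    // [one, , , two, three]
    }
}
```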

If you miss StringTokenizer's Enumeration nature, get an Iterator.

Iterator<String> iterator = Arrays.asList(elements).iterator();
while (iterator.hasNext()) {
  String next = iterator.next();
  // ...
}

How does the StringTokenizer class identify them as separate characters?

It's an implementation detail and it shouldn't be your concern. There are a couple of ways to do that; the JDK implementation uses String#charAt(int) and String#codePointAt(int).

Why is ".?!" not treated as a single delimiter?

That's the choice they've made: "We will take a String and we will be looking for delimiters there." The Javadoc makes it clear.

 *
 * @param   str            a string to be parsed.
 * @param   delim          the delimiters.
 * @param   returnDelims   flag indicating whether to return the delimiters
 *                         as tokens.
 * @exception NullPointerException if str is <CODE>null</CODE>
 */
public StringTokenizer(String str, String delim, boolean returnDelims) {
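As a rough illustration of the returnDelims flag documented above, the sketch below collects the tokens with the flag off and on. The class name, the tokens helper, and the sample string are only assumptions for the demo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ReturnDelimsDemo {
    // Collects every token produced by the three-argument constructor.
    static List<String> tokens(String str, String delim, boolean returnDelims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(str, delim, returnDelims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each character of ".?!" is a delimiter on its own.
        System.out.println(tokens("Hi.Why?Go!", ".?!", false)); // [Hi, Why, Go]
        // With returnDelims = true, each delimiter comes back as a one-character token.
        System.out.println(tokens("Hi.Why?Go!", ".?!", true));  // [Hi, ., Why, ?, Go, !]
    }
}
```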
Andrew Tobilko
1

That's just how StringTokenizer is defined. Just take a look at the Javadoc:

Constructs a string tokenizer for the specified string. All characters in the delim argument are the delimiters for separating tokens.

Also, in the source code you will find the delimiterCodePoints field, described as follows:

/**
 * When hasSurrogates is true, delimiters are converted to code
 * points and isDelimiter(int) is used to determine if the given
 * codepoint is a delimiter.
 */
private int[] delimiterCodePoints;

So basically, each delimiter character is converted to its int code point and stored in the array (as the comment says, this happens when the delimiter string contains surrogates). The array is then used to decide whether a given character is a delimiter or not.

m.antkowicz
  • I see..so can delimiters be Strings rather than single characters? –  Sep 05 '19 at 16:34
  • then you need to use something else than StringTokenizer - take a look at this topic: https://stackoverflow.com/questions/12215598/equivalent-to-stringtokenizer-with-multiple-characters-delimiters – m.antkowicz Sep 05 '19 at 16:37
  • I get it, so StringTokenizer only recognizes characters as delimiters. Actually I could have used the Split[] function, but our school syllabus restricts its use. Anyway, that cleared my confusion. –  Sep 05 '19 at 16:44
0

It's true that you pass a single string rather than individual characters, but what is done with that string is up to the StringTokenizer. The StringTokenizer takes each character from your delimiter string and uses each one as a delimiter. This way, you can split the string on multiple different delimiters without having to run the tokenizer more than once.
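A minimal sketch of that behaviour, using the ".?!" delimiter string from the question. The class name, the sentences helper, and the sample text are invented for the example, and the trim() call is just there to tidy up the leading spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class MultiDelimiterDemo {
    // Breaks text into pieces at every '.', '?' or '!' in a single pass.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, ".?!");
        while (st.hasMoreTokens()) {
            out.add(st.nextToken().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Is it late? Yes. Go home!")); // [Is it late, Yes, Go home]
        // Runs of delimiters are skipped, so no empty tokens appear.
        System.out.println(sentences("Wait...what?!"));             // [Wait, what]
    }
}
```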

The documentation for this constructor states:

The characters in the delim argument are the delimiters for separating tokens.

If you don't pass anything in for this parameter, it defaults to " \t\n\r\f", which is basically just whitespace.
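For example, the one-argument constructor can be tried out like this. It is only a sketch; the class name, helper, and input string are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DefaultDelimsDemo {
    // Uses the one-argument constructor, i.e. the default delimiter set " \t\n\r\f".
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Space, tab and newline all act as delimiters here.
        System.out.println(words("alpha beta\tgamma\ndelta")); // [alpha, beta, gamma, delta]
    }
}
```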

Holden Lewis
0

How does the StringTokenizer class identify them as separate characters?

There are methods in String called charAt and codePointAt, which return the character or code point at an index:

"abc".charAt(0) // 'a'

The StringTokenizer implementation uses both of these methods on the delimiter string at some point. In my version of the JDK, the code points of the delimiter string are extracted and added to an array, delimiterCodePoints, in a method called setMaxDelimCodePoint, which is called by the constructor:

private void setMaxDelimCodePoint() { // ...

    if (hasSurrogates) {
        delimiterCodePoints = new int[count];
        for (int i = 0, j = 0; i < count; i++, j += Character.charCount(c)) {
            c = delimiters.codePointAt(j); // <--- notice this line
            delimiterCodePoints[i] = c;
        }
    }
}

And then this array is accessed in the isDelimiter method, which decides whether a character is a delimiter:

private boolean isDelimiter(int codePoint) {
    for (int i = 0; i < delimiterCodePoints.length; i++) {
        if (delimiterCodePoints[i] == codePoint) {
            return true;
        }
    }
    return false;
}

Of course, this is not the only way that the API could be designed. The constructor could have accepted an array of char as delimiters instead, but I am not qualified to say why the designers did it this way.

Why is ".?!" not treated as a single delimiter?

StringTokenizer only supports single-character delimiters. If you want a string as a delimiter, you can use Scanner or String.split instead. For both of these, the delimiter is represented as a regular expression, so you have to use "\\.\\?!" instead. You can learn more about regular expressions here
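Here is a rough sketch of the Scanner variant. The class name, the tokens helper, and the sample input are assumptions for illustration. Pattern.quote is shown as an alternative to escaping the delimiter by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerDelimiterDemo {
    // Tokenizes the input with a Scanner whose delimiter is the given regex.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input).useDelimiter(delimiterRegex)) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "one.?!two.three"; // hypothetical sample input
        // Only the full sequence ".?!" separates tokens; the lone '.' does not.
        System.out.println(tokens(input, "\\.\\?!"));             // [one, two.three]
        // Pattern.quote builds a regex that matches the literal string ".?!".
        System.out.println(tokens(input, Pattern.quote(".?!")));  // [one, two.three]
    }
}
```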

Sweeper