11

For the sake of this question, let's assume I have a String containing the values Two;.Three;.Four (and so on), where the elements are separated by the two-character delimiter ;. (a semicolon followed by a period).

Now I know there are multiple ways of splitting a string, such as split() and StringTokenizer (the faster of the two, and it works well), but my input file is around 1GB and I am looking for something slightly more efficient than StringTokenizer.

After some research, I found that indexOf and substring are quite efficient, but the examples I found either use a single-character delimiter or return only a single word/element.

Sample code using indexOf and substring:

String s = "quick,brown,fox,jumps,over,the,lazy,dog";
int from = s.indexOf(',');
int to = s.indexOf(',', from+1);
String brown = s.substring(from+1, to);

The above works for printing brown, but how can I use indexOf and substring to split a line on a multi-character delimiter and display all the items, as below?

Expected output

Two
Three
Four
....and so on
Patrick
user92038111111

3 Answers

7

This is the method I use for splitting large (1GB+) tab-separated files. It is limited to a char delimiter to avoid the overhead of additional method invocations (which may be optimized out by the runtime anyway), but it can easily be converted to take a String delimiter. I'd be interested if anyone can come up with a faster method or improvements on this one.

public static String[] split(final String line, final char delimiter)
{
    // Worst case: every character is a delimiter, which yields
    // line.length() + 1 (empty) tokens, so size the scratch array for that.
    String[] temp = new String[line.length() + 1];
    int wordCount = 0;
    int i = 0;                          // start of the current token
    int j = line.indexOf(delimiter, 0); // first delimiter

    while (j >= 0)
    {
        temp[wordCount++] = line.substring(i, j);
        i = j + 1;                      // skip the delimiter
        j = line.indexOf(delimiter, i); // next delimiter
    }

    temp[wordCount++] = line.substring(i); // last token

    // Trim the scratch array down to the number of tokens actually found.
    String[] result = new String[wordCount];
    System.arraycopy(temp, 0, result, 0, wordCount);

    return result;
}
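The answer notes that the method can be converted to take a String delimiter. A minimal sketch of that conversion follows, applied to the question's ;. delimiter. The class name DelimiterSplit and the use of an ArrayList instead of a pre-sized array are assumptions made for illustration, not part of the original answer, and the sketch has not been benchmarked against the char version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is an assumption for this sketch.
final class DelimiterSplit {

    /** Splits line on every occurrence of a (possibly multi-character) delimiter. */
    static String[] split(final String line, final String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        final List<String> parts = new ArrayList<>();
        final int step = delimiter.length();
        int i = 0;                          // start of the current token
        int j = line.indexOf(delimiter, i); // start of the next delimiter

        while (j >= 0) {
            parts.add(line.substring(i, j));
            i = j + step;                   // skip over the whole delimiter
            j = line.indexOf(delimiter, i);
        }
        parts.add(line.substring(i));       // last token (may be empty)

        return parts.toArray(new String[0]);
    }
}
```

For example, DelimiterSplit.split("Two;.Three;.Four", ";.") returns the three elements Two, Three and Four. The only change from the char version is that the search resumes delimiter.length() characters after each match instead of one.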
Parker
  • You can further improve this by obtaining all the indexes at once, as indexOf loops through the String – Sport Feb 19 '21 at 13:46
  • @Sport Inside the loop, I start each search after the index of the previous occurrence (`line.indexOf(delimiter, i)`), so each character is only checked once. I could probably write an inline version of `indexOf(char, int)` to avoid the overhead of repeated method invocation. – Parker Feb 19 '21 at 14:23
5

If you want the ultimate in efficiency, I wouldn't use Strings at all, let alone split them. I would do what compilers do: process the file a character at a time. Use a BufferedReader with a large buffer, say 128 KB, and read one char at a time, accumulating the chars into, say, a StringBuilder until you reach a ; or a line terminator.
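A rough sketch of this character-at-a-time approach is shown below. It is adapted to the question's two-character ;. delimiter, which is an assumption on top of this answer (the answer itself only mentions ; and line terminators). The class name CharScanner, the Consumer callback, and the decision to skip empty tokens are all illustrative choices, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

// Hypothetical scanner; names and callback style are assumptions for this sketch.
final class CharScanner {

    /** Reads source one char at a time and hands each token to onToken. */
    static void scan(final Reader source, final Consumer<String> onToken) throws IOException {
        // Buffer size is given in chars; 128 * 1024 follows the answer's suggestion.
        final BufferedReader in = new BufferedReader(source, 128 * 1024);
        final StringBuilder token = new StringBuilder();
        boolean pendingSemicolon = false; // saw ';', waiting to see whether '.' follows
        int c;

        while ((c = in.read()) != -1) {
            if (pendingSemicolon) {
                pendingSemicolon = false;
                if (c == '.') {           // ";." completes a delimiter
                    flush(token, onToken);
                    continue;
                }
                token.append(';');        // a lone ';' is ordinary data
            }
            if (c == ';') {
                pendingSemicolon = true;
            } else if (c == '\n' || c == '\r') {
                flush(token, onToken);    // a line terminator also ends a token
            } else {
                token.append((char) c);
            }
        }
        if (pendingSemicolon) {
            token.append(';');            // input ended right after a lone ';'
        }
        flush(token, onToken);
    }

    // Emits the accumulated token, skipping empty ones (an assumption of this sketch).
    private static void flush(final StringBuilder token, final Consumer<String> onToken) {
        if (token.length() > 0) {
            onToken.accept(token.toString());
            token.setLength(0);
        }
    }
}
```

Calling CharScanner.scan(new FileReader(path), System.out::println), where path is a placeholder for the input file, would print one element per line. Only one token is held in memory at a time, so a whole line never has to be materialized as a String.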

user207421
4

StringTokenizer is faster than StringBuilder.

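import java.util.StringTokenizer;

// Wrapper class added so the snippet compiles on its own; the name is arbitrary.
public class StringTokenizerDemo {

    public static void main(String[] args) {

        String str = "This is String , split by StringTokenizer, created by me";

        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer st = new StringTokenizer(str);

        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }

        System.out.println("---- Split by comma ',' ------");
        // Each character of the second argument is treated as a separate delimiter.
        StringTokenizer st2 = new StringTokenizer(str, ",");

        while (st2.hasMoreTokens()) {
            System.out.println(st2.nextToken());
        }
    }
}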
user92038111111
  • According to the [JDK Docs](https://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html), `StringTokenizer` has been considered a legacy class for a while now. The recommendation is to use `String.split` or something from the `java.util.regex` package. – Yonathan W'Gebriel Jun 01 '21 at 00:46
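For comparison, the `java.util.regex` route mentioned in the comment above could look like the sketch below for the question's ;. delimiter. The class name RegexSplit is an assumption for illustration. The pattern is compiled once and the delimiter is quoted so that the period is matched literally rather than as a regex wildcard. No claim is made here about how it performs against the other approaches on this page.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this sketch.
final class RegexSplit {

    // Compile once and reuse; Pattern.quote makes ";." a literal, not a regex.
    private static final Pattern DELIMITER = Pattern.compile(Pattern.quote(";."));

    /** Splits line on the literal two-character delimiter ";.". */
    static String[] split(final String line) {
        // A negative limit keeps trailing empty strings instead of dropping them.
        return DELIMITER.split(line, -1);
    }
}
```

For example, RegexSplit.split("Two;.Three;.Four") returns the elements Two, Three and Four.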