0

I'm looking for an efficient way to obtain a list of String tokens extracted from multiple Strings (e.g. with a whitespace separator).

Example:

String s1 = "My mom cook everyday";
String s2 = "I eat everyday";
String s3 = "Am I fat?";  
LinkedList<String> tokens = new LinkedList<String>();   
//any code to efficiently get the tokens

//final result is tokens  make of a list of the following tokens:
//"My", "mom", "cook", "everyday", "I", "eat", "everyday", "Am", "I", "fat?".

Now

  1. I'm not sure that LinkedList is the most effective collection class to be used (Apache Commons, Guava, may they help?)!
  2. I was going to use StringUtils from Apache Commons, but the split method returns an array! So, I should extract with a for cycle the Strings from the array of String objects returned by split. Is that efficient: I don't know, split creates an array!
  3. I read about Splitter from Guava, but this post states that StringUtils is better in practice.
  4. What about Scanner from Java.util. It seems to not allocate any additional data structures. Isn't it?

Please, draw the most efficient Java solution, even by using additional widely used library, like Guava and Apache Commons.

Sotirios Delimanolis
  • 274,122
  • 60
  • 696
  • 724
mat_boy
  • 12,998
  • 22
  • 72
  • 116
  • 1
    About #3 - the post you're citing states the opposite: *In conclusion I think I'll still use Splitter most of the time. On small lists the difference in performance is going to be negligible, and Splitter just feels much nicer to use. Still I was surprised by the result, and if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils.* Plus, Splitter is much, **much** more powerful than String#split or Apache Commons solution. – Grzegorz Rożniecki Apr 09 '13 at 14:57
  • 1
    Why so much interest in optimizing? – Paul Vargas Apr 09 '13 at 14:59
  • @PaulVargas Because I have thousands strings to be tokenized from large texts. – mat_boy Apr 09 '13 at 15:01
  • @Xaerxess I read the post: it doesn't state the opposite. He concludes that "_if you're splitting lots of Strings and performance is an issue, it might be worth considering switching back to Commons StringUtils_". I'm aiming to this... – mat_boy Apr 09 '13 at 15:04
  • Did you mean read large files of text? – Paul Vargas Apr 09 '13 at 15:07
  • Author of the blog post does two benchmarks: in first Guava wins (is 10x faster), in second is 4x slower. Does 100ms vs 400ms make a difference for you? Is there really a sensible and measured performance problem? If not, see @PaulVargas comments. And if you have thousands of strings to tokenize **and** operate on (search etc.) use some text search engine like [Lucene](http://lucene.apache.org/core/). – Grzegorz Rożniecki Apr 09 '13 at 15:18
  • I'm far from convinced by that author's benchmarking methodology, too... – Louis Wasserman Apr 09 '13 at 15:22
  • @mat_boy: There are a lot of issues not taken into account, like GC and JIT compilation. Java microbenchmarking is _hard_, and an accurate benchmark isn't the sort of thing you can whip up from scratch in an hour. – Louis Wasserman Apr 09 '13 at 15:32

5 Answers5

5
for (String str : Arrays.asList(s1, s2, s3)) {
  Iterables.addAll(tokens, Splitter.on(' ').split(str));
}

would be the way I'd do it. That said, ArrayList is preferable to LinkedList for almost all use cases; without further data, we really can't tell whether or not you're in one of those rare cases where LinkedList is preferable.

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
4

If you have small Strings and performance isn't an issue, you can just combine split with addAll like this:

String s1 = "My mom cook everyday";
String s2 = "I eat everyday";
String s3 = "Am I fat?";  
List<String> tokens = new ArrayList<String>();  

tokens.addAll(Arrays.asList(s1.split("\\s+")));
tokens.addAll(Arrays.asList(s2.split("\\s+")));
tokens.addAll(Arrays.asList(s3.split("\\s+")));

System.out.println(tokens);

However if performance is an issue here's an alternative solution:

Since there is no definition in how these long texts are acquired, I'll assume they come in an InputStream. See if this method is performatic enough to fit your needs:

public List<String> readTokens(InputStream is) throws IOException{
    Reader reader = new InputStreamReader(is);
    List<String> tokens = new ArrayList<String>();
    BufferedReader bufferedReader = new BufferedReader(reader);
    String line = null;
    while((line = bufferedReader.readLine()) != null){
        String[] lineTokens = StringUtils.split(line, " "); 
        for(int i = 0 ; i < lineTokens.length ; i++){
            tokens.add(lineTokens[i]);
        }
    }
    return tokens;
}

And as to your statement regarding ArrayList vs LinkedList for inserting at the end, perhaps you should read this

Community
  • 1
  • 1
Rodrigo Sasaki
  • 7,048
  • 4
  • 34
  • 49
  • I read that split from String is not a really effective solution. This is why, for instance, Apache Commons comes with `StringUtils.split`! Maybe because it use patterns... – mat_boy Apr 09 '13 at 14:53
  • Are your Strings really that large? Because you shouldn't worry about performance where it isn't an issue – Rodrigo Sasaki Apr 09 '13 at 14:56
  • 1
    Moreover, ArrayList is not efficient as LinkedList when the number of elements to be inserted is high! In fact, ArrayList is just a "wrapper" for an array, it comes with a default sized array and once you exceed the default size you have to create a new array and to copy the old values into the newer! Very uneffective! – mat_boy Apr 09 '13 at 14:57
  • Uneffective is a very subjective term. You need to state your needs of performance and your requirements. My answer was based on your example. And for what you have stated, my answer is very effective. Perhaps a more real example would help us draw a better solution – Rodrigo Sasaki Apr 09 '13 at 15:01
  • @mat_boy, I edited my answer, see if this new answer fits your needs better – Rodrigo Sasaki Apr 09 '13 at 15:09
  • Mhhh! Your second answer was better! However, I have to say that I'm doing a lot of tests, and LinkedList seems to provide better insertion time. Do a test at your own and let me know! – mat_boy Apr 09 '13 at 15:36
  • @mat_boy If you're so concerned about ArrayList and / or LinkedList and don't need mutability, you should consider `Arrays.asList` wrapper (wraps an array, has fixed length) or Guava's `ImmutableList` (immutable, [proved to be very efficient](https://code.google.com/p/memory-measurer/wiki/ElementCostInDataStructures)) or `Iterables.addAll` (see @LouisWasserman's answer). – Grzegorz Rożniecki Apr 09 '13 at 15:38
  • 1
    @mat_boy: "Very uneffective!" It may sound inefficient, but in practice, it's frequently still more efficient than `LinkedList`. It's still linear time overall, and copying one array to another array is actually blazing fast when it's done with tools like `memcpy` (which is how Java does it under the covers). The extra indirections involved with `LinkedList` can be expensive. – Louis Wasserman Apr 09 '13 at 15:43
  • @Xaerxess I'm doing a lot of tests. The version with `String.split() `and LinkedList is the faster when only few string and tokens must be allocated into the list. The version with StringUtils needs less execution time compared with the Guava based solution (e.g. see @LouisWasserman).But this is not explanatory of what will happen at runtime, I'm only proving that if I need to load a lot of data at once, then StringUtils is better. Now! I'm really confused about the best answer! – mat_boy Apr 09 '13 at 15:47
  • Just check whichever one you're actually going to use (if any), both answers here were good for your problem. – Rodrigo Sasaki Apr 09 '13 at 15:52
  • I want to thank you both [Rodrigo Sasaki](http://stackoverflow.com/users/1316000/rodrigo-sasaki) and [Louis Wasserman](http://stackoverflow.com/users/869736/louis-wasserman) for their answers. I will accept the first one because it seems to be fastes! I also invite the author to edit the the answer to add the second solution based on String.split, because it is faster with small tokens. – mat_boy Apr 09 '13 at 16:05
  • @RodrigoSasaki Thank you! Now, both cases are covered. Thanks. About performance of LinkedList vs ArrayList, [here](http://stackoverflow.com/questions/322715/when-to-use-linkedlist-over-arraylist) people can find a nice Q/A. As I stated, the insertion is O(n) for the ArrayList in the worst case, while it is always O(1) for the LinkedList. Hence, in my case LinkedList is better. This is also true if we consider that I then only need to iterate the whole list, and not to do random access. Regards. – mat_boy Apr 10 '13 at 07:34
  • @mat_boy, I think you should consider other things here. If all you care about is overall time, odds are that the ArrayList will be faster. it taking O(n) time doesn't mean that it'll be slower, because when you get to the point that your array needs to grow, you already paid the performance cost in all those quick inserts, and besides, you can set your array to start at a high value, size of 1000 for example. – Rodrigo Sasaki Apr 10 '13 at 22:28
  • @mat_boy LinkedList's O(1) is very often slower than ArrayList's (amortized) O(n) for small arrays. See [this benchmark](http://1.microbenchmarks.appspot.com/run/wasserman.louis@gmail.com/edu.uchicago.lowasser.fingertree.ListBenchmark/1207004). You can also tune ArrayList's capacity while creating it. – Grzegorz Rożniecki Apr 12 '13 at 09:41
  • @Xaerxess Thank you for the link. However, I did some test with the Rodigo Sasaki answer and for small arrays the execution time provided by Guava Stopwatch was better when using LinkedList. Moreover, I can set the ArrayList size by default when I don't know the number of tokens that will be put inside. – mat_boy Apr 12 '13 at 09:57
0

or just Arrays.asList((s1 + " " + s2 + " " + s3).split("\\s+"))

AlexR
  • 114,158
  • 16
  • 130
  • 208
0

First join your strings using your separator (see Join a string using delimiters). Then:

 LinkedList<String> tokens = new LinkedList<String>();
 StringTokenizer st = new StringTokenizer(yourstr); // " " as a default delimiter
 while (st.hasMoreTokens()) {
     tokens.add(st.nextToken());
 }

Are you looking for an efficient or performant solution (i.e. what is your constraints/reference performance)?

Community
  • 1
  • 1
metaphori
  • 2,681
  • 1
  • 21
  • 32
0
     import java.util.ArrayList;
     import java.util.Collections;


    public class stringintotoken {
String s="my name is tarun bharti";
ArrayList <String> words=new ArrayList<String>();
public static void main(String[] args)
{
    stringintotoken st=new stringintotoken();
    st.go();
}
public void go()
{
    wordlist();
    System.out.println(words);
    Collections.sort(words);
    System.out.println(words);

}
public void wordlist()
{
    String[] tokens=s.split(" ");
    for(int i=0;i<tokens.length;i++)
    {
    words.add(tokens[i]);
    }
}

}

Tarun Bharti
  • 185
  • 3
  • 17