I need to read a file in Java that has 500,000+ lines, and I was wondering whether there's any way to speed the process up compared to my code:

    Scanner s1 = new Scanner(new FileInputStream(args[0]));
    while(s1.hasNextLine()) {
        temp += s1.nextLine() + "\n";
    }
    data = temp.split("\\s+");

It's fine at the start but after 200000 lines

temp += s1.nextLine() + "\n"

does end up taking a while. The final format I need is a string array of every word.

user2864154
  • StringBuilder is a better choice here for string append at least. – Juned Ahsan Jun 19 '14 at 02:32
  • This is only a guess, but the fact that String is an immutable object is likely the reason for this slowdown (garbage collection is taking place). [`StringBuilder`](http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuilder.html) or [`StringBuffer`](http://docs.oracle.com/javase/7/docs/api/java/lang/StringBuffer.html) would be a better choice, and you should initialize it with a size close to what you think the final `String` is going to be. – Jared Jun 19 '14 at 02:33
  • If you want an array of every word, why are you appending in the first place? Simply read word by word and insert in array. – Tanmay Patil Jun 19 '14 at 02:47

2 Answers

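The reason `temp += s1.nextLine() + "\n"` takes a long time is that you are generating a lot of strings. Strings are immutable, so each `+=` allocates a new string and copies everything accumulated so far into it. In fact, for N characters read, you are generating O(N) large strings, and copying O(N^2) characters.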

The solution to (just) that would be to append to a StringBuilder instead of using String concatenation. However, that's not the real solution here, because the temp string is not your ultimate goal. Your ultimate goal is to create an array of words.
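
For reference, the StringBuilder-only fix might look roughly like the sketch below. The class name `BuilderRead` and method name `readWords` are made up for illustration, and the sketch assumes the file is UTF-8 text; it also swaps the Scanner for a BufferedReader, which is not required for the fix.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class name, for illustration only.
class BuilderRead {

    // Reads the whole file into one StringBuilder, then splits on whitespace.
    static String[] readWords(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // append() writes into one growing buffer instead of copying
                // all of the accumulated text into a new String per line.
                sb.append(line).append('\n');
            }
        }
        // trim() avoids an empty first element when the text starts with whitespace.
        String text = sb.toString().trim();
        // "".split("\\s+") returns [""], so handle the empty case explicitly.
        return text.isEmpty() ? new String[0] : text.split("\\s+");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BuilderRead <file>");
            return;
        }
        String[] data = readWords(Paths.get(args[0]));
        System.out.println(data.length + " words");
    }
}
```

This still holds the entire file text in memory at once before splitting.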

What you really need to do is to split each line into words, and accumulate the words. But accumulating them directly into an array won't work well ... because arrays cannot be extended. So what I recommend is that you do the following:

  1. create an ArrayList<String> to hold all of the words
  2. read and split each line into an array of words
  3. append the words in the array to the list of all words
  4. when you are finished, use List.toArray to produce the final array of words ... or maybe just leave the words in the list, if that is more appropriate.
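
The four steps above can be sketched as follows. This is only one way to write it: the class name `WordReader` and method name `readWords` are hypothetical, the file is assumed to be UTF-8, and a BufferedReader stands in for the Scanner from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, for illustration only.
class WordReader {

    static String[] readWords(Path file) throws IOException {
        // Step 1: a list to hold all of the words.
        List<String> words = new ArrayList<>();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Step 2: split the current line into words.
                for (String word : line.split("\\s+")) {
                    // A blank line or leading whitespace yields an empty token; skip it.
                    if (!word.isEmpty()) {
                        // Step 3: append the word to the list of all words.
                        words.add(word);
                    }
                }
            }
        }
        // Step 4: produce the final array.
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordReader <file>");
            return;
        }
        String[] data = readWords(Paths.get(args[0]));
        System.out.println(data.length + " words");
    }
}
```

Only one line of text is held as a String at a time, so nothing is repeatedly re-copied as the file grows.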

> The final format I need is a string array of every word.

I read this above as meaning that you want a list of all of the words in the file. If a word appears multiple times in the file, it should appear multiple times in the list.

On the other hand, if you want a list of the distinct words in the file, then you should use a Set rather than a List to accumulate the words. Depending on what you want to do with the words next, HashSet, TreeSet or LinkedHashSet would be appropriate.
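
As a rough sketch of the distinct-words variant, here is the same loop accumulating into a LinkedHashSet, which keeps words in first-seen order (a HashSet or TreeSet could be substituted). The class name `DistinctWords` and method name `readDistinctWords` are hypothetical, and the file is assumed to be UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name, for illustration only.
class DistinctWords {

    static Set<String> readDistinctWords(Path file) throws IOException {
        // LinkedHashSet drops duplicates and remembers insertion order.
        Set<String> words = new LinkedHashSet<>();
        // Assumption: the file is UTF-8 encoded.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    // Skip the empty token produced by blank lines or leading whitespace.
                    if (!word.isEmpty()) {
                        words.add(word); // no effect if the word is already present
                    }
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DistinctWords <file>");
            return;
        }
        Set<String> distinct = readDistinctWords(Paths.get(args[0]));
        System.out.println(distinct.size() + " distinct words");
    }
}
```

Note that this comparison is case-sensitive, so "The" and "the" count as different words.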

Stephen C

Are you interested in each word, or in each line? That is, should the array hold one string per word or one string per line? Either way, as Stephen said, an ArrayList is a much nicer approach.

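You could use either of these loops, depending on which you want. They are alternatives: running both on the same Scanner would leave nothing for the second loop to read.

    // Requires java.util.ArrayList; yourScanner is a Scanner opened on the file.
    ArrayList<String> list = new ArrayList<>();

    // Option 1: each line as a string
    while (yourScanner.hasNextLine()) {
        list.add(yourScanner.nextLine());
    }

    // Option 2: each word (whitespace-delimited token) as a string
    while (yourScanner.hasNext()) {
        list.add(yourScanner.next());
    }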

String concatenation can be expensive, especially at 200,000+ lines with the 'temp' variable approach.