1

Take a look at the following link:

http://snippetsofjosh.wordpress.com/tag/advantages-and-disadvantages-of-arraylist/

This one of the reasons why I always prefer to use Arrays instead of (Array)Lists. Still, this got me thinking about memory management and speed.

Hence I arrived at the following question:

What is the best way to store data from a file when you don't know the size of the file (/number of entries) (where best is defined as 'the least amount of computation time')

Below, I will present 3 different methods and I would like to know which one of them is best and why. For the clarity of the question, let's assume I must end up with an Array. Also, let's assume every line from our .txt file only has one entry (/one string). Also, for limiting the scope of the questions, I will limit this question to Java only.

Let's say we want to retrieve the following info from a file called words.txt:

Hello
I 
am
a
test 
file

Method 1 - Double and dangerous

File read = new File("words.txt");
Scanner in = new Scanner(read);

int counter = 0;

while (in.hasNextLine())
{
    in.nextLine();
    counter++;
}

String[] data = new String[counter];

in = new Scanner(read);

int i = 0;

while (in.hasNextLine())
{
    data[i] = in.nextLine();
    i++;
}

Method 2 - Clear but redundant

File read = new File("words.txt");
Scanner in = new Scanner(read);

ArrayList<String> temporary = new ArrayList<String>();

while (in.hasNextLine())
{
    temporary.add(in.nextLine());
}

String[] data = new String[temporary.size()];

for (int i = 0; i < temporary.size(); i++)
{
    data[i] = temporary.get(i);
}

Method 3 - Short but rigid

File read = new File("words.txt");
FileReader reader = new FileReader(read);

String content = null;

char[] chars = new char[(int) read.length()];
reader.read(chars);
content = new String(chars);

String[] data = content.split(System.getProperty("line.separator"));

reader.close(); 

If you have an alternative way (which is even better) then please supply it below. Also, feel free to adjust my code where necessary.


Answer:

The fastest method for storing data in an array is the following method:

File read = new File("words.txt");
Scanner in = new Scanner(read);

ArrayList<String> temporary = new ArrayList<String>();

while (in.hasNextLine()) {
    temporary.add(in.nextLine());
}

String[] data = temporary.toArray(new String[temporary.size()]);

And for Java 7+:

Path loc = Paths.get(URI.create("file:///Users/joe/FileTest.txt"));
List<String> lines = Files.readAllLines(loc, Charset.defaultCharset());
String[] array = lines.toArray(new String[lines.size()]);
Community
  • 1
  • 1
Jean-Paul
  • 19,910
  • 9
  • 62
  • 88
  • 1
    Hmmm... The boxing argument only applies to primitives, which is not a strong enough reason to avoid lists. Afer all, you know when you are using primitive arrays, so this is not somethign you are not aware of. And ease of maintainece is better than performance gains where it doesn't matter. – Devolus Jun 26 '13 at 12:17
  • @Devolus: Seems reasonable. So you your advice is to 'just use ArrayLists?' – Jean-Paul Jun 26 '13 at 12:20
  • 1
    Use what is best for your application design and worry about performance where it matter.s – Devolus Jun 26 '13 at 12:22
  • 2
    @Jean-Paul Unless you are dealing with primitives or you are in a critical section of your program, there is little advantage of using arrays. In particular, your code will spend 99% of its time (if not more) reading the file and only < 1% working on the array or list operations. See also: http://stackoverflow.com/a/16565376/829571 – assylias Jun 26 '13 at 12:46
  • 1
    Actually your answer isn't the fastest. A `LinkedList` would be faster than an `ArrayList` for appending. – user207421 Apr 08 '15 at 10:29
  • @EJP: Interesting, could you post that as an answer below + run the benchmark LesMiserables.txt as assylias did? – Jean-Paul Apr 08 '15 at 10:34
  • Hi Jean-Paul. Please don't edit an "answer" into the question. You can create your own answer to the question by clicking the "Answer" link, and then accept it if you believe it is the best. – DJClayworth Sep 10 '15 at 21:41
  • 1
    Ask a question. Answer it yourself. Get silently downvoted because someone is in a bad mood and has rep. SO culture is interesting. – sixtytrees Aug 09 '16 at 20:07

5 Answers5

3

I assume that best means faster here.

I would use method 2, but create the array with the methods provided by the Collection interface:

String[] array = temporary.toArray(new String[temporary.size()]);

Or even simpler (Java 7+):

List<String> lines = Files.readAllLines(file, charset);
String[] array = lines.toArray(new String[lines.size()]);

Other methods:

  • method 1 does two passes and it is very unlikely that reading a file is more efficient than resizing an arraylist
  • I am not sure if method 3 is faster or not

Update:

for the sake of completeness, I have run a microbenchmark with the modified method2 as above and including an additional method (method4) that reads all bytes at once, creates a string and split on new lines. The results (in mn microseconds):

Benchmark   Mean 
method1     126.178
method2     59.679
method3     76.622
method4     75.293

Edit:

with a larger 3MB file: LesMiserables.txt, the results are consistent:

Benchmark      Mean 
method1     608649.322
method2      34167.101
method3      63410.496
method4      65552.79
assylias
  • 321,522
  • 82
  • 660
  • 783
  • So now the array will take the size of the arraylist hence this will extinguish the need for the second loop. Very cool! But still the question remains: is this the fastest method? – Jean-Paul Jun 26 '13 at 12:19
  • @Jean-Paul, did you look at the source? – Steve P. Jun 26 '13 at 12:25
  • @SteveP.: I did. It says: `Returns an array containing all of the elements in this collection; the runtime type of the returned Array is that of the specified array.` So I figured that somehow the returned array must also loop over the elements in the ArrayList hence it is sort of the same as method 2? – Jean-Paul Jun 26 '13 at 12:30
  • @Jean-Paul `toArray()` calls `arrayCopy()`, a `native` method, so I guess it depends, but this is not going to be any sort of bottleneck on your program. I understand wanting to be as efficient as possible, but in this case, I wouldn't worry about it too much. The way assylias has written above seems like the way you should do it. It's easy and simple. – Steve P. Jun 26 '13 at 12:34
  • @assylias. Thank you! This almost completely answers my question: can you also do it for a very large file, to see what will happen with the times? I just created one for you of 3MB: http://www.speedyshare.com/xvAGq/LesMiserables.txt Please let me know if you need a bigger file. – Jean-Paul Jun 26 '13 at 12:46
  • @Jean-Paul Sorry but I won't install an executable from an unknown website. I will test with random data. – assylias Jun 26 '13 at 12:51
  • @assylias: It's not an executable: that's the link at the bottom. You get the file by clicking at the link at the top. Anyway, I've just added it to my dropbox. So this should work now as well: https://www.dropbox.com/s/2rsxgpz6orz6c2f/LesMiserables.txt – Jean-Paul Jun 26 '13 at 12:56
  • 1
    @Jean-Paul Same results. Also, if you don't *need* an array just use the result of `Files.readAllLines` and don't bother converting into an array (which will save a few hundreds of nanoseconds). – assylias Jun 26 '13 at 13:04
  • @assylias: Awesome! This clearly proves method 2 is the best. Thank you for your time and effort! – Jean-Paul Jun 26 '13 at 13:06
3

A very good comparison with all the source code is given here java_tip_how_read_files_quickly

Summary:

For the best Java read performance, there are four things to remember:

  • Minimize I/O operations by reading an array at a time, not a byte at a time. An 8Kbyte array is a good size.
  • Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.
  • Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.
  • Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.

Hope that helps.

EDIT

I would do sth like that:

File read = new File("words.txt");
Scanner in = new Scanner(read);    
List<String> temporary = new LinkedList<String>();

while (in.hasNextLine()) {
    temporary.add(in.nextLine());
}

String[] data = temporary.toArray(new String[temporary.size()]);

The main difference is reading data only once (as opposed to other 2 methods) and addition in linkedlist is very cheap + no extra operation on lines needed (like splitting) - don't use arraylist here

Tala
  • 8,888
  • 5
  • 34
  • 38
  • It sure does! Very very interesting! One of the conclusions is: minimize data copying; this indicates to method 1... So what do you think? Which method is the best according to the article? – Jean-Paul Jun 26 '13 at 12:26
2

If you are reading data from a file, the bottleneck will be the file reading (IO) stage. The time spent processing it will be insignificant in almost all cases. So do what is correct and safe. First you make it right; then you make it fast.

If you don't know the size of the file you must have some kind of dynamically expanding data structure. Which is what ArrayList is. Code you write yourself is unlikely to be more eficient or correct than such an important part of the Java API. So just use ArrayList: option 2.

Raedwald
  • 46,613
  • 43
  • 151
  • 237
1

I would use guava

File file = new File("words.txt");
List<String> lines = Files.readLines(file, Charset.defaultCharset());
// If it really has to be an array:
String[] array = lines.toArray(new String[0]);
jlordo
  • 37,490
  • 6
  • 58
  • 83
1
List<String> lines = Files.readAllLines(yourFile, charset);
String[] arr = lines.toArray(new String[lines.size()]);
Ruchira Gayan Ranaweera
  • 34,993
  • 17
  • 75
  • 115