I have a large text file (100+ MB) containing 10 million integers, one per line. The file size and the number of lines may change, so I don't know them in advance.

I want to load the file into an int[] as fast as possible. My first solution was this:

public int[] fileToArray(String fileName) throws IOException
{
    List<String> list = Files.readAllLines(Paths.get(fileName));
    int[] res = new int[list.size()];
    int pos = 0;
    for (String line: list)
    {
        res[pos++] = Integer.parseInt(line);
    }
    return res;
}

It was pretty fast: 5.5 seconds, of which 5.1 s went to the readAllLines call and 0.4 s to the loop.

But then I decided to try using BufferedReader, and came to this different solution:

public int[] fileToArray(String fileName) throws IOException
{
    BufferedReader bufferedReader = new BufferedReader(new FileReader(new File(fileName)));
    ArrayList<Integer> ints = new ArrayList<Integer>();
    String line;
    while ((line = bufferedReader.readLine()) != null)
    {
        ints.add(Integer.parseInt(line));
    }
    bufferedReader.close();

    int[] res = new int[ints.size()];
    int pos = 0;
    for (Integer i: ints)
    {
        res[pos++] = i.intValue();
    }
    return res;
}

This was even faster! 3.1 seconds: about 3 s for the while loop and less than 0.1 s for the for loop.

I know there is not much room for optimization here, at least in time, but building an ArrayList and then copying it into an int[] seems like too much memory to me.

Any ideas on how to make this faster, or avoid using the middle ArrayList?

Just for comparison, I do this same task with FreePascal in 1.9 seconds [see edit], using the TStringList class and the StrToInt function.

EDIT: Since the Java method turned out to be this fast, I had to improve the FreePascal one. It now takes 330~360 ms.

mclopez
  • Looks like you've gathered some good metrics already. You may want to have a look at https://stackoverflow.com/questions/13155700/fastest-way-to-read-and-write-large-files-line-by-line-in-java – Jameson Aug 21 '16 at 23:13
  • You can try this: `ArrayList<Integer> ints = new ArrayList<Integer>(); Integer[] res = ints.toArray(new Integer[ints.size()]);` – ravthiru Aug 21 '16 at 23:17
  • Could you approximate the number of ints in the file by obtaining the file size? You could pass this to your ArrayList<> as the initial capacity in the constructor and maybe it wouldn't need to grow so many times. – WW. Aug 21 '16 at 23:33
  • @WW. I tried it, makes no measurable difference. – mclopez Aug 21 '16 at 23:49
  • Does your FreePascal use Unicode (Java does internally use a sort of UTF-16, so every char costs two bytes; Java 9 will offer a more compact encoding for latin-1 strings)? In Java, `String`s are first class objects, this has some cost. +++ What is your platform encoding? Using `new InputStreamReader(new FileInputStream(...), encoding)` might gain some speed. – maaartinus Aug 21 '16 at 23:56
  • @maaartinus The class notes for `FileInputStream` say "For reading streams of characters, consider using FileReader." I feel that the JDK might have the best recommendation here, since it uses the platform encoding by default anyway. – 4castle Aug 22 '16 at 00:07
  • @4castle Depending on what you're doing, platform encoding might be what you want. But using it implicitly makes it easy to forget to specify an encoding when one is needed (e.g., for downloaded files). I'd personally deprecate all such methods (but nobody's asking me :D). +++ Anyway, `FileReader` is just a trivial shortcut making the typical `BufferedReader+InputStreamReader+FileInputStream` pattern a step less painful. I don't think these classes are worth much; they come from about the same time as `Vector` and other mistakes. +++ My main point was that a different encoding might be faster. – maaartinus Aug 22 '16 at 00:23
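
The suggestions in the comments above (estimating the element count from the file size, and naming the encoding explicitly) can be combined with a plain int[] buffer, which avoids the intermediate ArrayList<Integer> and its boxing. The following is only a sketch under assumptions, not a measured result: the class name PrimitiveIntLoader, the constant ASSUMED_BYTES_PER_LINE and the choice of ISO-8859-1 are made up for illustration, and the asker reported that pre-sizing the ArrayList made no measurable difference.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical helper: reads one integer per line into a growable int[] buffer.
final class PrimitiveIntLoader {
    // Assumed average bytes per line (digits plus line terminator); only a sizing hint.
    private static final int ASSUMED_BYTES_PER_LINE = 8;

    static int[] fileToArray(String fileName) throws IOException {
        File file = new File(fileName);
        // Estimate the element count from the file size; a wrong guess only costs a resize.
        long estimate = file.length() / ASSUMED_BYTES_PER_LINE + 16;
        int[] buf = new int[(int) Math.min(estimate, Integer.MAX_VALUE - 8)];
        int count = 0;
        // Explicit single-byte charset (an assumption: the file holds only ASCII digits and signs).
        try (BufferedReader reader =
                 Files.newBufferedReader(file.toPath(), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (count == buf.length) {
                    // Grow by roughly 1.5x; no overflow handling for arrays near the int limit.
                    buf = Arrays.copyOf(buf, buf.length + (buf.length >> 1) + 1);
                }
                buf[count++] = Integer.parseInt(line);
            }
        }
        // Trim the buffer to the number of values actually read.
        return Arrays.copyOf(buf, count);
    }
}
```

Peak memory here is roughly the buffer plus the trimmed copy, both primitive arrays, rather than a list of boxed Integer objects plus the final array. Whether it is also faster would have to be measured on the actual file.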

1 Answer

If you're using Java 8, you can eliminate the intermediate ArrayList by calling lines() on the BufferedReader, mapping each line to an int, and collecting the values into an array.

You should also be using try-with-resources for proper exception handling and auto-closing.

try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
    return br.lines()
             .mapToInt(Integer::parseInt)
             .toArray();
}

I'm not sure if this is faster, but it is certainly much easier to maintain.

Edit: It is apparently MUCH faster.
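
For anyone who wants to check the timing themselves, the snippet above could be wrapped into a complete class along these lines. This is a rough sketch rather than a rigorous benchmark (a harness such as JMH would be more reliable); the class name StreamIntLoader and the timeMillis helper are hypothetical names introduced here.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical wrapper around the lines()/mapToInt approach, plus a crude timer.
final class StreamIntLoader {
    static int[] fileToArray(String fileName) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
            return br.lines()
                     .mapToInt(Integer::parseInt)
                     .toArray();
        } catch (UncheckedIOException e) {
            // lines() wraps read errors in UncheckedIOException; unwrap for the caller.
            throw e.getCause();
        }
    }

    // Runs a few untimed warm-up passes, then returns the wall-clock time of one pass in ms.
    static long timeMillis(String fileName, int warmups) throws IOException {
        for (int i = 0; i < warmups; i++) {
            fileToArray(fileName);
        }
        long start = System.nanoTime();
        fileToArray(fileName);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling StreamIntLoader.timeMillis(fileName, 3) gives a single measurement after three warm-up reads. Results will vary with disk caching and JIT warm-up, so several runs are worth comparing.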

4castle
  • I'd be interested if @mclopez could give us some performance information on this solution. – WW. Aug 21 '16 at 23:31
  • @mclopez If it helps you in your research, the features being used here are `Stream`s and method references. You should also research lambda expressions while you're at it. – 4castle Aug 22 '16 at 00:15
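
As a small illustration of the features named in the last comment, the method reference `Integer::parseInt` can also be written as a lambda expression; both forms supply the same function to mapToInt. The class and method names below are hypothetical and exist only for this sketch.

```java
import java.util.List;

// Hypothetical demo: the same stream pipeline written two ways.
final class MethodRefDemo {
    // Method reference, as used in the answer.
    static int[] parseWithMethodRef(List<String> lines) {
        return lines.stream().mapToInt(Integer::parseInt).toArray();
    }

    // Equivalent lambda expression: takes a String and returns its int value.
    static int[] parseWithLambda(List<String> lines) {
        return lines.stream().mapToInt(s -> Integer.parseInt(s)).toArray();
    }
}
```

Both methods return the same array for the same input; the method reference is simply the shorter spelling when the lambda does nothing but forward its argument.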