
I have found plenty of different suggestions on how to parse an ASCII file containing double-precision numbers into an array of doubles in Java. What I currently use is roughly the following:

FileInputStream stream = new FileInputStream(fname);
BufferedReader breader = new BufferedReader(new InputStreamReader(stream));
Scanner scanner = new Scanner(breader);
double[] array = new double[size]; // size is known upfront
int idx = 0;
try {
    while (idx < size) {
        array[idx] = scanner.nextDouble();
        idx++;
    }
}
catch (Exception e) { ... }

For an example file with 1 million numbers this code takes roughly 2 seconds. Similar code written in C, using fscanf, takes 0.1 seconds (!). Clearly I got it all wrong. I guess calling nextDouble() so many times is the wrong way to go because of the overhead, but I cannot figure out a better way.

I am no Java expert and hence I need a little help with this: can you tell me how to improve this code?

Edit: The corresponding C code follows:

  FILE *fd = fopen(fname, "r+");
  double *vals = calloc(size, sizeof(double));
  int idx = 0, nel;
  do {
    nel = fscanf(fd, "%lf", vals + idx);
    idx++;
  } while (nel != -1);
angainor
  • Can you show us that "similar code written in C"? – Thomas Oct 11 '15 at 11:11
  • Did you profile the code to check which parts are the performance bottleneck? – Dragan Bozanovic Oct 11 '15 at 11:13
  • @DraganBozanovic No I didn't. I do not have much experience with Java; I used emacs to program this. It is such a small piece of code that the bottleneck should be rather obvious to find, I guess. It does lie in the while loop :) and I do not suspect it comes from `array[idx]` and memory allocation. – angainor Oct 11 '15 at 11:16
  • _"It is such a small piece of code"_ No, it is not. There is lots of code in the libraries you invoke. Nevertheless, I suggest you give it a try with a profiler. Also, keep in mind that Java has JIT, which may consume some time at the beginning when it optimizes the code. Could you try to execute the same snippet of code 100 times for example and check the execution time of the last few iterations? – Dragan Bozanovic Oct 11 '15 at 11:25
  • 10 iterations take 10 times more time. The time is spent in the loop. – angainor Oct 11 '15 at 11:46
  • Please post the "catch" part. Are you sure that there are not 1 million Exceptions thrown in the loop? A `Scanner` is a rather complex thing, using regular expressions and all that - not really intended for "big data" ;-) – Marco13 Oct 11 '15 at 11:48
  • @Marco13 If an exception is thrown, would the loop not be terminated? The catch statement does a `System.out.println("<...>Exception");` and throws again. The loop is terminated naturally, not by exceptions - I made sure this is the case :) – angainor Oct 11 '15 at 11:51
  • Just asking, because the code as it is posted now will neither compile nor read doubles from a file (a code snippet that was closer to the "real" one might have avoided misunderstandings here...) – Marco13 Oct 11 '15 at 11:54
  • Are all the numbers on a single line separated by spaces, or on individual lines? – Andreas Oct 11 '15 at 12:06
  • Although the performance statements are somewhat dubious, this is an interesting question. Most idiomatic Java solutions will, in one form or another, rely on some `readLine` method, which is not applicable when the numbers are only separated by spaces. The fastest one that I came up with so far used a `StreamTokenizer`, and maybe I'll do more tests and post the results later, but I am looking forward to other suggestions (+1) – Marco13 Oct 11 '15 at 12:36
  • @Andreas The file consists of lines that each contain between 1 and 4 doubles, separated by whitespace. – angainor Oct 11 '15 at 12:55
  • @Marco13 Well, the performance statement is based on one example only. I see why you wonder about the file structure. There are 1e6 doubles, each looks like `1234.0000`, mostly 4 doubles per line, but there could be, say, 8 doubles per line - a limited number. The separation between individual numbers is whitespace, but not necessarily only one space. I could test with other input data, if you are interested. – angainor Oct 11 '15 at 12:59

1 Answer


(Summarizing some of the things that I already mentioned in the comments:)

You should be careful with manual benchmarks. The answer to the question "How do I write a correct micro-benchmark in Java?" points out some of the basic caveats. However, this case is not so prone to the classical pitfalls. In fact, the opposite might be the case: when the benchmark solely consists of reading a file, then you are most likely not benchmarking the code, but mainly the hard disc. This involves the usual side effects of caching.

However, there obviously is an overhead beyond the pure file IO.

You should be aware that the Scanner class is very powerful and convenient. But internally, it is a beast consisting of large regular expressions and hides a tremendous complexity from the user - a complexity that is not necessary at all when your intention is to only read double values!

There are solutions with less overhead.

Unfortunately, the simplest solution is only applicable when the numbers in the input are separated by line separators. Then, reading this file into an array could be written as

double result[] = 
    Files.lines(Paths.get(fileName))
        .mapToDouble(Double::parseDouble)
        .toArray();

and this could even be rather fast. When there are multiple numbers in one line (as you mentioned in the comments), then this could be extended:

double result[] = 
    Files.lines(Paths.get(fileName))
        .flatMap(s -> Stream.of(s.split("\\s+")))
        .mapToDouble(Double::parseDouble)
        .toArray();
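
One caveat (not part of the original snippet): if a line starts with whitespace, `split("\\s+")` yields an empty first token, and `Double.parseDouble` would throw a `NumberFormatException`. A defensive variant could trim each line and skip blank ones:

double result[] = 
    Files.lines(Paths.get(fileName))
        .map(String::trim)                         // drop leading/trailing whitespace
        .filter(s -> !s.isEmpty())                 // skip blank lines
        .flatMap(s -> Stream.of(s.split("\\s+")))
        .mapToDouble(Double::parseDouble)
        .toArray();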

So regarding the general question of how to efficiently read a set of double values from a file, separated by whitespace (but not necessarily separated by newlines), I wrote a small test.

This should not be considered a real benchmark and should be taken with a grain of salt, but it at least tries to address some basic issues: it reads files of different sizes, multiple times, with different methods, so that for the later runs, the effects of hard disc caching should be the same for all methods:

Updated to generate sample data as described in the comments, and to include the stream-based approach:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.stream.Stream;

public class ReadingFileWithDoubles
{
    private static final int MIN_SIZE = 256000;
    private static final int MAX_SIZE = 2048000;

    public static void main(String[] args) throws IOException
    {
        generateFiles();

        long before = 0;
        long after = 0;
        double result[] = null;

        for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
        {
            String fileName = "doubles"+n+".txt";

            for (int i=0; i<10; i++)
            {
                before = System.nanoTime();
                result = readWithScanner(fileName, n);
                after = System.nanoTime();

                System.out.println(
                    "size = " + n + 
                    ", readWithScanner                  " + 
                        (after - before) / 1e6 + 
                    ", result " + result);

                before = System.nanoTime();
                result = readWithStreamTokenizer(fileName, n);
                after = System.nanoTime();

                System.out.println(
                    "size = " + n + 
                    ", readWithStreamTokenizer          " + 
                        (after - before) / 1e6 +
                    ", result " + result);

                before = System.nanoTime();
                result = readWithBufferAndStringTokenizer(fileName, n);
                after = System.nanoTime();

                System.out.println(
                    "size = " + n + 
                    ", readWithBufferAndStringTokenizer " + 
                        (after - before) / 1e6 + 
                    ", result " + result);

                before = System.nanoTime();
                result = readWithStream(fileName, n);
                after = System.nanoTime();

                System.out.println(
                    "size = " + n + 
                    ", readWithStream                   " + 
                        (after - before) / 1e6 + 
                    ", result " + result);
            }
        }

    }



    private static double[] readWithScanner(
        String fileName, int size) throws IOException
    {
        try (
            InputStream is = new FileInputStream(fileName);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr);
            Scanner scanner = new Scanner(br))
        {
            // Do this to avoid surprises on systems with a different locale!
            scanner.useLocale(Locale.ENGLISH);

            int idx = 0;
            double array[] = new double[size];
            while (idx < size)
            {
                array[idx] = scanner.nextDouble();
                idx++;
            }
            return array;
        }
    }

    private static double[] readWithStreamTokenizer(
        String fileName, int size) throws IOException
    {
        try (
            InputStream is = new FileInputStream(fileName);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr))
        {
            StreamTokenizer st = new StreamTokenizer(br);            
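            // resetSyntax() makes every character "ordinary"; the wordChars()
            // calls then mark digits, '.', '-', 'e' and 'E' as word characters,
            // so each whitespace-separated number arrives as a single TT_WORD
            // token. ('+' is not included, so an exponent written as "1e+5"
            // would not be recognized as part of the number.)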
            st.resetSyntax();
            st.wordChars('0', '9');
            st.wordChars('.', '.');
            st.wordChars('-', '-');
            st.wordChars('e', 'e');
            st.wordChars('E', 'E');
            double array[] = new double[size];
            int index = 0;
            boolean eof = false;
            do
            {
                int token = st.nextToken();
                switch (token)
                {
                    case StreamTokenizer.TT_EOF:
                        eof = true;
                        break;

                    case StreamTokenizer.TT_WORD:
                        double d = Double.parseDouble(st.sval);
                        array[index++] = d;
                        break;
                }
            } while (!eof);
            return array;
        }
    }

    // This one is reading the whole file into memory, as a String,
    // which may not be appropriate for large files
    private static double[] readWithBufferAndStringTokenizer(
        String fileName, int size) throws IOException
    {
        double array[] = new double[size];
        try (
            InputStream is = new FileInputStream(fileName);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr))
        {
            StringBuilder sb = new StringBuilder();
            char buffer[] = new char[1024];
            while (true)
            {
                int n = br.read(buffer);
                if (n == -1)
                {
                    break;
                }
                sb.append(buffer, 0, n);
            }
            int index = 0;
            StringTokenizer st = new StringTokenizer(sb.toString());
            while (st.hasMoreTokens())
            {
                array[index++] = Double.parseDouble(st.nextToken());
            }
            return array;
        }
    }

    private static double[] readWithStream(
        String fileName, int size) throws IOException
    {
        double result[] = 
            Files.lines(Paths.get(fileName))
                .flatMap(s -> Stream.of(s.split("\\s+")))
                .mapToDouble(Double::parseDouble)
                .toArray();
        return result;
    }


    private static void generateFiles() throws IOException 
    {
        for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
        {
            String fileName = "doubles"+n+".txt";
            if (!new File(fileName).exists())
            {
                System.out.println("Creating "+fileName);
                writeDoubles(new FileOutputStream(fileName), n);
            }
            else
            {
                System.out.println("File "+fileName+" already exists");
            }
        }
    }
    private static void writeDoubles(OutputStream os, int n) throws IOException
    {
        OutputStreamWriter writer = new OutputStreamWriter(os);
        Random random = new Random(0);
        int numbersPerLine = random.nextInt(4) + 1;
        for (int i=0; i<n; i++)
        {
            writer.write(String.valueOf(random.nextDouble()));
            numbersPerLine--;
            if (numbersPerLine == 0)
            {
                writer.write("\n");
                numbersPerLine = random.nextInt(4) + 1;
            }
            else
            {
                writer.write(" ");
            }
        }
        writer.close();
    }
}

It compares 4 methods:

  • Reading with a Scanner, as in your original code snippet
  • Reading with a StreamTokenizer
  • Reading the whole file into a String, and dissecting it with a StringTokenizer
  • Reading the file as a Stream of lines, which are then flat-mapped to a Stream of tokens, which are then mapped to a DoubleStream

Reading the file as one large String may not be appropriate in all cases: when the files become (much) larger, keeping the whole file in memory as a String may not be a viable solution.
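
If memory is a concern, a chunked variant is possible. The following is only a sketch (not part of the test above; the method name is illustrative): it reads fixed-size buffers and carries an unfinished token over to the next buffer, so that a number cut at a buffer boundary is not lost:

// Sketch only: reads the file in fixed-size chunks, assembling tokens
// across chunk boundaries, so the whole file never has to be in memory
private static double[] readWithChunkedBuffer(
    String fileName, int size) throws IOException
{
    double array[] = new double[size];
    try (
        InputStream is = new FileInputStream(fileName);
        InputStreamReader isr = new InputStreamReader(is))
    {
        char buffer[] = new char[8192];
        StringBuilder token = new StringBuilder();
        int index = 0;
        int n;
        while ((n = isr.read(buffer)) != -1)
        {
            for (int i = 0; i < n; i++)
            {
                char c = buffer[i];
                if (Character.isWhitespace(c))
                {
                    if (token.length() > 0)
                    {
                        array[index++] = Double.parseDouble(token.toString());
                        token.setLength(0);
                    }
                }
                else
                {
                    token.append(c);
                }
            }
        }
        if (token.length() > 0) // a trailing number without final whitespace
        {
            array[index++] = Double.parseDouble(token.toString());
        }
    }
    return array;
}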

A test run (on a rather old PC, with a slow hard disc drive (no solid state)) showed roughly these results:

...
size = 1024000, readWithScanner                  9932.940919, result [D@1c7353a
size = 1024000, readWithStreamTokenizer          1187.051427, result [D@1a9515
size = 1024000, readWithBufferAndStringTokenizer 1172.235019, result [D@f49f1c
size = 1024000, readWithStream                   2197.785473, result [D@1469ea2
...

Obviously, the `Scanner` imposes a considerable overhead that can be avoided by reading more directly from the stream.

This may not be the final answer, as there may be more efficient and/or more elegant solutions (and I'm looking forward to seeing them!), but maybe it is helpful at least.


EDIT

A small remark: There is a certain conceptual difference between the approaches in general. Roughly speaking, the difference lies in who determines the number of elements that are read. In pseudocode, this difference is

double array[] = new double[size];
for (int i=0; i<size; i++) 
{
    array[i] = readDoubleFromInput();
}

versus

double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput())
{
    double d = readDoubleFromInput();
    array[index++] = d;
}

Your original approach with the scanner was written like the first one, while the solutions that I proposed are more similar to the second. But this should not make a large difference here, assuming that `size` is indeed the real size, and potential errors (like too few or too many numbers in the input) don't appear or are handled in some other way.
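
For completeness, a small sketch (not part of the benchmark above) of how both conditions could be checked with the `Scanner`, so that an input with too few or too many numbers is detected:

int index = 0;
while (index < size && scanner.hasNextDouble())
{
    array[index++] = scanner.nextDouble();
}
if (index < size || scanner.hasNextDouble())
{
    // too few (or too many) numbers in the input
    throw new IOException("Expected exactly " + size + " numbers");
}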

Marco13
  • Thank you very much, Marco13 :) This looks really really good! I will play with it and see, how it works on my data, but the last solution seems to give the 'C' results. I am thinking, can you not modify it so that you use the StringTokenizer on a buffer of finite size, and put that into a loop? In this way you would work on larger buffers and still use the StringTokenizer. Anyway, I was sure I do things wrong, thank you again for your help :-) – angainor Oct 11 '15 at 13:50
  • @angainor There certainly are hybrid solutions, like reading the file in "chunks" (and not a large, single buffer) and then processing these chunks. A more detailed analysis may be worthwhile, but they might become somewhat more complicated - e.g. one would have to make sure that the chunks do not cut apart numbers, as in `0.123 0.234 0.3` (cut) `45 0.456` - this can be handled, of course, but may be a bit fiddly. BTW: I'll add a small (more general) remark as an *EDIT* in a few minutes... – Marco13 Oct 11 '15 at 13:57
  • @Marco13: just curious. Can we just tokenize the stream directly into double numbers instead of going a round trip to parse string into double? – dragon66 Oct 11 '15 at 14:46
  • @dragon66 Theoretically, one *could* read each individual character, and process it like it is done in `FloatingDecimal.readJavaFormatString` with the characters of a `String`. However, it is usually a bad idea to read a stream character-by-character: Some buffering should usually be done (i.e. one should read a certain number of chars into a `char[]`, at least), and this would then be very close to a `String` anyhow. It's hard to predict any performance benefits, but my gut feeling is that the effort-to-performance-gain-ratio would be rather bad... – Marco13 Oct 11 '15 at 14:56
  • @Marco13: I mean in your code you use case StreamTokenizer.TT_WORD to grab the next token as a string. Why not just use case StreamTokenizer.TT_NUMBER? – dragon66 Oct 11 '15 at 15:01
  • Works really nicely. I ended up with the third solution, but reading/tokenizing line by line instead of the whole file at once. Thank you! – angainor Oct 11 '15 at 15:10
  • @dragon66 The `StreamTokenizer` class does not properly handle floating point numbers in scientific notation (see [this bug report](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4079180)), so a number like `1.23e-5` would be parsed as a number `1.23` and a word `e-5` - manually fixing this would be a hassle. – Marco13 Oct 11 '15 at 15:54
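
For reference, a minimal sketch of the line-by-line variant that angainor describes in the comments above (assuming that no number is split across lines; the method name is illustrative):

private static double[] readLineByLine(
    String fileName, int size) throws IOException
{
    double array[] = new double[size];
    try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(fileName))))
    {
        int index = 0;
        String line;
        while ((line = br.readLine()) != null)
        {
            // Tokenize each line on whitespace and parse the tokens directly
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens())
            {
                array[index++] = Double.parseDouble(st.nextToken());
            }
        }
        return array;
    }
}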