(Summarizing some of the things that I already mentioned in the comments:)
You should be careful with manual benchmarks. The answer to the question *How do I write a correct micro-benchmark in Java?* points out some of the basic caveats. However, this case is not so prone to the classical pitfalls. In fact, the opposite might be the case: when the benchmark solely consists of reading a file, then you are most likely not benchmarking the code, but mainly the hard disk, with the usual side effects of caching.
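(If one really wanted reliable timings for the code itself, a dedicated harness like JMH would be the tool of choice. Only as a minimal sketch, assuming that the JMH libraries are on the classpath, and that the method to be measured - here, the `readWithScanner` method from the test program further below - is made accessible:)

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ReadDoublesBenchmark
{
    // Hypothetical input file, assumed to have been generated beforehand
    private final String fileName = "doubles1024000.txt";

    @Benchmark
    public double[] readWithScanner() throws IOException
    {
        // Return the result, so that the JIT cannot treat the
        // work as unused and optimize it away
        return ReadingFileWithDoubles.readWithScanner(fileName, 1024000);
    }
}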
However, there obviously is an overhead beyond the pure file IO.
You should be aware that the `Scanner` class is very powerful and convenient. But internally, it is a beast consisting of large regular expressions, and it hides a tremendous complexity from the user - complexity that is not necessary at all when your intention is to only read `double` values!
There are solutions with less overhead.
Unfortunately, the simplest solution is only applicable when the numbers in the input are separated by line separators. Then, reading this file into an array could be written as
double result[] =
Files.lines(Paths.get(fileName))
.mapToDouble(Double::parseDouble)
.toArray();
and this could even be rather fast. When there are multiple numbers in one line (as you mentioned in the comment), then this could be extended:
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
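(One caveat here: if a line starts with whitespace, or is empty, then `split` will produce an empty first token, and `Double.parseDouble` will fail with a `NumberFormatException`. For such inputs, one could trim the lines and filter out the empty ones - only a sketch, which also closes the underlying stream via try-with-resources:)

double result[];
try (Stream<String> lines = Files.lines(Paths.get(fileName)))
{
    result = lines
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .flatMap(s -> Stream.of(s.split("\\s+")))
        .mapToDouble(Double::parseDouble)
        .toArray();
}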
So regarding the general question of how to efficiently read a set of `double` values from a file, separated by whitespace (but not necessarily by newlines), I wrote a small test.
This should not be considered a real benchmark, and should be taken with a grain of salt, but it at least tries to address some basic issues: it reads files of different sizes, multiple times, with different methods, so that for the later runs, the effects of hard disk caching should be the same for all methods:
Updated to generate sample data as described in the comment, and added the stream-based approach
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.stream.Stream;
public class ReadingFileWithDoubles
{
private static final int MIN_SIZE = 256000;
private static final int MAX_SIZE = 2048000;
public static void main(String[] args) throws IOException
{
generateFiles();
long before = 0;
long after = 0;
double result[] = null;
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
for (int i=0; i<10; i++)
{
before = System.nanoTime();
result = readWithScanner(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithScanner " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStreamTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStreamTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithBufferAndStringTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithBufferAndStringTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStream(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStream " +
(after - before) / 1e6 +
", result " + result);
}
}
}
private static double[] readWithScanner(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
Scanner scanner = new Scanner(br))
{
// Do this to avoid surprises on systems with a different locale!
scanner.useLocale(Locale.ENGLISH);
int idx = 0;
double array[] = new double[size];
while (idx < size)
{
array[idx] = scanner.nextDouble();
idx++;
}
return array;
}
}
private static double[] readWithStreamTokenizer(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
        StreamTokenizer st = new StreamTokenizer(br);

        // After resetSyntax(), the tokenizer treats all characters
        // as "ordinary". The wordChars() calls then mark exactly the
        // characters that may appear in a double literal, so that
        // each number is reported as a single TT_WORD token
        st.resetSyntax();
        st.wordChars('0', '9');
        st.wordChars('.', '.');
        st.wordChars('-', '-');
        st.wordChars('e', 'e');
        st.wordChars('E', 'E');
double array[] = new double[size];
int index = 0;
boolean eof = false;
do
{
int token = st.nextToken();
switch (token)
{
case StreamTokenizer.TT_EOF:
eof = true;
break;
case StreamTokenizer.TT_WORD:
double d = Double.parseDouble(st.sval);
array[index++] = d;
break;
}
} while (!eof);
return array;
}
}
// This one is reading the whole file into memory, as a String,
// which may not be appropriate for large files
private static double[] readWithBufferAndStringTokenizer(
String fileName, int size) throws IOException
{
double array[] = new double[size];
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StringBuilder sb = new StringBuilder();
char buffer[] = new char[1024];
while (true)
{
int n = br.read(buffer);
if (n == -1)
{
break;
}
sb.append(buffer, 0, n);
}
int index = 0;
StringTokenizer st = new StringTokenizer(sb.toString());
while (st.hasMoreTokens())
{
array[index++] = Double.parseDouble(st.nextToken());
}
return array;
}
}
    private static double[] readWithStream(
        String fileName, int size) throws IOException
    {
        // Note: In contrast to the other methods, this one does not
        // use the given size - the stream itself determines the number
        // of elements. The stream returned by Files.lines is closed
        // via try-with-resources, because it holds an open file handle
        try (Stream<String> lines = Files.lines(Paths.get(fileName)))
        {
            return lines
                .flatMap(s -> Stream.of(s.split("\\s+")))
                .mapToDouble(Double::parseDouble)
                .toArray();
        }
    }
private static void generateFiles() throws IOException
{
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
if (!new File(fileName).exists())
{
System.out.println("Creating "+fileName);
writeDoubles(new FileOutputStream(fileName), n);
}
else
{
System.out.println("File "+fileName+" already exists");
}
}
}
    private static void writeDoubles(OutputStream os, int n) throws IOException
    {
        // Close the writer (and thus the stream) via try-with-resources
        try (OutputStreamWriter writer = new OutputStreamWriter(os))
        {
            Random random = new Random(0);
            int numbersPerLine = random.nextInt(4) + 1;
            for (int i=0; i<n; i++)
            {
                writer.write(String.valueOf(random.nextDouble()));
                numbersPerLine--;
                if (numbersPerLine == 0)
                {
                    writer.write("\n");
                    numbersPerLine = random.nextInt(4) + 1;
                }
                else
                {
                    writer.write(" ");
                }
            }
        }
    }
}
It compares 4 methods:

- Reading with a `Scanner`, as in your original code snippet
- Reading with a `StreamTokenizer`
- Reading the whole file into a `String`, and dissecting it with a `StringTokenizer`
- Reading the file as a `Stream` of lines, which are then flat-mapped to a `Stream` of tokens, which are then mapped to a `DoubleStream`
Reading the file as one large `String` may not be appropriate in all cases: when the files become (much) larger, then keeping the whole file in memory as a `String` may not be a viable solution.
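If that is a concern, a middle ground could be to tokenize the input line by line, so that only one line has to be kept in memory at a time. A sketch of such a variant, which could be added as a further method to the test program above:

    private static double[] readLineByLine(
        String fileName, int size) throws IOException
    {
        double array[] = new double[size];
        int index = 0;
        try (
            InputStream is = new FileInputStream(fileName);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr))
        {
            String line = null;
            while ((line = br.readLine()) != null)
            {
                // Only the current line is kept in memory here
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens())
                {
                    array[index++] = Double.parseDouble(st.nextToken());
                }
            }
        }
        return array;
    }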
A test run (on a rather old PC, with a slow hard disk drive and no solid-state drive) showed roughly these results:
...
size = 1024000, readWithScanner 9932.940919, result [D@1c7353a
size = 1024000, readWithStreamTokenizer 1187.051427, result [D@1a9515
size = 1024000, readWithBufferAndStringTokenizer 1172.235019, result [D@f49f1c
size = 1024000, readWithStream 2197.785473, result [D@1469ea2
...
Obviously, the `Scanner` imposes a considerable overhead that can be avoided by reading more directly from the stream.
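(In the extreme case, "reading more directly" could even mean tokenizing the character stream by hand, without any regular expressions. Only as a rough sketch, assuming plain whitespace-separated numbers and omitting any error handling:)

    private static double[] readByHand(
        String fileName, int size) throws IOException
    {
        double array[] = new double[size];
        int index = 0;
        try (
            InputStream is = new FileInputStream(fileName);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr))
        {
            StringBuilder token = new StringBuilder();
            int c;
            while ((c = br.read()) != -1)
            {
                if (Character.isWhitespace(c))
                {
                    // A whitespace character ends the current token
                    if (token.length() > 0)
                    {
                        array[index++] = Double.parseDouble(token.toString());
                        token.setLength(0);
                    }
                }
                else
                {
                    token.append((char) c);
                }
            }
            // Handle a trailing token that is not followed by whitespace
            if (token.length() > 0)
            {
                array[index++] = Double.parseDouble(token.toString());
            }
        }
        return array;
    }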
This may not be the final answer, as there may be more efficient and/or more elegant solutions (and I'm looking forward to seeing them!), but maybe it is helpful at least.
EDIT
A small remark: There is a certain conceptual difference between the approaches in general. Roughly speaking, the difference lies in who determines the number of elements that are read. In pseudocode, this difference is
double array[] = new double[size];
for (int i=0; i<size; i++)
{
array[i] = readDoubleFromInput();
}
versus
double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput())
{
double d = readDoubleFromInput();
array[index++] = d;
}
Your original approach with the scanner was written like the first one, while the solutions that I proposed are more similar to the second. But this should not make a large difference here, assuming that the `size` is indeed the real size, and potential errors (like too few or too many numbers in the input) don't appear or are handled in some other way.
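If the size is *not* known beforehand, the second pattern can be combined with an array that grows on demand. A sketch (additionally requiring an import of `java.util.Arrays`):

    private static double[] readUnknownSize(String fileName) throws IOException
    {
        double array[] = new double[16];
        int index = 0;
        try (
            InputStream is = new FileInputStream(fileName);
            InputStreamReader isr = new InputStreamReader(is);
            BufferedReader br = new BufferedReader(isr))
        {
            String line = null;
            while ((line = br.readLine()) != null)
            {
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens())
                {
                    if (index == array.length)
                    {
                        // Double the capacity when the array is full
                        array = Arrays.copyOf(array, array.length * 2);
                    }
                    array[index++] = Double.parseDouble(st.nextToken());
                }
            }
        }
        // Trim the array to the number of elements that were actually read
        return Arrays.copyOf(array, index);
    }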