21

I understand that using a BufferedReader (wrapping a FileReader) is going to be significantly slower than using a BufferedInputStream (wrapping a FileInputStream), because the raw bytes have to be converted to characters. But I don't understand why it is so much slower! Here are the two code samples that I'm using:

BufferedInputStream inputStream = new BufferedInputStream(new FileInputStream(filename));
try {
  byte[] byteBuffer = new byte[bufferSize];
  int numberOfBytes;
  do {
    numberOfBytes = inputStream.read(byteBuffer, 0, bufferSize);
  } while (numberOfBytes >= 0);
}
finally {
  inputStream.close();
}

and:

BufferedReader reader = new BufferedReader(new FileReader(filename), bufferSize);
try {
  char[] charBuffer = new char[bufferSize];
  int numberOfChars;
  do {
    numberOfChars = reader.read(charBuffer, 0, bufferSize);
  } while (numberOfChars >= 0);
}
finally {
  reader.close();
}

I've tried tests using various buffer sizes, all with a 150 megabyte file. Here are the results (buffer size is in bytes; times are in milliseconds):

Buffer   Input
  Size  Stream  Reader
 4,096    145     497
 8,192    125     465
16,384     95     515
32,768     74     506
65,536     64     531

As can be seen, the fastest time for the BufferedInputStream (64 ms) is seven times faster than the fastest time for the BufferedReader (465 ms). As I stated above, I don't have an issue with a significant difference; but this much difference just seems unreasonable.

My question is: does anyone have a suggestion for how to improve the performance of the BufferedReader, or an alternative mechanism?

Andy King
    I think the most likely explanation is that your benchmark is flawed; e.g. you are not taking proper account of JVM warmup effects. Please post the complete thing. – Stephen C Jan 13 '13 at 06:17
  • @StephenC or maybe disk cache? – John Dvorak Jan 13 '13 at 06:22
    You're comparing apples and oranges--the second test involves converting bytes to `char`, which the first doesn't do. If you need `char` data, use a `Reader`; if you need bytes, use an `InputStream`. I think you'll find that the fastest of all will be a `BufferedReader` wrapping an `InputStreamReader` wrapping a `BufferedInputStream` wrapping a `FileInputStream`. Also take a look at [this thread](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) on how to write a benchmark. – Ted Hopp Jan 13 '13 at 06:24
  • The result may also depend on the character encoding that is used. – Henry Jan 13 '13 at 06:33
  • @StephenC I am not suggesting that my "benchmark" is very scientific, but I don't think the difference is a result of JVM startup, GC execution, or anything of that sort ... I ran the code in loops, and took the average time over a much larger sample; also, both tests were run in the same JVM (it happens that the BufferedInputStream is executed first, but that doesn't seem to be important). Please explain why you think the timings are flawed. – Andy King Jan 13 '13 at 06:46
Without seeing your actual code, I can't give you a full explanation. But my main reasons for thinking this are 1) the times you are reporting seem implausible to me, and 2) you haven't responded to the JVM warmup theory ... which suggests that you don't understand its significance. Just post the code ... so that we can see what you are actually doing, and try to reproduce it. – Stephen C Jan 13 '13 at 06:48
  • @Jan Dvorak Even if there is disk caching involved, I don't think this has any relevance ... as I have stated in a previous comment, the code for the BufferedInputStream runs in the same execution run as the code for the BufferedReader. I don't actually think that the 150MB file is being cached, but perhaps it is ... yet how does this explain the difference in time between the character and byte processing? – Andy King Jan 13 '13 at 06:48
  • @TedHopp Yes, as I tried to explain in my question, I understand that there is a significant difference between processing raw bytes and characters. It just seems that a seven-fold difference in performance is more than I would expect. And I have a feeling that your suggestion to wrap the FileInputStream in three layers is not serious ... if it is, just let me know and I'll try it! – Andy King Jan 13 '13 at 06:51
  • The suggestion was perfectly serious. I did some experiments some time ago and was surprised at the results. It's a second-order improvement, but seemed to be definitely there. – Ted Hopp Jan 13 '13 at 07:28
  • Is the buffer size the same in both cases? In bytes rather than in absolute value? Are you running both tests in the same JVM? And if so, in which order? Have you tried different size arguments when constructing the BufferedInputStream/Reader? – user207421 Jan 13 '13 at 07:37
@EJP The magnitude of the size of the buffers was the same, but consequently not the physical size ... the char array is using the size of a char (I think that's four bytes on my machine) and the byte array is using a byte. This may explain the differences in the relative speeds when the buffer size changes. Both tests are executed in the same method in the same execution of the program (and my test harness runs them more than just once). – Andy King Jan 13 '13 at 08:23
@StephenC Thank you for your comments ... why do you think that the times are implausible? And why do you think this may be affected by JVM warmup? The tests are running in the same JVM, in the same execution of the program, and the faster test is executed first (I would expect that JVM warmup would cause the earlier test to be slower). When I reverse the order of the tests I see no difference in the times. The only reason that I hesitate to post the code is that it is a small part of a larger program ... I could isolate it and post that, I suppose. You could try the code that I've posted. – Andy King Jan 13 '13 at 08:29
  • @TedHopp I tried the BufferedReader+InputStreamReader+BufferedInputStream+FileInputStream approach, and the results were within a few milliseconds of the simple BufferedReader+FileReader test (for both small and large buffers). – Andy King Jan 13 '13 at 08:41
  • Hm. I guess I need to revisit my testing. – Ted Hopp Jan 13 '13 at 09:10

2 Answers

15

The BufferedReader has to convert the bytes into chars. Parsing the data byte by byte and copying it into a larger type is expensive relative to a straight copy of blocks of data.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] bytes = new byte[150 * 1024 * 1024];
Arrays.fill(bytes, (byte) '\n');

for (int i = 0; i < 10; i++) {
    long start = System.nanoTime();
    // Decode the whole 150 MB buffer in one pass, isolating the
    // byte-to-char conversion cost from any file I/O.
    StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
    long time = System.nanoTime() - start;
    System.out.printf("Time to decode %,d MB was %,d ms%n",
            bytes.length / 1024 / 1024, time / 1000000);
}

prints

Time to decode 150 MB was 226 ms
Time to decode 150 MB was 167 ms

NOTE: Having to do this intermixed with system calls can slow down both operations (as system calls can disturb the cache)
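One variation worth benchmarking (a sketch, not a guaranteed win; the `ISO-8859-1` choice, the class name, and the temp-file setup are my assumptions) is to replace FileReader with an InputStreamReader using an explicit single-byte charset, so each byte maps directly to one char and the decoder carries no multi-byte state between buffers. This only applies if your file's encoding really is a single-byte one:

```java
import java.io.*;
import java.nio.file.*;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    static final int BUFFER_SIZE = 65536;

    // Reads the whole file through a BufferedReader with an explicit
    // single-byte charset and returns the number of chars decoded.
    static long countChars(Path file) throws IOException {
        long total = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file),
                        StandardCharsets.ISO_8859_1), BUFFER_SIZE)) {
            char[] buf = new char[BUFFER_SIZE];
            int n;
            while ((n = reader.read(buf, 0, BUFFER_SIZE)) >= 0) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Small sample file so the sketch is self-contained; point it at a
        // 150 MB file to reproduce the benchmark in the question.
        Path file = Files.createTempFile("charset-read", ".txt");
        Files.write(file, "hello\nworld\n".getBytes(StandardCharsets.ISO_8859_1));
        long start = System.nanoTime();
        long chars = countChars(file);
        System.out.printf("Decoded %d chars in %,d ms%n",
                chars, (System.nanoTime() - start) / 1_000_000);
        Files.delete(file);
    }
}
```

Whether this beats FileReader depends on the JDK's decoder implementations, so treat it as one more data point for the benchmark rather than a known fix.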

Peter Lawrey
3

In the BufferedReader implementation there is a fixed constant, defaultExpectedLineLength = 80, which is used in the readLine method when allocating the StringBuffer. If you have a big file with lots of lines longer than 80 characters, this fragment might be something that can be improved:

if (s == null) 
    s = new StringBuffer(defaultExpectedLineLength);
s.append(cb, startChar, i - startChar);
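If that per-line allocation does turn out to matter for your workload, one workaround (a sketch under my own assumptions; the class name and newline-only line convention are mine, and note the question's code already reads into a char[]) is to skip readLine() entirely and scan a reusable buffer for line boundaries yourself:

```java
import java.io.*;
import java.nio.file.*;
import java.nio.charset.StandardCharsets;

public class CountLinesNoReadLine {
    // Counts lines by scanning a reusable char[] for '\n', so no String or
    // StringBuffer is allocated per line the way readLine() does.
    static long countLines(Path file) throws IOException {
        long lines = 0;
        try (Reader reader = new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.ISO_8859_1)) {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) >= 0) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        lines++;  // line boundary found; nothing allocated
                    }
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Self-contained sample input; substitute your own file to measure.
        Path file = Files.createTempFile("lines", ".txt");
        Files.write(file, "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("lines = " + countLines(file));  // prints: lines = 3
        Files.delete(file);
    }
}
```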
Jakub C