15

I have to read a 53 MB file character by character. When I do it in C++ using ifstream, it completes in milliseconds, but using Java's InputStream it takes several minutes. Is it normal for Java to be this slow, or am I missing something?

Also, I need to complete the program in Java (it uses servlets, from which I have to call the functions that process these characters). I was thinking of writing the file-processing part in C or C++ and then using the Java Native Interface to call those functions from my Java program... What do you think of this idea?

Can anyone give me any other tips? I seriously need to read the file faster. I tried using buffered input, but it still doesn't come anywhere close to the C++ performance.

Edit: My code spans several files and is quite messy, so here is a synopsis:

import java.io.*;

public class tmp {
    public static void main(String args[]) {
        try {
            InputStream file = new BufferedInputStream(new FileInputStream("1.2.fasta"));
            char ch;
            while (file.available() != 0) {
                ch = (char) file.read();
                /* Do processing */
            }
            System.out.println("DONE");
            file.close();
        } catch (Exception e) {}
    }
}
pflz
  • 1,891
  • 4
  • 26
  • 32
  • 1
    Show us your code. We can't guess your problem without seeing how you are doing things. – Guillaume Polet May 06 '12 at 20:11
  • 1
    Are you using `BufferedInputStream`? You should use that over `BufferedReader`. Are your access patterns such that you can memory map portions of the file using `java.nio`? Specifically, when you say "`char` by `char`", do you know enough about the encoding to deal with `char`s whose byte sequences might spread across multiple memory mapped segments? – Mike Samuel May 06 '12 at 20:11
  • 1
    There's no way just reading those 53M chars and not doing anything else could take more than a couple of seconds, buffering or no buffering. There's surely something else. – Marko Topolnik May 06 '12 at 20:12
  • Maybe your buffer array size is too small or too big – Luiggi Mendoza May 06 '12 at 20:12
  • @MikeSamuel Yes I used BufferedInputStream as well.. – pflz May 06 '12 at 20:16
  • @MarkoTopolnik I thought that there was a problem with something different. But I tried to read the file doing nothing else using InputStream and it still took 2 minutes – pflz May 06 '12 at 20:16
  • Can you please post your code as @GuillaumePolet asked. – Krrose27 May 06 '12 at 20:17
  • Indeed, I'm testing right now on OS X, the performance is about a MB per second -- a minute for your file. That's with a raw `FileInputStream`. But as soon as I wrap in a `BufferedInputStream`, performance rockets to 183 MB in 10 seconds -- 20 MB/s. Note that you cannot cast a byte into a char like that, except if you are reading a pure ASCII stream. – Marko Topolnik May 06 '12 at 20:24
  • @MarkoTopolnik I need it to execute way faster like in C++. Is there no way except creating the program in C++? – pflz May 06 '12 at 20:27
  • 2
    Reading character by character is probably your problem right there. – Louis Wasserman May 06 '12 at 20:28
  • Yes, it's an ASCII stream. OK, I am going to try BufferedInputStream. – pflz May 06 '12 at 20:29
  • Are you sure you are actually reading characters in your C++ program, or did your compiler erase the useless code? – M Platvoet May 06 '12 at 20:31
  • Now I tried by using the `read(byte[])` method, using a 1000-byte array. Performance was 340 ms for 183 MB, so for your case it would be around 100 ms. – Marko Topolnik May 06 '12 at 20:31
  • @MarkoTopolnik In the code posted, I have used a BufferedInputStream object... is it the same as what you are doing for speedup? – pflz May 06 '12 at 20:32
  • No, using a `BufferedInputStream` wrapper or not buys you quite little compared to invoking `read(int)` vs. `read(byte[])`. – Marko Topolnik May 06 '12 at 20:33
  • 6
    You are using `file.available()` incorrectly. Read into an `int` and test for end of stream instead: `int c; while ((c = file.read()) != -1) { ch = (char) c; /* Do processing */ }` – user845279 May 06 '12 at 20:41
  • FYI this title says large files. If your files are larger than Integer.MAX - 8 bytes (~3.2GB) you'll get an integer overflow resulting in a NegativeArraySizeException https://bugs.openjdk.java.net/browse/JDK-7129312 – zdsbs Jan 09 '14 at 21:01

4 Answers

17

I ran this code with a 183 MB file. It printed "Elapsed 250 ms".

final InputStream in = new BufferedInputStream(new FileInputStream("file.txt"));
final long start = System.currentTimeMillis();
int cnt = 0;
final byte[] buf = new byte[1000];
while (in.read(buf) != -1) cnt++;
in.close();
System.out.println("Elapsed " + (System.currentTimeMillis() - start) + " ms");
Marko Topolnik
  • 195,646
  • 29
  • 319
  • 436
  • Nice. Also, I need to process the file character by character. So instead of reading individual characters from the file, I will retrieve it from the buffer and if it runs out, fill it again. Thanks a lot :) – pflz May 06 '12 at 20:42
  • Yes, I think Java gets bogged down on method dispatch whereas C++ maybe even inlines the calls. Sometimes after enough calls HotSpot inlines the calls as well, but I can't be sure for this case. – Marko Topolnik May 06 '12 at 20:44
  • 3
    @MarkoTopolnik There is no evidence here that Java is getting 'bogged down' on anything except calling InputStream.available() 53 million times, which is 53 million redundant system calls. As he is using a BufferedInputStream the number of system calls to actually read the file is 53/8192 million, so calling available() is an immense overhead. – user207421 May 06 '12 at 23:49
  • @EJP I was talking about my own tests. It's in the comments to the question. Changing from `BufferedInputStream.read()` to `read(byte[1000])` on either the raw or the buffered stream gives a 50-fold speed boost. – Marko Topolnik May 07 '12 at 05:40
  • 1
    @MarkoTopolnik I get 676ms with `BufferedInputStream.read(byte[])`; 2441ms with `FileInputStream.read(byte[])`, and 610ms via `BufferedInputStream.read()` (one byte at a time). Different files with random contents (to avoid caching), all 183MB, 1000 byte buffer. Java 6. Results are pretty consistent over several runs. No sign of a 50x problem. – user207421 May 07 '12 at 06:04
  • @EJP ...and your OS is...? I think that's key here. – Marko Topolnik May 07 '12 at 08:42
  • 2
    @EJP OK, so we might be talking about shitty implementation on OS X. After I read your result, I double-checked mine. It's real. – Marko Topolnik May 08 '12 at 07:55
3

I would try this

// create the file so we have something to read.
final String fileName = "1.2.fasta";
FileOutputStream fos = new FileOutputStream(fileName);
fos.write(new byte[54 * 1024 * 1024]);
fos.close();

// read the file in one hit.
long start = System.nanoTime();
FileChannel fc = new FileInputStream(fileName).getChannel();
ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
while (bb.remaining() > 0)
    bb.getLong();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to read %.1f MB%n", time / 1e9, fc.size() / 1e6);
fc.close();
((DirectBuffer) bb).cleaner().clean();

prints

Took 0.016 seconds to read 56.6 MB
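
If the file length is not a multiple of 8, the `getLong()` loop above underflows at the tail; a minimal sketch of the byte-at-a-time variant using `get()`, reading the same file:

FileChannel fc2 = new FileInputStream(fileName).getChannel();
ByteBuffer bb2 = fc2.map(FileChannel.MapMode.READ_ONLY, 0, fc2.size());
while (bb2.hasRemaining()) {
    byte b = bb2.get();
    // process b
}
fc2.close();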
Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • DirectBuffer symbol was not found. So I removed the last line but running it threw a java.nio.BufferUnderflowException. (53.4 MB file) – pflz May 06 '12 at 21:00
  • 1
    I am reading 8 bytes at a time to speed it up, which is no good if the length is not a multiple of 8. You can use `bb.get()` instead. DirectBuffer is in `sun.nio.ch`, which makes it an internal-use API that could be dropped. – Peter Lawrey May 06 '12 at 21:12
2

Use a BufferedInputStream:

InputStream buffy = new BufferedInputStream(inputStream);
Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • 2
    I used a BufferedInputStream as well. InputStream fh = new BufferedInputStream(new FileInputStream("file")); – pflz May 06 '12 at 20:18
1

As noted above, use a BufferedInputStream. You could also use the NIO package. Note that for most files, a BufferedInputStream reads just as fast as NIO. However, for extremely large files NIO may do better, because you can memory-map the file. Furthermore, the NIO package does interruptible I/O, whereas the java.io package does not; that means that if you want to cancel the operation from another thread, you have to use NIO to make it reliable.

final int BUF_SIZE = 8192; // example buffer size
FileInputStream fileInputStream = new FileInputStream("1.2.fasta");
FileChannel fileChannel = fileInputStream.getChannel();
ByteBuffer buf = ByteBuffer.allocate(BUF_SIZE);
int readCount;
while ((readCount = fileChannel.read(buf)) > 0) {
    buf.flip();
    while (buf.hasRemaining()) {
        byte b = buf.get();
        // process b
    }
    buf.clear();
}
fileChannel.close();
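
To illustrate the interruptible-I/O point, here is a minimal sketch, assuming `Thread.interrupt()` is used as the cancellation signal: interrupting the thread that is blocked in `FileChannel.read()` closes the channel, and the read ends with a `ClosedByInterruptException` (from `java.nio.channels`).

Thread reader = new Thread(new Runnable() {
    public void run() {
        try {
            FileChannel ch = new FileInputStream("1.2.fasta").getChannel();
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (ch.read(buf) != -1) {
                buf.clear(); // discard the data; this sketch only demonstrates cancellation
            }
            ch.close();
        } catch (ClosedByInterruptException e) {
            System.out.println("read cancelled"); // the interrupt closed the channel
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
});
reader.start();
Thread.sleep(10);   // let the read run briefly...
reader.interrupt(); // ...then cancel it from another thread
reader.join();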
Matt
  • 11,523
  • 2
  • 23
  • 33
  • 1
    I don't think memory-mapped files are any benefit for sequential reading. – Marko Topolnik May 06 '12 at 20:38
  • @MarkoTopolnik They're not much more than a 20% time benefit for anything, but I don't know why you think sequential reading is a special case. It isn't. Disks still do read-ahead, just as they do when you are using a stream or a reader. – user207421 May 07 '12 at 00:11
  • @EJP Yeah, but read-ahead is done anyway, at a lower level (even within the disk electronics, and also in the disk cache implementation). – Marko Topolnik May 07 '12 at 05:37
  • @MarkoTopolnik *Why* aren't MM files a benefit for sequential reading? – user207421 May 07 '12 at 05:44
  • 1
    @EJP MM files are primarily a convenience for random-access reading because they give a simple API to accessing the file as if it were an array in RAM. If all you do is run through an MM file top to bottom, you're just going to stress the memory manager and not receive any benefit from the paradigm. – Marko Topolnik May 07 '12 at 07:02
  • @EJP I'm all ears for your killer argument, though. Hopefully with code to play with! – Marko Topolnik May 07 '12 at 07:38