4

I've got a wrapper for BufferedReader that reads files one after another to create an uninterrupted stream across multiple files:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;

/**
 * reads in a whole bunch of files such that when one ends it moves to the
 * next file.
 * 
 * @author isaak
 *
 */
class LogFileStream implements FileStreamInterface {
    private ArrayList<String> fileNames;
    private BufferedReader br;
    private boolean done = false;

    /**
     * 
     * @param files an array list of files to read from, order matters.
     * @throws IOException
     */
    public LogFileStream(ArrayList<String> files) throws IOException {
        fileNames = new ArrayList<String>();
        for (int i = 0; i < files.size(); i++) {
            fileNames.add(files.get(i));
        }
        setFile();
    }

    /**
     * advances the file that this class is reading from.
     * 
     * @throws IOException
     */
    private void setFile() throws IOException {
        if (fileNames.size() == 0) {
            this.done = true;
            return;
        }
        if (br != null) {
            br.close();
        }
        //if the file is a .gz file do a little extra work.
        //otherwise read it in with a standard file Reader
        //in either case, set the buffer size to 128kb
        if (fileNames.get(0).endsWith(".gz")) {
            InputStream fileStream = new FileInputStream(fileNames.get(0));
            InputStream gzipStream = new GZIPInputStream(fileStream);
            // TODO this probably needs to be modified to work well on any
            // platform, UTF-8 is standard for debian/novastar though.
            Reader decoder = new InputStreamReader(gzipStream, "UTF-8");
            // note that the buffer size is set to 128kb instead of the standard
            // 8kb.
            br = new BufferedReader(decoder, 131072);
            fileNames.remove(0);
        } else {
            FileReader filereader = new FileReader(fileNames.get(0));
            br = new BufferedReader(filereader, 131072);
            fileNames.remove(0);
        }
    }

    /**
     * returns true if there are more lines available to read.
     * @return true if there are more lines available to read.
     */
    public boolean hasMore() {
        return !done;
    }

    /**
     * Gets the next line from the correct file.
     * @return the next line from the files, if there isn't one it returns null
     * @throws IOException
     */
    public String nextLine() throws IOException {
        if (done == true) {
            return null;
        }
        String line = br.readLine();
        if (line == null) {
            setFile();
            return nextLine();
        }
        return line;
    }
}

If I construct this object on a large list of files (300MB worth of files) and then print nextLine() over and over again in a while loop, performance continually degrades until there is no more RAM to use. This happens even if I'm reading files that are ~500kb each and using a virtual machine that has 32MB of memory.
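Roughly, the consuming loop looks like this (a simplified sketch; the real program does more with each line, and the file names here are placeholders):

    ArrayList<String> files = new ArrayList<String>();
    files.add("log-2016-05-01.csv.gz");
    files.add("log-2016-05-02.csv");

    LogFileStream stream = new LogFileStream(files);
    while (stream.hasMore()) {
        String line = stream.nextLine();
        if (line != null) {
            System.out.println(line);
        }
    }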

I want this code to be able to run on positively massive data sets (hundreds of gigabytes worth of files), and it is a component of a program that needs to run with 32MB or less of memory.

The files that are used are mostly labeled CSV files, hence the use of Gzip to compress them on disk. This reader needs to handle gzip and uncompressed files.

Correct me if I'm wrong, but once a file has been read through and has had its lines spat out, the data from that file, the objects related to that file, and everything else should be eligible for garbage collection?

Isaak
    Is this relevant to `C++` at all? – Galik Jun 02 '16 at 08:11
    You might want to use `fileNames.addAll(files);` in your constructor. – Kayaman Jun 02 '16 at 08:19
  • I would look at a heap dump to see where memory is being retained. From what you say, it appears the problem could be somewhere else in your code. – Peter Lawrey Jun 02 '16 at 08:38
  • At this point the quickest way to figure out what's wrong is to run in an IDE debugger and set some breakpoints. If you stop after processing a few hundred files you should be able to easily find the leaked memory. – Jim Garrison Jun 02 '16 at 08:40

5 Answers

1

With Java 8, GZIP support has moved from Java code to native zlib usage.

Non-closed GZIP streams leak native memory (I really do mean "native", not "heap", memory), and it is far from easy to diagnose. Depending on how the application uses such streams, the operating system may reach its memory limit quite fast.

The symptom is that the operating-system process memory usage is not consistent with the JVM memory usage reported by Native Memory Tracking: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/nmt-8.html

You will find the full story at http://www.evanjones.ca/java-native-leak-bug.html
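As an illustration of that pattern (not the asker's exact code), a loop like the following keeps allocating native zlib inflaters that are only released when each stream is closed or eventually finalized:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class NativeLeakSketch {
        public static void main(String[] args) throws IOException {
            for (int i = 0; i < 100_000; i++) {
                // "sample.gz" is a placeholder; each constructor call allocates a native inflater
                GZIPInputStream in = new GZIPInputStream(new FileInputStream("sample.gz"));
                in.read();
                // in.close() is deliberately omitted: the Java heap stays small,
                // but the process's native memory keeps growing until finalization catches up.
            }
        }
    }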

Yves Martin
0

The last call to setFile won't close your BufferedReader, so you are leaking resources.

Indeed, in nextLine you read the first file until the end. When the end is reached you call setFile and check whether there are more files to process. However, if there are no more files you return immediately without closing the last BufferedReader used.

Furthermore, if you don't process all the files you will still have a resource in use.
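A minimal sketch of the reordering this implies (closing the previous reader before the early return; not a tested drop-in replacement for the asker's method):

    private void setFile() throws IOException {
        // always release the previous file's reader (and the streams it wraps) first
        if (br != null) {
            br.close();
            br = null;
        }
        if (fileNames.isEmpty()) {
            this.done = true;
            return;
        }
        // ... open fileNames.get(0) as before and remove it from the list
    }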

JEY
  • You are correct. I fixed that, but it doesn't account for my problem because I only ever create one of these objects and read hundreds of files through it. – Isaak Jun 02 '16 at 16:01
0

There is at least one leak in your code: the method setFile() does not close the last BufferedReader because the if (fileNames.size() == 0) check comes before the if (br != null) check.

However, this could lead to the described effect only if LogFileStream is instantiated multiple times.

It would also be better to use a LinkedList instead of an ArrayList, as fileNames.remove(0) is more 'expensive' on an ArrayList than on a LinkedList. You could instantiate it with a single line in the constructor: fileNames = new LinkedList<>(files);
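For illustration, the changed members could look roughly like this (a sketch, with the field and parameter widened to List so either implementation fits; assumes java.util.LinkedList and java.util.List are imported):

    private List<String> fileNames;

    public LogFileStream(List<String> files) throws IOException {
        // copying into a LinkedList makes remove(0) an O(1) operation
        fileNames = new LinkedList<>(files);
        setFile();
    }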

Vilmantas Baranauskas
0

Every once in a while, you could flush() or close() the BufferedReader. This will clear the reader's contents, so maybe flush the reader every time you use the setFile() method. Then, just before every call like br = new BufferedReader(decoder, 131072), close() the old BufferedReader.

JD9999
-1

The GC starts to work after you close your connection/reader. If you are using Java 7 or above, you may want to consider using the try-with-resources statement, which is a better way to deal with IO operations: https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html
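A minimal sketch of that pattern for one gzipped file (the file name is a placeholder, and the 128kb buffer mirrors the question's code):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    public class TryWithResourcesSketch {
        public static void main(String[] args) throws IOException {
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(
                            new GZIPInputStream(new FileInputStream("sample.csv.gz")), "UTF-8"),
                    131072)) {
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            } // br and the streams it wraps are closed here, even if an exception is thrown
        }
    }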

Mark Xie