
I am working on an application that reads large amounts of data from a file. Basically, I have a huge file (around 1.5-2 GB) containing different objects (~5 to 10 million of them per file). I need to read all of them and put them into different maps in the app. The problem is that the app runs out of memory while reading the objects at some point. Only when I set it to use -Xmx4096m can it handle the file. But if the file gets larger, it won't be able to do that anymore.

Here's the code snippet:

import java.io.BufferedInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.ObjectInputStream;

String sampleFileName = "sample.file";
FileInputStream fileInputStream = null;
ObjectInputStream objectInputStream = null;
try {
    fileInputStream = new FileInputStream(new File(sampleFileName));
    int bufferSize = 16 * 1024;
    objectInputStream = new ObjectInputStream(new BufferedInputStream(fileInputStream, bufferSize));
    while (true) {
        try {
            Object objectToRead = objectInputStream.readUnshared();
            if (objectToRead == null) {
                break;
            }
            // doing something with the object
        } catch (EOFException eofe) {
            // end of stream reached
            eofe.printStackTrace();
            break;
        } catch (Exception e) {
            e.printStackTrace();
            continue;
        }
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (objectInputStream != null) {
        try {
            objectInputStream.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
    if (fileInputStream != null) {
        try {
            fileInputStream.close();
        } catch (Exception e2) {
            e2.printStackTrace();
        }
    }
}

First of all, I was using objectInputStream.readObject(), and switching to objectInputStream.readUnshared() solved the issue partially. When I increased the memory from 2048 MB to 4096 MB, it started parsing the file. BufferedInputStream is already in use. On the web I have found only examples of how to read lines or bytes, but nothing about reading objects with performance in mind.

How can I read the file without increasing the JVM memory and avoiding the OutOfMemoryError? Is there any way to read objects from the file without keeping anything else in memory?

Kakofonn
    It's simple physics: Bigger files will require more memory. There's no magic out there. Your files do not contain objects - they contain bytes that are mapped to strings that are mapped to objects. – duffymo Aug 28 '17 at 12:36
  • If you can sort data into maps while you are reading the main file, you can use a BufferedReader to read the file line by line, and then use a PrintWriter to append data to a file that already exists or create a new one. – Jure Aug 28 '17 at 12:36
  • If the files are too big, you have no choice but to store them on the F.S. Read this: https://commons.apache.org/proper/commons-jcs/ – Stefano R. Aug 28 '17 at 12:37
  • (1) An embedded database like H2 could be a solution, maybe with JPA/ORM like EclipseLink; just as easy with Java objects. (2) Repeated values could be cached, so repeated String values will use the same String object. – Joop Eggen Aug 28 '17 at 12:58
  • These links might be useful: https://stackoverflow.com/questions/1605332/java-nio-filechannel-versus-fileoutputstream-performance-usefulness https://stackoverflow.com/questions/2356137/read-large-files-in-java – Akhil S Kamath Aug 28 '17 at 13:11

1 Answer


When reading big files, parsing objects, and keeping them in memory, there are several solutions with several tradeoffs:

  1. You can fit all parsed objects into memory for the app deployed on one server. This either requires storing all objects in a very compact form, for example using a single byte or int to store two numbers, or using bit shifting in other data structures; in other words, fitting all objects into the minimum possible space. Alternatively, you can increase the memory of that server (scale vertically). A minimal bit-packing sketch is shown below.
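
    As an illustration of the packing idea, here is a minimal sketch (the helper names are hypothetical, and it assumes both values fit into 16 bits):

        // Pack two 16-bit values into a single int instead of keeping two
        // boxed Integer objects, so each pair costs 4 bytes rather than two
        // full object headers plus references.
        static int pack(int high, int low) {
            return (high << 16) | (low & 0xFFFF);
        }

        static int unpackHigh(int packed) {
            return packed >>> 16;     // upper 16 bits
        }

        static int unpackLow(int packed) {
            return packed & 0xFFFF;   // lower 16 bits
        }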

    a) However, reading the files can take too much memory, so you have to read them in chunks. For example, this is what I was doing with JSON files:

        // Gson streaming API (com.google.gson.stream.JsonReader): the file is
        // consumed token by token, so only the array element currently being
        // parsed has to be held in memory.
        JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8"));
        reader.beginObject();
        while (reader.hasNext()) {
            String name = reader.nextName();
            if ("content".equals(name)) {
                reader.beginArray();
                parseContentJsonArray(reader, name2ContentMap);
                reader.endArray();
            } else if ("ad".equals(name)) {
                reader.beginArray();
                parsePrerollJsonArray(reader, prerollMap);
                reader.endArray();
            } else {
                reader.skipValue(); // ignore any other top-level keys
            }
        }
        reader.endObject();
        reader.close();

    The idea is to have a way to identify where a certain object starts and ends, and to read only that part.

    b) You can also split the files into smaller ones at the source if you can; then it will be easier to read them.

  2. You can't fit all parsed objects into memory for the app on one server. In this case you have to shard based on some object property, for example splitting the data across multiple servers based on US state. A minimal shard-selection sketch follows below.
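
    As an illustration, a minimal sketch of picking a shard from a key property (hypothetical names; a real setup might use an explicit mapping such as US state to server instead of a hash):

        // Route each object to one of shardCount servers based on a key.
        // Math.floorMod avoids negative indices when hashCode() is negative.
        static int shardFor(String shardKey, int shardCount) {
            return Math.floorMod(shardKey.hashCode(), shardCount);
        }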

Hopefully this helps with your solution.