
My aim is to read from a large file, process two lines at a time, and write the result to new files. These files can get very large, from 1GB to 150GB in size, so I'd like to do this processing using the least RAM possible.

The processing is very simple: each line is split on a tab delimiter, certain elements are selected, and the new string is written to the new files.

So far I have attempted using a BufferedReader to read the file and a PrintWriter to output the lines to a file:

    while ((line1 = br.readLine()) != null) {
        if (!line1.startsWith("@")) {
            line2 = br.readLine();
            recordCount++;
            one.println(String.format("%s\n%s\n+\n%s", line1.split("\t")[0] + ".1", line1.split("\t")[9], line1.split("\t")[10]));
            two.println(String.format("%s\n%s\n+\n%s", line2.split("\t")[0] + ".2", line2.split("\t")[9], line2.split("\t")[10]));
        }
    }
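
(As a side note, each line above is split three times; splitting once and reusing the resulting array avoids the redundant work. A minimal variant, using the same br, one, and two:)

    String[] f1, f2;
    while ((line1 = br.readLine()) != null) {
        if (!line1.startsWith("@")) {
            line2 = br.readLine();
            recordCount++;
            f1 = line1.split("\t");  // split each line exactly once
            f2 = line2.split("\t");
            one.println(String.format("%s\n%s\n+\n%s", f1[0] + ".1", f1[9], f1[10]));
            two.println(String.format("%s\n%s\n+\n%s", f2[0] + ".2", f2[9], f2[10]));
        }
    }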

I have also attempted to use Java 8 Streams to read and write the file:

    stream.forEach(line -> {
        if (!line.startsWith("@")) {
            try {
                if (counter.getAndIncrement() % 2 == 0)
                    Files.write(path1, String.format("%s\n%s\n+\n%s", line.split("\t")[0] + ".1", line.split("\t")[9], line.split("\t")[10]).getBytes(), StandardOpenOption.APPEND);
                else
                    Files.write(path2, String.format("%s\n%s\n+\n%s", line.split("\t")[0] + ".2", line.split("\t")[9], line.split("\t")[10]).getBytes(), StandardOpenOption.APPEND);
            } catch (IOException ioe) {
                // note: I/O errors are silently swallowed here
            }
        }
    });
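
(For what it's worth, Files.write with StandardOpenOption.APPEND reopens the target file for every record, which is costly but should not hold data in RAM. A sketch that instead keeps one buffered writer per output open for the whole run — inputPath is an assumed Path to the source file, path1/path2 as above:)

    try (Stream<String> stream = Files.lines(inputPath);
            BufferedWriter one = Files.newBufferedWriter(path1);
            BufferedWriter two = Files.newBufferedWriter(path2)) {
        AtomicLong counter = new AtomicLong();
        stream.filter(line -> !line.startsWith("@")).forEach(line -> {
            String[] f = line.split("\t");                       // split once per line
            boolean first = counter.getAndIncrement() % 2 == 0;  // alternate outputs
            try {
                (first ? one : two).write(String.format("%s\n%s\n+\n%s\n",
                        f[0] + (first ? ".1" : ".2"), f[9], f[10]));
            } catch (IOException ioe) {
                throw new UncheckedIOException(ioe);
            }
        });
    }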

Finally, I have tried to use an InputStream and a Scanner to read the file and a PrintWriter to output the lines:

    inputStream = new FileInputStream(inputFile);
    sc = new Scanner(inputStream, "UTF-8");
    String line1, line2;

    PrintWriter one = new PrintWriter(new FileOutputStream(dotOne));
    PrintWriter two = new PrintWriter(new FileOutputStream(dotTwo));

    while (sc.hasNextLine()) {
        line1 = sc.nextLine();
        if (!line1.startsWith("@")) {
            line2 = sc.nextLine();
            one.println(String.format("%s\n%s\n+\n%s", line1.split("\t")[0] + ".1", line1.split("\t")[9], line1.split("\t")[10]));
            two.println(String.format("%s\n%s\n+\n%s", line2.split("\t")[0] + ".2", line2.split("\t")[9], line2.split("\t")[10]));
        }
    }

The issue I'm facing is that the program seems to be holding either the data to be written or the input file's contents in RAM.

All of the above methods do work, but use more RAM than I'd like them to.

Thanks in advance,

Sam

  • Hm, what makes you think it's keeping too much data in memory? The examples above should be perfectly acceptable, assuming that the maximum length of the line is reasonable. – Gorazd Rebolj Nov 22 '17 at 15:40
  • The input file that I'm currently using is 700MB. When I run the program and watch the memory usage, it shoots up to 4-5GB. I've commented out the lines that write to file and the memory used is under 500MB. – Sam Nov 22 '17 at 15:43
  • Is this heap memory you're monitoring? – Gorazd Rebolj Nov 22 '17 at 15:45
  • Yes, I'm printing out Runtime.getRuntime().totalMemory(). I've reverted the code back to the first example in my question, and the result is 6237978624 - so 6.2GB. – Sam Nov 22 '17 at 15:49
  • Since it's the PrintWriter, I'm assuming it's storing the data in RAM until it can write it to the new file? – Sam Nov 22 '17 at 15:52
  • Also see [Standard concise way to copy a file in Java?](https://stackoverflow.com/q/106770/608639), [File copy/move methods and approaches explanation, comparison](https://stackoverflow.com/q/31123067/608639), [Reading and writting a large file using Java NIO](https://stackoverflow.com/q/41115869/608639), etc. – jww Sep 11 '18 at 19:10

2 Answers


What you did not try is a MappedByteBuffer. FileChannel.map might be usable for your purpose, as the mapped buffer is not allocated on the Java heap.

Working code with a manually managed byte buffer would be:

    try (FileInputStream fis = new FileInputStream(source);
            FileChannel fic = fis.getChannel();
            FileOutputStream fos = new FileOutputStream(target);
            FileChannel foc = fos.getChannel()) {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        while (true) {
            int nread = fic.read(buffer);
            if (nread == -1) {  // end of input reached
                break;
            }
            buffer.flip();      // switch the buffer from filling to draining
            foc.write(buffer);
            buffer.clear();     // reset for the next read
        }
    }

Using fic.map to map consecutive regions of the file into OS memory seems easy, but I would want to test such more complex code first.
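
A rough, untested sketch of that mapping approach — source and target as above; each map call can cover at most Integer.MAX_VALUE bytes, hence the chunking, and the mapped regions live outside the Java heap:

    try (FileInputStream fis = new FileInputStream(source);
            FileChannel fic = fis.getChannel();
            FileOutputStream fos = new FileOutputStream(target);
            FileChannel foc = fos.getChannel()) {
        long size = fic.size();
        long pos = 0;
        while (pos < size) {
            // map at most 64 MiB of the source at a time
            long chunk = Math.min(size - pos, 64L * 1024 * 1024);
            MappedByteBuffer mbb = fic.map(FileChannel.MapMode.READ_ONLY, pos, chunk);
            foc.write(mbb);
            pos += chunk;
        }
    }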

Joop Eggen
  • When all you want to do is copy files, you don’t need to deal with `map` manually, as [`FileChannel.transferTo(…)`](https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#transferTo-long-long-java.nio.channels.WritableByteChannel-) does it for you. Or just use [`Files.copy(…)`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#copy-java.nio.file.Path-java.nio.file.Path-java.nio.file.CopyOption...-). But when you want to process *lines*, there is no way around using a reader, and manual buffer management has even less benefit. – Holger Nov 22 '17 at 17:13
  • But reading a file line by line and writing to another doesn’t hold more than the currently used lines and some rather small fixed size buffers. When the OP sees a memory usage of 6GiB, the most likely reason is that he *has* such a big heap size and the JVM doesn’t waste CPU cycles for garbage collection when there still are gigabytes of unused RAM. If that’s an issue, just limiting the max heap size could help… However `Runtime.totalMemory()` doesn’t report the used size anyway, so it’s perfectly possible that calling `Runtime.freeMemory()` will reveal that most of that memory is free… – Holger Nov 22 '17 at 17:30
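
A quick way to check the used size Holger mentions, rather than the reserved heap — totalMemory() minus freeMemory():

    Runtime rt = Runtime.getRuntime();
    long used = rt.totalMemory() - rt.freeMemory(); // live objects plus garbage not yet collected
    System.out.println("used heap: " + used / (1024 * 1024) + " MiB");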

When creating the PrintWriter, set autoFlush to true:

new PrintWriter(new FileOutputStream(dotOne), true)

This way the buffered data will be flushed with every println.
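
(Applied to both writers from the question — note this trades throughput for bounded buffering, since every println now forces a flush to disk:)

    PrintWriter one = new PrintWriter(new FileOutputStream(dotOne), true); // autoFlush on println
    PrintWriter two = new PrintWriter(new FileOutputStream(dotTwo), true);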

Gorazd Rebolj