
I need to read a huge file (15+ GB) and perform some minor modifications (add some newlines so a different parser can actually work with it). You might think that there are already answers for doing this normally, but my entire file is on one line.

My general approach so far is very basic:

int X = 8_000_000; // chunk size; declared here since the snippet referenced it without defining it
long originalFileSize = new File("myFileName").length();
char[] buffer = new char[X];
// ReaderUTF8 is a third-party UTF-8 reader; new InputStreamReader(in, StandardCharsets.UTF_8) is the standard equivalent
BufferedReader reader = new BufferedReader(new ReaderUTF8(new FileInputStream(new File("myFileName"))), X);
// 'writer' was undeclared in the original snippet; the output path is a placeholder
Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("myFileName.out"), StandardCharsets.UTF_8));
// Worst case every char is '}', so the output chunk can be up to twice the size of the input chunk
char[] bufferOut = new char[2 * X];
int charsRead = -1; // read() returns a count of chars, not bytes
int i = 0;
int offset = 0; // number of newlines inserted into the current chunk so far
long totalCharsRead = 0;
int countToPrint = 0;
while((charsRead = reader.read(buffer)) >= 0){
    for(i = 0; i < charsRead; i++){
        if(buffer[i] == '}'){
            // copy the brace, then insert a newline directly after it
            bufferOut[i+offset] = '}';
            offset++;
            bufferOut[i+offset] = '\n';
        }
        else{
            bufferOut[i+offset] = buffer[i];
        }
    }
    writer.write(bufferOut, 0, charsRead+offset);
    offset = 0;
    totalCharsRead += charsRead;
    countToPrint += 1;
    if(countToPrint == 10){
        countToPrint = 0;
        System.out.println("Read "+((double)totalCharsRead / originalFileSize * 100)+" percent.");
    }
}
writer.flush();

After some experimentation, I've found that a value of X larger than a million gives optimal speed: it looks like I'm getting about 2% every 10 minutes, while a value of X of ~60,000 only got through 60% in 15 hours. Profiling reveals that I'm spending 96+% of my time in the read() method, so that's definitely my bottleneck. As of writing this, my X = 8 million version has finished 32% of the file after 2 hours and 40 minutes, in case you want to know how it performs long-term.

Is there a better approach for dealing with such a large, one-line file? As in, is there a faster way of reading this type of file that gives me a relatively easy way of inserting the newline characters?

I am aware that different languages or programs could probably handle this gracefully, but I'm limiting this to a Java perspective.

Jeutnarg
  • You're doing a hell of a lot of array/string copying just to insert those newlines. Wouldn't it be easier to scan a chunk of data for a '}' location, write out what you've got so far, write a newline, then resume scanning? No array manipulation needed AT ALL, just copying chunks of data and writing out occasional newlines (see the sketch after these comments). – Marc B Jun 03 '16 at 18:18
  • Try to operate on the raw bytes. Don't convert to UTF-8, don't use `char[]`. – kennytm Jun 03 '16 at 18:19
  • IO like this should almost always be your bottleneck. You could consider having one thread manage the reading and another thread to do whatever work on the data once it is in memory (i.e., after the buffer is filled), and maybe another to write stuff back if it needs to be changed. Beyond that, the program taking too long can be attributed to how slow reading from disk is. – Chris Sprague Jun 03 '16 at 18:23
  • Looks like you are already reading in "small" (X-sized) chunks of data, doing operations on them and then writing them back out. For the operations, what if you used a tokenizer instead of iterating through each char? – bitsdanceforme Jun 03 '16 at 18:26
  • You don’t need a buffer. You’re already using a BufferedReader. Just read the characters one at a time and look for `'}'`. – VGR Jun 03 '16 at 18:26
  • You're maintaining a buffer only to store the entire raw data in memory and update indices of curly brackets, which is not memory efficient. Try storing just the indices of the curly brackets instead of the entire raw data, and then use that index list to update the original file content. – yonisha Jun 03 '16 at 18:29
  • I'll apply the suggestions from Marc B and VGR and let y'all know the results tomorrow. (Only those suggestions because incremental changes are better, and tomorrow because this takes a long time to run.) – Jeutnarg Jun 03 '16 at 18:36
  • If after making code performance mods you still need a boost, you may find increased parsing performance if the file is copied to tmpfs/ramfs instead of ssd/hdd. – bitsdanceforme Jun 03 '16 at 18:39
  • This is late, but the results have come in. Handling characters one at a time instead of doing my own buffering and using a Scanner did make an improvement, but the accepted answer was 10x better than anything based on my original setup for reads and writes. 10x being metaphorical... I've put more detailed data as a comment on the accepted answer. – Jeutnarg Jun 17 '16 at 17:17
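
For concreteness, here is a minimal sketch of the chunk-scanning idea from Marc B's and kennytm's comments: operate on raw bytes so no UTF-8 decoding happens at all. The class name, file names, and 1 MB chunk size are placeholders rather than anything from the original post, and the approach relies on the fact that in UTF-8 the byte 0x7D is always a literal '}' and never part of a multi-byte sequence.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ChunkScan {

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[1 << 20]; // 1 MB chunks; the size is a guess, tune as needed
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("myFileName"));
                BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream("myFileName.out"))) {
            int n;
            while ((n = in.read(buf)) >= 0) {
                int start = 0;
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '}') { // 0x7D is always a literal '}' in UTF-8
                        out.write(buf, start, i - start + 1); // the chunk up to and including the brace
                        out.write('\n');                      // the inserted newline
                        start = i + 1;
                    }
                }
                out.write(buf, start, n - start); // whatever is left of the chunk
            }
        }
    }
}

Since nothing is ever decoded into chars, the per-character decoding overhead disappears entirely; the only per-chunk work is the scan and a few bulk writes.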

1 Answer


You are making this far more complicated than it should be. By just making use of the buffering already provided by the standard classes, you should get a throughput of at least several MB per second without any hassle.

This simple test program processes 1GB in less than 2 minutes on my PC (including creating the test file):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class TestFileProcessing {

    public static void main(String[] argv) {

        try {
            long time = System.currentTimeMillis();
            File from = new File("C:\\Test\\Input.txt");
            createTestFile(from, StandardCharsets.UTF_8, 1_000_000_000);
            System.out.println("Created file in: " + (System.currentTimeMillis() - time) + "ms");

            time = System.currentTimeMillis();
            File to = new File("C:\\Test\\Output.txt");
            doIt(from, to, StandardCharsets.UTF_8);
            System.out.println("Converted file in: " + (System.currentTimeMillis() - time) + "ms");
        } catch (IOException e) {
            throw new RuntimeException(e.getMessage(), e);
        }
    }

    public static void createTestFile(File file, Charset encoding, long size) throws IOException {
        Random r = new Random(12345);
        try (OutputStream fout = new FileOutputStream(file);
                BufferedOutputStream bout = new BufferedOutputStream(fout);
                Writer writer = new OutputStreamWriter(bout, encoding)) {
            for (long i=0; i<size; ++i) {
                int c = r.nextInt(26);
                if (c == 0)
                    writer.write('}');
                else
                    writer.write('a' + c);
            }
        }
    }

    public static void doIt(File from, File to, Charset encoding) throws IOException {
        try (InputStream fin = new FileInputStream(from);
                BufferedInputStream bin = new BufferedInputStream(fin);
                Reader reader = new InputStreamReader(bin, encoding);
                OutputStream fout = new FileOutputStream(to);
                BufferedOutputStream bout = new BufferedOutputStream(fout);
                Writer writer = new OutputStreamWriter(bout, encoding)) {
            int c;
            while ((c = reader.read()) >= 0) {
                writer.write(c);
                if (c == '}')                 // insert the newline right after each closing brace
                    writer.write('\n');
            }
        }
    }

}

As you can see, no elaborate logic or excessive buffer sizes are used: what is used is simply buffering the streams closest to the hardware, the FileInputStream/FileOutputStream.

Durandal
  • For those who are interested in details on this vs. the original approach... My best modified original with nothing from Durandal's answer got 661K characters per second (c/s). Durandal's unmodified approach got 9.76 million c/s. Durandal's reader/writer approach with an explicit if-else logic structure got 11.17 million c/s. Durandal's logic but with my reader/writer setup got 716K c/s. – Jeutnarg Jun 17 '16 at 17:29
  • @Jeutnarg You may be able to squeeze out some additional performance at moderate cost by increasing the buffer sizes on the buffered streams. I used the single-argument constructors in the code above (which results in the default buffer size of 8 KB). You could try the two-argument constructors, explicitly specifying a larger buffer (e.g. 1024 << 10 = 1 MB). Depending on the type of disk this might get you a few percent extra throughput, basically for free. – Durandal Jun 20 '16 at 16:36
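
To illustrate Durandal's suggestion, here is a minimal sketch of the same doIt() logic with only the two explicit buffer-size arguments added; the 1 MB figure is the example from the comment, not a measured optimum, and the class name is a placeholder.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

public class DoItWithBiggerBuffers {

    public static void doIt(File from, File to, Charset encoding) throws IOException {
        final int BUF = 1024 << 10; // 1 MB instead of the 8 KB default
        try (Reader reader = new InputStreamReader(
                        new BufferedInputStream(new FileInputStream(from), BUF), encoding);
                Writer writer = new OutputStreamWriter(
                        new BufferedOutputStream(new FileOutputStream(to), BUF), encoding)) {
            int c;
            while ((c = reader.read()) >= 0) {
                writer.write(c);
                if (c == '}')
                    writer.write('\n');
            }
        }
    }
}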