I need to read a huge file (15+GB) and perform some minor modifications (add some newlines so a different parser can actually work with it). You might think that there are already answers for doing this normally, but my entire file is on one line.
My general approach so far is very basic:
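    // X is the buffer size in chars; writer and originalFileSize are set up elsewhere.
    char[] buffer = new char[X];
    BufferedReader reader = new BufferedReader(new ReaderUTF8(new FileInputStream(new File("myFileName"))), X);
    // Worst case: every char is '}', so the output can be twice the input.
    char[] bufferOut = new char[2 * X];
    int bytesRead = -1; // note: read(char[]) returns a count of chars, not bytes
    int i = 0;
    int offset = 0; // number of newlines inserted into the current block
    long totalBytesRead = 0;
    int countToPrint = 0;
    while ((bytesRead = reader.read(buffer)) >= 0) {
        for (i = 0; i < bytesRead; i++) {
            if (buffer[i] == '}') {
                // Copy the brace, then add a newline right after it.
                bufferOut[i + offset] = '}';
                offset++;
                bufferOut[i + offset] = '\n';
            } else {
                bufferOut[i + offset] = buffer[i];
            }
        }
        writer.write(bufferOut, 0, bytesRead + offset);
        offset = 0;
        totalBytesRead += bytesRead;
        countToPrint += 1;
        // Report progress every 10 blocks.
        if (countToPrint == 10) {
            countToPrint = 0;
            System.out.println("Read " + ((double) totalBytesRead / originalFileSize * 100) + " percent.");
        }
    }
    writer.flush();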
After some experimentation, I've found that a value of X larger than a million gives optimal speed - it looks like I'm getting about 2% every 10 minutes, while a value of X of ~60,000 only got 60% in 15 hours. Profiling reveals that I'm spending 96+% of my time in the read() method, so that's definitely my bottleneck. As of writing this, my 8 million X version has finished 32% of the file after 2 hours and 40 minutes, in case you want to know how it performs long-term.
Is there a better approach for dealing with such a large, one-line file? As in, is there a faster way of reading this type of file that gives me a relatively easy way of inserting the newline characters?
I am aware that different languages or programs could probably handle this gracefully, but I'm limiting this to a Java perspective.
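For comparison, here is a minimal sketch of the same transformation done on raw bytes instead of decoded chars. It assumes the file is UTF-8: there, '}' is the single byte 0x7D and never occurs inside a multi-byte sequence, so no character decoding is needed to find it. The class name BraceNewlineInserter, the method insertNewlines, and the 1 MiB buffer size are hypothetical choices for illustration, not something I have measured against the code above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class BraceNewlineInserter {

    /** Copies in to out, writing '\n' after every '}' byte; returns the number of bytes read. */
    public static long insertNewlines(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] inBuf = new byte[bufferSize];
        // Worst case: every input byte is '}', so the output block can be twice as large.
        byte[] outBuf = new byte[2 * bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(inBuf)) != -1) {
            int o = 0; // next free position in outBuf
            for (int i = 0; i < n; i++) {
                byte b = inBuf[i];
                outBuf[o++] = b;
                if (b == '}') {
                    outBuf[o++] = '\n';
                }
            }
            out.write(outBuf, 0, o);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // args[0] = input path, args[1] = output path (both hypothetical).
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            long n = insertNewlines(in, out, 1 << 20);
            System.out.println("Read " + n + " bytes.");
        }
    }
}
```

Because the loop already reads and writes in large blocks, the streams are left unwrapped here; whether this is actually faster than the Reader-based version would have to be measured on the real file.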