6

I am processing a number of text files line by line using BufferedReader.readLine().

Two files have the same size of 130MB, but one takes 40 seconds to process while the other takes 75 seconds.

I noticed that one file has 1.8 million lines while the other has 2.1 million. But when I tried to process a file with 3.0 million lines of the same size, it took 30 minutes.

So my question is:

  1. Is this behavior because of the seek time of BufferedReader? (I want to know how BufferedReader works, i.e. how it parses a file line by line.)

  2. Is there any way I can read the file line by line faster?

Ok friends, I am providing some more details.

I am splitting each line into three parts using a regex, then, using SSTableSimpleUnsortedWriter (provided by Cassandra), I am writing them to a file as key, column and value. After 16MB of data is processed, it flushes to disk.

But the processing logic is the same for all the files, and even one file of size 330MB with fewer lines (around 1 million) gets processed in 30 seconds. What could be the reason?

deviceWriter = new SSTableSimpleUnsortedWriter(
        directory,
        keyspace,
        "Devices",
        UTF8Type.instance,
        null,
        16);

Pattern pattern = Pattern.compile("[\\[,\\]]");
while ((line = br.readLine()) != null)
{
    // split the line into row key, column name and value
    long timestamp = System.currentTimeMillis() * 1000;
    deviceWriter.newRow(bytes(rowKey));
    deviceWriter.addColumn(bytes(colmName), bytes(value), timestamp);
}
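
The split step elided above looks roughly like this (a sketch, assuming each line has the form [rowKey,colmName,value] so the pattern yields the three parts):

// Sketch of the elided split, assuming lines have the form [rowKey,colmName,value]
String[] parts = pattern.split(line);   // parts[0] is the empty token before the leading '['
String rowKey   = parts[1];
String colmName = parts[2];
String value    = parts[3];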

I have changed -Xmx256M to -Xmx1024M, but it is not helping.

Update: From what I observe, I am writing into a buffer (in physical memory), and as the number of writes into the buffer increases, the newer writes take longer. (This is my guess.)

Please reply.

bluish
samarth

4 Answers

5

The only thing BufferedReader does is read from the underlying Reader into an internal char[] buffer with a default size of 8K, and all methods work on that buffer until it's exhausted, at which point another 8K (or whatever) is read from the underlying Reader. The readLine() is sort of tacked on.
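
For reference, correct use looks something like this (a minimal sketch; the file name is a placeholder and 8192 is just the default buffer size made explicit):

// Minimal sketch of the usual line-by-line pattern with the default 8K buffer made explicit
BufferedReader br = new BufferedReader(new FileReader("input.txt"), 8192);
String line;
while ((line = br.readLine()) != null) {
    // readLine() serves lines from the internal char[] buffer,
    // refilling it from the underlying Reader only when it runs dry
}
br.close();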

Correct use of BufferedReader should definitely not result in the running time rising from 40sec at 1.8m lines to 30 minutes at 3m lines. There must be something wrong with your code. Show it to us.

Another possibility is that your JVM does not have enough heap memory and spends most of the 30 minutes doing garbage collection because its heap is 99% full, and you'd eventually get an OutOfMemoryError with larger input. What are you doing with the lines you have processed? Are they kept in memory? Does running the program with the -Xmx1024M command-line option make a difference?

Michael Borgwardt
  • hey thanks...provided some more details on my issue please go through – samarth Aug 25 '11 at 08:48
  • @samarth: I don't see anything wrong with the code you've posted. The easiest solution may be to do some simple profiling with VisualVM. That should tell you where all the time is spent, which will probably lead you directly to the cause of the problem. – Michael Borgwardt Aug 25 '11 at 09:00
1

BufferedReader will not seek; it simply caches chars until a newline is found and returns the line as a String, discarding (reusing) the buffer after each line. That's why you can use it with any stream or other reader, even those that do not support seeking.
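
For example (a small sketch; System.in is just a convenient non-seekable stream):

// BufferedReader works over any Reader, even non-seekable streams such as System.in
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String firstLine = in.readLine();   // chars are buffered until a newline is found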

So the number of lines alone should not create such a big difference at the reader level. A very long line could, however, create a very big String and require the allocation of a lot of RAM, but that does not seem to be your case (it would probably throw an OutOfMemoryError for exceeding the GC overhead limit or similar).

From what I can see in your code, you are not doing anything wrong. I suppose you are hitting some kind of limit; since it does not seem to be RAM, maybe it has something to do with a hard limit on the Cassandra side? Have you tried commenting out the part that writes to Cassandra, just to see whether the problem is on your side or on the Cassandra side?
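
Something along these lines would tell you (a sketch reusing the variables from your snippet; the timing is just for illustration):

// Rough isolation test: time the read + split alone, with the Cassandra write disabled
long start = System.currentTimeMillis();
while ((line = br.readLine()) != null)
{
    String[] parts = pattern.split(line);
    // deviceWriter.addColumn(...);   // temporarily commented out to isolate the reader
}
System.out.println("read + split took " + (System.currentTimeMillis() - start) + " ms");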

Simone Gianni
1

Look into NIO buffers, as they are more optimized than BufferedReader.

Here is a code snippet from another forum: http://www.velocityreviews.com/forums/t719006-bufferedreader-vs-nio-buffer.html

FileChannel fc = new FileInputStream("File.txt").getChannel();
ByteBuffer buffer = ByteBuffer.allocate(1024);
fc.read(buffer);   // fills the buffer with up to 1024 bytes from the channel
buffer.flip();     // switch the buffer to read mode before consuming the bytes

Edit: Also look into this thread: Read large files in Java

Community
1

The BufferedReader is probably not the root of your performance problem.

Based on the numbers you cite, it sounds like you have some quadratic complexity in your code. For example, for every line you read, you are re-examining every line you've read previously. I'm just speculating here, but a common example of the problem would be using a list data structure, and checking to see if the new line matches any previous lines.
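
For illustration, that kind of code looks roughly like this (a guess at the pattern, not your actual code):

// Quadratic anti-pattern: contains() scans the whole list,
// so processing n lines costs O(n^2) comparisons overall
List<String> seen = new ArrayList<String>();
while ((line = br.readLine()) != null)
{
    if (!seen.contains(line))
    {
        seen.add(line);
    }
}
// A HashSet makes the membership test O(1) on average, so the loop becomes O(n)
Set<String> seenSet = new HashSet<String>();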

erickson