
I have a program that generates a lot of data and puts it in a queue to write, but the problem is that it's generating data faster than I'm currently writing it (causing it to max out memory and start to slow down). Order does not matter, as I plan to parse the file later.

I looked around a bit and found a few questions that helped me design my current process (but I still find it slow). Here's my code so far:

    //...background multi-threaded process keeps building the queue..
    // ("solutions" is the queue the background threads are filling)
    FileWriter writer = new FileWriter("foo.txt", true);
    BufferedWriter bufferedWriter = new BufferedWriter(writer);
    while (!solutions.isEmpty()) {
        String data = solutions.poll().data;
        bufferedWriter.newLine();
        bufferedWriter.write(data);
    }
    bufferedWriter.close();

I'm pretty new to programming, so I may be assessing this wrong (maybe it's a hardware issue, as I'm using EC2), but is there a way to very quickly dump the queue's results into a file? Or, if my approach is okay, can I improve it somehow? Since order does not matter, does it make more sense to write to multiple files on multiple drives? Will threading make it faster? Etc. I'm not exactly sure of the best approach, so any suggestions would be great. My goal is to save the results of the queue (sorry, no outputting to /dev/null :-) and keep memory consumption as low as possible for my app (I'm not 100% sure, but the queue fills up 15 GB, so I'm assuming it'll be a 15 GB+ file).

These are the questions I looked at:

  • Fastest way to write huge data in text file Java (realized I should use a buffered writer)
  • Concurrent file write in Java on Windows (made me see that maybe multi-threading writes wasn't a great idea)

  • I understand CPU speed > hard drive speed, so writing will probably always lose to processing; I'm just trying to figure out how to help hard drive speed get a bit closer to handling it. – Lostsoul Apr 09 '12 at 16:53
  • A lot depends on what your bottleneck is. I suspect that if you max out the bandwidth of your disk IO (which appears to be your question), you can max out your account as well (in terms of cost). I agree multi-threading the write won't help much. – Peter Lawrey Apr 09 '12 at 17:32
  • A rough calculation is that 15 GB will cost you $4 each time. – Peter Lawrey Apr 09 '12 at 17:33
  • @PeterLawrey There's no cost for ephemeral storage (it's included with the instance but not persistent); I have a little less than a TB available. – Lostsoul Apr 09 '12 at 17:35
  • If you are not concerned about cost, I would first see how long it takes to write a file as fast as possible, e.g. using `dd` from the command line, or you can use NIO in large blocks, e.g. 32-256 KB. – Peter Lawrey Apr 09 '12 at 18:01
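(For illustration only, here is a minimal sketch of the large-block NIO write Peter Lawrey mentions; the file name, the 256 KB block size, and the 1 GB total are arbitrary choices, not anything from the question.)

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class RawThroughputTest {
        public static void main(String[] args) throws IOException {
            // Fill one large direct buffer and write it repeatedly, to measure
            // how fast the disk can go when Java-side overhead is minimal.
            ByteBuffer block = ByteBuffer.allocateDirect(256 * 1024);
            while (block.hasRemaining()) {
                block.put((byte) 'x');
            }
            FileChannel channel = new FileOutputStream("throughput-test.bin").getChannel();
            try {
                for (int i = 0; i < 4096; i++) { // 4096 * 256 KB = 1 GB
                    block.rewind();
                    while (block.hasRemaining()) {
                        channel.write(block); // write() may consume the buffer over several calls
                    }
                }
            } finally {
                channel.close();
            }
        }
    }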

4 Answers

Looking at that code, one thing that springs to mind is character encoding. You're writing strings, but ultimately, it's bytes that go to the streams. A writer does the character-to-byte encoding under the hood, and it does it in the same thread that is handling the writing. That may mean time is being spent on encoding that delays the writes, which could reduce the rate at which data is written.

A simple change would be to use a queue of byte[] instead of String, do the encoding in the threads which push onto the queue, and have the IO code use a BufferedOutputStream rather than a BufferedWriter.

This may also reduce memory consumption, if the encoded text takes up less than two bytes per character on average. For latin text and UTF-8 encoding, this will usually be true.
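A rough sketch of that byte[]-queue change (the class and queue names here are mine, assuming a BlockingQueue<byte[]> that the producer threads fill with already-encoded lines):

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;

    public class ByteQueueDrainer {
        // Producers encode once, on their own threads, e.g.:
        //   queue.put((data + "\n").getBytes("UTF-8"));

        public static void drain(BlockingQueue<byte[]> queue) throws IOException {
            BufferedOutputStream out =
                    new BufferedOutputStream(new FileOutputStream("foo.txt", true));
            try {
                byte[] line;
                while ((line = queue.poll()) != null) {
                    out.write(line); // raw bytes: no per-character encoding on this thread
                }
            } finally {
                out.close();
            }
        }
    }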

However, I suspect it's likely that you're simply generating data faster than your IO subsystem can handle it. You will need to make your IO subsystem faster, either by using a quicker one (if you're on EC2, perhaps renting a faster instance, or writing to a different backend: SQS vs EBS vs local disk, etc.), or by ganging several IO subsystems together in parallel somehow.

– Tom Anderson

Yes, writing multiple files on multiple drives should help, and if nothing else is writing to those drives at the same time, performance should scale linearly with the number of drives until I/O is no longer the bottleneck. You could also try a couple other optimizations to boost performance even more.
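As a sketch of the multiple-drive idea: one writer thread per drive, all draining the same queue. The mount points, the queue type, and the class name here are assumptions for illustration, not code from the question.

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.concurrent.BlockingQueue;

    public class MultiDriveWriter {
        // One writer thread per drive; the paths are hypothetical mount points.
        public static void startWriters(final BlockingQueue<String> queue) {
            String[] files = { "/mnt/drive1/out.txt", "/mnt/drive2/out.txt" };
            for (final String file : files) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            BufferedWriter writer = new BufferedWriter(new FileWriter(file, true));
                            try {
                                String data;
                                while ((data = queue.poll()) != null) {
                                    writer.write(data);
                                    writer.newLine();
                                }
                            } finally {
                                writer.close();
                            }
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                }).start();
            }
        }
    }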

If you're generating huge files and the disk simply can't keep up, you can use a GZIPOutputStream to shrink the output, which in turn reduces the amount of disk I/O. For non-random text, you can usually expect a compression ratio of at least 2x-10x.

    //...background multi-threaded process keeps building the queue..
    OutputStream out = new FileOutputStream("foo.txt.gz", true); // .gz: the output is compressed
    OutputStreamWriter writer = new OutputStreamWriter(new GZIPOutputStream(out));
    BufferedWriter bufferedWriter = new BufferedWriter(writer);
    while (!solutions.isEmpty()) {
        String data = solutions.poll().data;
        bufferedWriter.newLine();
        bufferedWriter.write(data);
    }
    bufferedWriter.close(); // closing the writer also finishes the GZIP stream

If you're outputting regular (i.e., repetitive) data, you might also want to consider switching to a different output format, for example a binary encoding of the data. Depending on the structure of your data, it might be more efficient to store it in a database. If you're outputting XML and really want to stick to XML, you should look into a Binary XML format, such as EXI or Fast Infoset.
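For instance, if each record happened to be an int plus a double (purely hypothetical field types, just to show the idea), a binary encoding via DataOutputStream would be much more compact than the equivalent decimal text:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class BinaryDump {
        public static void main(String[] args) throws IOException {
            DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("foo.bin")));
            try {
                // One record = 4-byte int + 8-byte double = 12 bytes,
                // versus 20+ characters for the same values as decimal text.
                out.writeInt(42);
                out.writeDouble(3.14159);
            } finally {
                out.close();
            }
        }
    }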

– rob

I guess that as long as you produce your data from calculations and do not load it from another data source, writing will always be slower than generating the data.

You can try writing your data to multiple files (not to the same file, to avoid synchronization problems) from multiple threads, but I guess that will not fix your problem.

Is it possible for you to pause your calculations, wait for the writing part of your application to catch up, and then continue?
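(One way to get that behaviour automatically, purely as an illustration: make the queue a bounded BlockingQueue and have the producers call put(), which blocks while the queue is full, so generation slows to the writer's pace. The class name and capacity below are arbitrary examples.)

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class BoundedQueueSketch {
        // A bounded capacity caps memory use; 100000 is just an example figure.
        static BlockingQueue<String> queue = new ArrayBlockingQueue<String>(100000);

        static void produce(String data) throws InterruptedException {
            queue.put(data); // blocks while the queue is full, throttling the producers
        }
    }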

Another thing to check: do you actually empty your queue? Does solutions.poll() really remove elements from your solutions queue?

– zip

Writing to different files using multiple threads is a good idea. Also, you should look into setting the BufferedWriter's buffer size, which you can do from the constructor. Try initializing it with a 10 MB buffer and see if that helps.
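For example (a snippet of mine, not from the answer; the 10 MB figure is just the suggestion above):

    // The second constructor argument sets the buffer size explicitly.
    BufferedWriter writer = new BufferedWriter(new FileWriter("foo.txt", true),
                                               10 * 1024 * 1024); // 10 MB buffer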

– ControlAltDel
  • Is it? Writing two files in parallel to the same mechanical HDD will take a lot longer than writing first one, then the other. – Roman Starkov Mar 20 '13 at 16:11