
I am working on an application which has to read and process ~29K files (~500GB) every day. The files will be in zipped format and available on an FTP server.

What I have done: I download the files from the FTP server, unzip them, and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a small number). I've written some code and tested it on ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258

However, the estimated processing time for ~29K files still seems very high.

What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files (~500GB) down to 3-4 hours.

Please note that each file has to be read line by line, and each line has to be written to a new file with some modifications (some information removed and some new information added).
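For reference, the per-file read-modify-write step described above can be sketched like this. The `modifyLine` logic here is purely hypothetical (the actual removed/added information isn't specified in the question); the point is the buffered line-by-line streaming, which keeps memory flat regardless of file size:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class FileTransformer {
    // Hypothetical stand-in for the real modification:
    // strip one field and append a new one.
    static String modifyLine(String line) {
        return line.replace("SECRET,", "") + ",processed";
    }

    // Stream the input file line by line into the output file,
    // applying the modification to each line.
    static void transform(File in, File out) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(in), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(out), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(modifyLine(line));
                writer.newLine();
            }
        }
    }
}
```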

Amarjeet

3 Answers


You should profile your application, see where the current bottleneck is, and fix that. Repeat until you reach your desired speed or cannot optimize further.

For example:

  • Maybe you unzip to disk. This is slow; do it in memory instead.
  • Maybe there is a load of garbage collection. See if you can re-use objects instead of allocating new ones.
  • Maybe the network is the bottleneck, etc.
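The first point above can be sketched with the standard `java.util.zip.ZipInputStream`: read each zip entry straight into a byte array, so nothing is ever written to a temp directory on disk. This is a minimal sketch and assumes the unzipped entries fit in memory:

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemoryUnzip {
    // Expand every entry of a zip stream into memory,
    // avoiding the write-to-disk / read-back round trip.
    static Map<String, byte[]> unzip(InputStream zipped) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                entries.put(entry.getName(), bos.toByteArray());
            }
        }
        return entries;
    }
}
```

The returned map can then be handed directly to the processing threads, e.g. by wrapping each `byte[]` in a `ByteArrayInputStream`.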

You can, for example, use VisualVM.

Rob Audenaerde

It's hard to provide one single solution for your issue, since you might simply have reached the hardware limit.

Some ideas:

  • You can parallelize the processing of the read information: hand batches of lines to worker threads (from a pool), each of which processes its batch sequentially.
  • Use java.nio instead of java.io; see: Java NIO FileChannel versus FileOutputstream performance / usefulness
  • Use a profiler
  • Or, instead of a profiler, simply write log messages and measure the duration of the various parts of your application
  • Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
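The first bullet above (batches of lines handed to a pool of workers) can be sketched with a fixed `ExecutorService`. The `toUpperCase` call is only a placeholder for the real per-line modification, and the batch size and thread count are values you would have to tune:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchProcessor {
    // Split the lines into fixed-size batches, submit each batch to a
    // thread pool, and have each worker process its batch sequentially.
    // Results are collected in submission order, so line order is preserved.
    static List<String> process(List<String> lines, int batchSize, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += batchSize) {
            List<String> batch = lines.subList(i, Math.min(i + batchSize, lines.size()));
            futures.add(pool.submit(() -> {
                List<String> out = new ArrayList<>();
                for (String line : batch) {
                    out.add(line.toUpperCase()); // placeholder for the real modification
                }
                return out;
            }));
        }
        List<String> result = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            result.addAll(f.get());
        }
        pool.shutdown();
        return result;
    }
}
```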

If you are interested in parallel computing, then try Apache Spark; it is designed to do exactly what you are looking for.

Lokesh Kumar P