
I am working on an application which has to read and process ~29K files (~500GB) every day. The files will be in zipped format and available on an FTP server.

What I have done: I download the files from the FTP server, unzip them, and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a small number). I've written some code and tested it on ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258

However, the estimated processing time for ~29K files still seems very high.

What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files (~500GB) down to 3-4 hours.

Please note that each file has to be read line by line, and each line has to be written to a new file with some modifications (some information removed and some new information added).
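For reference, the per-file read-modify-write step described above can be sketched like this. The `modifyLine` logic here is purely hypothetical (the actual removed/added information isn't specified in the question); the point is the buffered line-by-line streaming, which keeps memory flat regardless of file size:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class FileTransformer {
    // Hypothetical stand-in for the real modification:
    // strip one field and append a new one.
    static String modifyLine(String line) {
        return line.replace("SECRET,", "") + ",processed";
    }

    // Stream the input file line by line into the output file,
    // applying the modification to each line.
    static void transform(File in, File out) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream(in), StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(out), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(modifyLine(line));
                writer.newLine();
            }
        }
    }
}
```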

Amarjeet

3 Answers


You should profile your application, see where the current bottleneck is, and fix that. Repeat until you reach your desired speed or cannot optimize further.

For example:

  • Maybe you unzip to disk. This is slow; do it in memory instead.
  • Maybe there is a load of garbage collection. See if you can re-use objects instead of allocating new ones.
  • Maybe the network is the bottleneck, etc.
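The first point above can be sketched with the standard `java.util.zip.ZipInputStream`: read each zip entry straight into a byte array, so nothing is ever written to a temp directory on disk. This is a minimal sketch and assumes the unzipped entries fit in memory:

```java
import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class InMemoryUnzip {
    // Expand every entry of a zip stream into memory,
    // avoiding the write-to-disk / read-back round trip.
    static Map<String, byte[]> unzip(InputStream zipped) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
                entries.put(entry.getName(), bos.toByteArray());
            }
        }
        return entries;
    }
}
```

The returned map can then be handed directly to the processing threads, e.g. by wrapping each `byte[]` in a `ByteArrayInputStream`.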

You can, for example, use VisualVM.

Rob Audenaerde

It's hard to provide one single solution for your issue, since you might simply have reached the hardware limit.

Some ideas:

  • You can parallelize the processing of the read information: hand batches of lines to worker threads (from a pool), each of which processes its batch sequentially.
  • Use java.nio instead of java.io; see: Java NIO FileChannel versus FileOutputstream performance / usefulness
  • Use a profiler
  • Or, instead of a profiler, simply write log messages and measure the duration of the various parts of your application
  • Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
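The first bullet above (batches of lines handed to a pool of workers) can be sketched with a fixed `ExecutorService`. The `toUpperCase` call is only a placeholder for the real per-line modification, and the batch size and thread count are values you would have to tune:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchProcessor {
    // Split the lines into fixed-size batches, submit each batch to a
    // thread pool, and have each worker process its batch sequentially.
    // Results are collected in submission order, so line order is preserved.
    static List<String> process(List<String> lines, int batchSize, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += batchSize) {
            List<String> batch = lines.subList(i, Math.min(i + batchSize, lines.size()));
            futures.add(pool.submit(() -> {
                List<String> out = new ArrayList<>();
                for (String line : batch) {
                    out.add(line.toUpperCase()); // placeholder for the real modification
                }
                return out;
            }));
        }
        List<String> result = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            result.addAll(f.get());
        }
        pool.shutdown();
        return result;
    }
}
```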

If you are interested in parallel computing, then try Apache Spark; it is designed to do exactly what you are looking for.

Lokesh Kumar P