
This is largely a combination of a design problem and a coding problem.

Use Case
- Given many log files ranging in size from 2 MB to 2 GB, I need to parse each of these logs, apply some processing, and generate Java POJOs.
- For this problem, let's assume that we have just one log file.
- Also, the idea is to make the best use of the system; multiple cores are available.

Alternative 1
- Open the file (synchronously), read each line, generate POJOs

FileActor -> read each line -> List<POJO>  

Pros: simple to understand
Cons: serial processing; does not take advantage of the multiple cores in the system. A minimal sketch of this approach follows.
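
For concreteness, here is a minimal sketch of this serial approach; LogEntry and parseLine(...) are hypothetical stand-ins for the POJO class and the per-line parsing logic:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Serial read: one thread opens the file and parses it line by line.
List<LogEntry> readAll(String path) throws IOException {
    List<LogEntry> entries = new ArrayList<>();
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
        String line;
        while ((line = reader.readLine()) != null) {
            entries.add(parseLine(line)); // hypothetical: builds one POJO per line
        }
    }
    return entries;
}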

Alternative 2
- Open the file (synchronously), read N lines at a time (N is configurable), and pass them on to different actors for processing

                                                    / LogLineProcessActor 1
FileActor -> LogLineProcessRouter (with 10 Actors) -- LogLineProcessActor 2
                                                    \ LogLineProcessActor 10

Pros: some parallelization, by using different actors to process batches of lines. The actors should make use of the available cores in the system (exactly how, I am not sure). A rough sketch of this wiring follows below.
Cons: still partly serial, because the file itself is read in a serial fashion
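
To make this wiring concrete, here is a rough sketch against the Akka 2.3-era Java API; LogEntry is the hypothetical POJO and parse(...) stands in for the real parsing logic:

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;
import akka.routing.RoundRobinPool;

// Hypothetical worker: turns one log line into a POJO.
class LogLineProcessActor extends UntypedActor {
    @Override
    public void onReceive(Object message) {
        if (message instanceof String) {
            LogEntry entry = parse((String) message);
            // hand the POJO to a collector actor, or aggregate it here
        } else {
            unhandled(message);
        }
    }

    private LogEntry parse(String line) {
        return new LogEntry(line); // hypothetical POJO constructor
    }
}

// Wiring: one round-robin router in front of 10 identical workers.
ActorSystem system = ActorSystem.create("log-processing");
ActorRef router = system.actorOf(
        new RoundRobinPool(10).props(Props.create(LogLineProcessActor.class)),
        "logLineProcessRouter");
// The FileActor then reads lines and calls: router.tell(line, getSelf());

The workers sit on Akka's default dispatcher, which schedules them across the available cores; that is where the parallelism comes from.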

Questions
- Is either of the above a good choice?
- Are there better alternatives?

Please share your thoughts.

Thanks a lot

daydreamer
  • I think [ParallelStreams](https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html) might be suitable for your problem. – Turing85 May 07 '15 at 18:57
  • Or even https://storm.apache.org/ if you're continually getting new files and want a really robust pipeline. – Dathan May 07 '15 at 19:04
  • This solution needs to be installed on customer machines so I am not sure if `Storm` is feasible. – daydreamer May 07 '15 at 19:07
  • If it's a log file to analyze, you could probably also make use of [logstash](http://logstash.net/) – makasprzak May 07 '15 at 19:56

2 Answers


Why not take advantage of what's already available, and use the parallel stream support that comes with JDK 1.8? I would start with something like this, and see how it performs:

import java.nio.file.Files;
import java.nio.file.Paths;

Files.lines(Paths.get( /* path to a log file */ )) // note: throws IOException
     .parallel() // make the stream work in parallel
     .map(YourBean::new) // or some mapping method to your bean class
     .forEach(/* process the beans here */);

You may need some tweaks to the thread pooling, because parallel() is by default executed on ForkJoinPool.commonPool(), and you can't really customize it to achieve maximum performance, but people seem to have found workarounds for that too; there is some material about the topic here.
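
One workaround people commonly mention (it relies on observed ForkJoin behavior rather than anything the Stream spec guarantees) is to start the terminal operation from inside your own ForkJoinPool, so the parallel stream runs on that pool instead of the common one. A sketch, where the file name and the pool size of 8 are placeholder choices:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Stream;

ForkJoinPool pool = new ForkJoinPool(8); // size it to your machine
pool.submit(() -> {
    try (Stream<String> lines = Files.lines(Paths.get("application.log"))) {
        lines.parallel()            // runs on 'pool', not commonPool()
             .map(YourBean::new)
             .forEach(bean -> { /* process here */ });
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}).join(); // wait for the whole file to be processed
pool.shutdown();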

Balázs Édes

Alternative 2 looks good; I would just change one thing. Read the biggest chunk of the file you can, because I/O will be a problem if you do it in small bursts. As there are several files, I would create an actor that gets the names of the files by reading a particular folder, and sends the path of each file to the LogLineReader. That actor reads a big chunk of the file and then sends each line to the LogLineProcessActor. Be aware that they may process the lines out of order; if that is not a problem, they will keep your CPU busy. A sketch of the reader side is below.
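
As an illustration of the big-chunk idea, the LogLineReader could wrap the file in a BufferedReader with a large buffer, so the disk is read in multi-megabyte bursts even though lines are still fanned out one at a time. In this sketch the 8 MB buffer size is an arbitrary choice, and 'router' is assumed to be the reference to the LogLineProcessActor pool held by the enclosing actor:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed to run inside the LogLineReader actor, so getSelf() is available.
void readFile(Path path) throws IOException {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8),
            8 * 1024 * 1024)) { // 8 MB buffer: one big burst per disk read
        String line;
        while ((line = reader.readLine()) != null) {
            router.tell(line, getSelf()); // workers may finish out of order
        }
    }
}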

If you feel adventurous, you could also try the new Akka Streams 1.0.

Carlos Vilchez