
What I have

I have a lot of CSV files like below:

Date;Something;
2014-03-31 15:00:01;xxx;
2015-02-01 13:20:01;xxx;
2014-03-03 17:00:03;xxx;
2014-03-03 17:00:04;xxx;

The second row is not a mistake - dates are "random", between 2014 and 2016. Fortunately, most dates are close together, like the last two rows, but the sequence above is a real example.

How many files and why parallel?

There are 5000 files per year. Each one is gzipped, so IO is not a problem. The CPU is mostly idle right now.

What I need

Rows from the above files, grouped by day, in separate files. I don't care about the order inside them.

What I was thinking about

Using Java parallel streams to read the files. But I don't know how I can write into multiple files in a thread-safe way. I found similar questions: Write to text file from multiple threads? and Threads and file writing, but I'm not sure they are the way to go.
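
Something like the sketch below is roughly the reading side I'm considering (the input directory name is just a placeholder); the open question is the marked spot where each line has to end up in a per-day file:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

public class ReadSide {
    public static void main(String[] args) throws Exception {
        // collect the gzipped CSV files (directory name is a placeholder)
        List<Path> inputs;
        try (Stream<Path> paths = Files.list(Paths.get("input"))) {
            inputs = paths.filter(p -> p.toString().endsWith(".gz"))
                          .collect(Collectors.toList());
        }

        // decompress and parse each file on the common fork-join pool
        inputs.parallelStream().forEach(ReadSide::process);
    }

    static void process(Path file) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            reader.lines()
                  .skip(1)                                // skip the "Date;Something;" header
                  .forEach(line -> {
                      String day = line.substring(0, 10); // e.g. "2014-03-31"
                      // ??? here each (day, line) pair must go to a per-day file, thread-safely
                  });
        } catch (Exception e) {
            throw new RuntimeException("Failed to read " + file, e);
        }
    }
}
```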

Piotr Stapp
  • I didn't understand why you would write into a single file from multiple threads. Do you need to produce several output files (one for each day group) per input file? Or do you want to aggregate across all input files? – Andrew Butenko Jul 10 '16 at 18:38
  • I want to produce a file for each day across all input files. So from a group of files with mixed content, I would like to get one file per day. – Piotr Stapp Jul 10 '16 at 18:39
  • why are you trying to optimize for performance already? How many terabytes of data do you have? – satnam Jul 10 '16 at 18:40
  • 1 TB of unzipped data – Piotr Stapp Jul 10 '16 at 18:41
  • is the input just one file or multiple files? – satnam Jul 10 '16 at 18:41
  • the input is multiple files. They are gzipped, so to me it seems natural to do this in parallel. – Piotr Stapp Jul 10 '16 at 18:42
  • Here's the problem. Unless there is some kind of crazy algorithm you are using, the CPU is not going to be the bottleneck. So, going in parallel is not going to help you. You are constrained on IO. Now, if the input is like 100k different files, parallel might help. If the input is like 10 files, I don't think multithreading is going to help anything. – satnam Jul 10 '16 at 18:44
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116938/discussion-between-satnam-and-piotr-stapp). – satnam Jul 10 '16 at 18:44
  • I would not be sure that this is IO bound. I built a similar engine (except for XML), and my output was highly compressible (95 to 97%), e.g. gzipping around 100MB/s would only write < 5MB/s, and that was CPU bound. In the end, I used a producer/consumer approach + blocking queue communication. Producers were assigned a group of files and consumers a group of dates (e.g. by consistent hashing). This makes all writes *for a given file* single threaded, which vastly eases caching / limiting of simultaneously open streams. This problem is a good fit for an Actor pattern too. – GPI Jul 11 '16 at 09:03

2 Answers


There are many possibilities:

  • pool of threads + IO approach

Pool of threads. Each thread reads one input file and writes to numerous output files (depending on the number of output files, maybe you can keep all of them open, maybe only some of them, due to the OS limit on the number of open files).

  • 2x pools of threads + Map<Integer, ArrayBlockingQueue> + IO approach

Pool of threads for reading files. Each thread parses the data and puts the results into an ArrayBlockingQueue instance, chosen by a hash function on the date. For each ArrayBlockingQueue there is a thread responsible for writing into the output files (a sketch of this option follows the list).

  • one thread + NIO approach (non-blocking)

One thread reading from numerous input files and writing to numerous output files using NIO.

  • mixes of the above
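
For illustration, a rough sketch of the second option, assuming the inputs have already been decompressed to plain CSV and that the input/ and out/ directory names are placeholders. Each reader thread routes a line to a queue picked by hashing its date, so every writer thread owns a disjoint set of days and can keep its writers private, with no locking on the write path:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class TwoPoolSplitter {
    static final int WRITERS = 4;
    static final String POISON = "__END__";   // sentinel telling a writer thread to stop

    public static void main(String[] args) throws Exception {
        Files.createDirectories(Paths.get("out"));
        List<BlockingQueue<String>> queues = new ArrayList<>();
        for (int i = 0; i < WRITERS; i++) queues.add(new ArrayBlockingQueue<>(10_000));

        // consumer pool: each thread owns one queue and therefore a disjoint set of days
        ExecutorService writerPool = Executors.newFixedThreadPool(WRITERS);
        for (BlockingQueue<String> q : queues) writerPool.submit(() -> drain(q));

        // producer pool: parse input files and route lines by hashing the date
        ExecutorService readerPool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (Stream<Path> files = Files.list(Paths.get("input"))) {
            files.forEach(file -> readerPool.submit(() -> route(file, queues)));
        }
        readerPool.shutdown();
        readerPool.awaitTermination(1, TimeUnit.HOURS);

        for (BlockingQueue<String> q : queues) q.put(POISON);         // stop the writers
        writerPool.shutdown();
        writerPool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void route(Path file, List<BlockingQueue<String>> queues) {
        try (Stream<String> lines = Files.lines(file)) {
            lines.skip(1).forEach(line -> {                           // skip the header row
                String day = line.substring(0, 10);
                try {
                    queues.get(Math.floorMod(day.hashCode(), queues.size())).put(line);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static void drain(BlockingQueue<String> queue) {
        Map<String, Writer> writers = new HashMap<>();                // only this thread touches these files
        try {
            for (String line = queue.take(); !POISON.equals(line); line = queue.take()) {
                String day = line.substring(0, 10);
                Writer w = writers.computeIfAbsent(day, TwoPoolSplitter::open);
                w.write(line);
                w.write(System.lineSeparator());
            }
            for (Writer w : writers.values()) w.close();
        } catch (InterruptedException | IOException e) {
            throw new RuntimeException(e);
        }
    }

    static Writer open(String day) {
        try {
            return Files.newBufferedWriter(Paths.get("out", day + ".csv"),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since a given day always hashes to the same queue, every output file is written by exactly one thread, which avoids synchronization on the writers entirely.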

The chosen solution depends on numerous factors:

  • is this functionality a performance bottleneck of your application?
  • is the simplest single threaded IO solution too slow?
  • how much time you want to/can invest in it
  • your OS and your hard disk
Adam Siemion
  • I don't have bottlenecks now - CPU is low, IO is low. A single thread is too slow, because I can make it parallel :). As for the last question, I don't know. – Piotr Stapp Jul 10 '16 at 18:53
  • @Piotr Stapp: this statement makes no sense at all. Software always consumes resources up to the point that at least one resource constrains it. The only way to accelerate the operation is to identify the current limiting resource and lift the constraint. – Holger Jul 11 '16 at 10:03
  • The CPU usage is low because only one core is being used at 100% – Piotr Stapp Jul 12 '16 at 13:52

My instinct is not to overthink things like these. My approach would be:

  1. Each input file is processed line by line on a pooled thread. When a new date is encountered, open a new file writer and put it into a map (the date being the key).
  2. Write each line you just read into the corresponding writer (access to the writer, the writing itself and adding an entry to the map have to be synchronized, of course); a sketch follows this list.
  3. Set up a barrier to close all output files as soon as all processing threads finish.
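
A minimal sketch of these three steps, assuming the files are already unpacked to plain CSV and that the input/ and out/ directory names are placeholders; computeIfAbsent covers the map update, the synchronized block covers the actual write, and awaitTermination plays the role of the barrier:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

public class SharedWriterSplitter {
    // one writer per date, shared by all processing threads
    static final ConcurrentMap<String, Writer> WRITERS = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        Files.createDirectories(Paths.get("out"));
        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try (Stream<Path> files = Files.list(Paths.get("input"))) {
            files.forEach(file -> pool.submit(() -> split(file)));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);   // the "barrier": wait for all workers
        for (Writer w : WRITERS.values()) {
            w.close();                              // step 3: close every output file
        }
    }

    static void split(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            lines.skip(1).forEach(line -> {                         // skip the header row
                String day = line.substring(0, 10);
                // step 1: lazily open one writer per date (computeIfAbsent is atomic)
                Writer w = WRITERS.computeIfAbsent(day, SharedWriterSplitter::open);
                // step 2: synchronize on the writer so concurrent appends do not interleave
                synchronized (w) {
                    try {
                        w.write(line);
                        w.write(System.lineSeparator());
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Writer open(String day) {
        try {
            return Files.newBufferedWriter(Paths.get("out", day + ".csv"),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```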

While simple to implement and debug, this approach is neither bulletproof nor does it offer optimal performance. The cons are:

  1. contention on writing to a file
  2. contention on mutating the map of file writers
  3. exhausting open file descriptors (we keep a file writer for each date)

If performance is a major concern, I would advocate a map-reduce approach where each file-processing thread produces a number of date-segregated files and another process then concatenates those files.
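
For illustration, a sketch of only the reduce (concatenation) phase, assuming each processing thread has already written its partial output as out/<day>.<some-id>.part; the naming scheme here is made up:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeParts {
    public static void main(String[] args) throws IOException {
        // group the partial files by day, taken from the "2014-03-31" prefix of the file name
        Map<String, List<Path>> byDay;
        try (Stream<Path> parts = Files.list(Paths.get("out"))) {
            byDay = parts.filter(p -> p.toString().endsWith(".part"))
                         .collect(Collectors.groupingBy(p -> p.getFileName().toString().substring(0, 10)));
        }

        // concatenate all parts of a day into one final file; order inside does not matter
        for (Map.Entry<String, List<Path>> e : byDay.entrySet()) {
            Path target = Paths.get("out", e.getKey() + ".csv");
            try (OutputStream out = Files.newOutputStream(target,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                for (Path part : e.getValue()) {
                    Files.copy(part, out);      // append the whole partial file
                    Files.delete(part);         // clean up the intermediate file
                }
            }
        }
    }
}
```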

Andrew Butenko