
Been looking around for a little while now and I'm a bit confused on this issue. I want to be able to take an input stream and read it concurrently in segments. The segments don't interact with each other; they are just values to be inserted or updated in a database from an uploaded file. Is it possible to read an input stream concurrently by setting a segment size and then just skipping forward before spinning off a new thread to handle the conversion and insert/update?

Essentially the file is a list of IDs (one ID per line), although it would be preferable if I could specify a separator. Some files can be huge, so I would like to process and convert the data in segments so that after inserting/updating to the database the JVM memory can be freed up. Is this possible? And if so, are there any libraries out there that do this already?

Cheers and thanks in advance,

Alexei Blue.


3 Answers


A good approach might instead be to have a single reader that reads chunks and then hands each chunk off to a worker thread from a thread pool. Given that these will be inserted into a database, the inserts will be by far the slowest part compared to reading the input, so a single thread should suffice for the reading.

Below is an example that hands off processing of each line from System.in to a worker thread. Database insert performance is much better if you perform a large number of inserts within a single transaction, so passing in a group of, say, 1000 lines would be better than passing in a single line as in the example.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Main {
    public static class Worker implements Runnable {
        private final String line;

        public Worker(String line) {
            this.line = line;
        }

        @Override
        public void run() {
            // Process line here.
            System.out.println("Processing line: " + line);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create worker thread pool.
        ExecutorService service = Executors.newFixedThreadPool(4);

        BufferedReader buffer = new BufferedReader(new InputStreamReader(System.in));
        String line;

        // Read each line and hand it off to a worker thread for processing.
        while ((line = buffer.readLine()) != null) {
            service.execute(new Worker(line));
        }

        // Let queued tasks finish and allow the JVM to exit once they are done.
        service.shutdown();
    }
}
Ed Plese
  • Hi Ed thanks for the example ^.^ So if I read 1000 lines into a StringBuffer and then pass this off to a worker thread to be processed and inserted/updated in the database do you think this would be a good approach? :) – Alexei Blue Apr 23 '13 at 09:11
  • It'd probably be best to read the 1000 lines into a `List` or a `String[]`. If you read them into a `StringBuffer` then it'd be a single string and you'd need to parse out the individual lines a second time. – Ed Plese Apr 23 '13 at 11:15
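Putting the comments together, below is a minimal sketch of the batched variant, assuming a batch of 1000 lines is handed to each worker; the BatchWorker class and BATCH_SIZE constant are illustrative names rather than anything from the answer.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BatchedMain {
    // Illustrative batch size; tune it to match your transaction size.
    private static final int BATCH_SIZE = 1000;

    public static class BatchWorker implements Runnable {
        private final List<String> lines;

        public BatchWorker(List<String> lines) {
            this.lines = lines;
        }

        @Override
        public void run() {
            // Insert/update the whole batch in a single database transaction here.
            System.out.println("Processing batch of " + lines.size() + " lines");
        }
    }

    public static void main(String[] args) throws IOException {
        ExecutorService service = Executors.newFixedThreadPool(4);
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));

        List<String> batch = new ArrayList<>(BATCH_SIZE);
        String line;
        while ((line = reader.readLine()) != null) {
            batch.add(line);
            if (batch.size() == BATCH_SIZE) {
                service.execute(new BatchWorker(batch));
                batch = new ArrayList<>(BATCH_SIZE);
            }
        }
        // Hand off any remaining lines.
        if (!batch.isEmpty()) {
            service.execute(new BatchWorker(batch));
        }
        service.shutdown();
    }
}

Handing off the filled list and creating a fresh one for the next batch keeps each list owned by exactly one worker, so no synchronization is needed on the batches themselves.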

First of all, to read the file concurrently starting from different offsets you need random access to the file, meaning the ability to read from any position. Java allows this with RandomAccessFile in java.io or with SeekableByteChannel in java.nio:

Best Way to Write Bytes in the Middle of a File in Java

http://docs.oracle.com/javase/tutorial/essential/io/rafs.html

I think for speed reasons you will prefer java.nio: Java NIO FileChannel versus FileOutputstream performance / usefulness

Now you know how to read from any position, but you need to do this concurrently. It's not possible with the same file access object because it holds the current position in the file. Thus you need as many file access objects as threads. Since you are reading, not writing, that should be OK.
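As a rough illustration of that point (not code from the answer), the sketch below gives each thread its own RandomAccessFile and its own byte range; the file name, thread count, and line-boundary handling are assumptions made for the example.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SegmentedReadSketch {
    public static void main(String[] args) throws InterruptedException {
        final String path = "ids.txt";                    // hypothetical input file, one ID per line
        final int threadCount = 4;
        final long fileLength = new File(path).length();
        final long segmentSize = (fileLength + threadCount - 1) / threadCount;

        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final long start = i * segmentSize;
            final long end = Math.min(start + segmentSize, fileLength);
            workers[i] = new Thread(() -> {
                // Each thread opens its own RandomAccessFile so it keeps its own file position.
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(start);
                    if (start > 0) {
                        raf.readLine();                   // drop the partial line straddling 'start'
                    }
                    String line;
                    // Read whole lines until this segment's end has been passed.
                    while (raf.getFilePointer() <= end && (line = raf.readLine()) != null) {
                        System.out.println(Thread.currentThread().getName() + ": " + line);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }
}

Note that RandomAccessFile.readLine() is unbuffered and decodes bytes as Latin-1, which is another reason the single sequential reader recommended below tends to win in practice.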

Now you know how to read the same file concurrently from many different offsets.

But think about the performance. Whatever the number of threads, you have only ONE disk drive, and random reads (many threads accessing the same file) are much, much slower than sequential reads (one thread reading one file). Even with RAID 0 or 1 it does not matter; sequential reading is always much faster. So in your case I would advise you to read the file in one thread and supply the other threads with data from that reading thread.

Vitaly

I don't think you can read an InputStream concurrently. That is why the contract defines read, reset, and mark - the idea is that the stream keeps track internally of what has been read and what has not.

If you're reading a file, just open multiple streams. You could use the skip() method to move the marker ahead for other threads to avoid duplicate line processing. BufferedReader may help some too, as it offers reading line by line.
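A minimal sketch of that skip-ahead approach, assuming a local file of ASCII IDs terminated by '\n' so that characters and bytes line up; the readSegment helper, file name, and offsets are illustrative, not from the answer.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SkipReadSketch {
    // Each thread opens its own stream, skips to its segment, drops the partial
    // line at the boundary, and reads lines until its segment is exhausted.
    static void readSegment(String path, long start, long length) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), StandardCharsets.US_ASCII))) {
            reader.skip(start);                       // characters == bytes for ASCII input
            long consumed = 0;
            String line;
            if (start > 0 && (line = reader.readLine()) != null) {
                consumed += line.length() + 1;        // discard the line straddling 'start'
            }
            while (consumed <= length && (line = reader.readLine()) != null) {
                consumed += line.length() + 1;        // +1 for the assumed '\n' terminator
                System.out.println("ID: " + line);    // convert and insert/update here
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final String path = "ids.txt";                // hypothetical input file
        final long half = new File(path).length() / 2;

        Thread first = new Thread(() -> readQuietly(path, 0, half));
        Thread second = new Thread(() -> readQuietly(path, half, Long.MAX_VALUE));
        first.start();
        second.start();
        first.join();
        second.join();
    }

    private static void readQuietly(String path, long start, long length) {
        try {
            readSegment(path, start, length);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The byte counting only holds for single-byte encodings, which is why the single sequential reader suggested in the other answers is usually the simpler route.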

Scott Heaberlin
  • Yeah the buffered reader + skip is the way I'm currently doing it; it needs a bit more work, but I'm sure using a single sequential read and moving work to other threads will be a good improvement. Cheers for the links. – Alexei Blue Apr 23 '13 at 09:09