
I have a massive 25GB CSV file. I know that there are ~500 million records in the file.

I want to do some basic analysis with the data. Nothing too fancy.

I don't want to use Hadoop/Pig, not yet at least.

I have written a Java program to do my analysis concurrently. Here is what I am doing.

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

class MainClass {
 public static void main(String[] args) throws Exception {
  long start = 1;
  long increment = 10000000;
  OpenFileAndDoStuff[] a = new OpenFileAndDoStuff[50];
  for(int i=0;i<50;i++) {
    a[i] = new OpenFileAndDoStuff("path/to/50GB/file.csv",start,start+increment-1);
    a[i].start();
    start += increment;
  } 
  for(OpenFileAndDoStuff obj : a) {
     obj.join();
  }
  //do aggregation 
 }
}

class OpenFileAndDoStuff extends Thread {
  volatile HashMap<Integer, Integer> stuff = new HashMap<>();
  BufferedReader _br;
  long _end;
  OpenFileAndDoStuff(String filename, long startline, long endline) throws IOException, FileNotFoundException {
    _br = new BufferedReader(new FileReader(filename));
    long counter=0;
    //move the bufferedReader pointer to the startline specified
    while(counter++ < startline)
     _br.readLine();
    this._end = endline;
  }
  void doStuff() {
    //read from buffered reader until end of file or until the specified endline is reached and do stuff
  }
  public void run() {
    doStuff();
  }
  public HashMap<Integer, Integer> getStuff() {
    return stuff;
  } 
}

I thought that by doing this I could open 50 BufferedReaders, all reading 10-million-line chunks in parallel, and once all of them were done doing their stuff, I'd aggregate the results.

But the problem I face is that even though I ask 50 threads to start, only two of them actually start and read from the file at any given time.

Is there a way I can make all 50 of them open the file and read from it at the same time? Why am I limited to only two readers at a time?

The file is on a Windows 8 machine, and Java is running on the same machine.

Any ideas?

anu
  • @etherous I already am trying and doing it, but I can't get more than two bufferedReaders to read from the file at the same time. See the description on how I am doing it. Your comment is as vague as someone telling me that there exists a perfectly round stone somewhere in the universe. – anu May 30 '14 at 22:44
  • You'll only get so far without using NIO in this case. I would use FileChannel. Would you like an example? Edit: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html – etherous May 30 '14 at 22:49
  • @etherous Sure. I am looking it up on oracle docs right now, but an example would be great. But, this is exactly what I was looking for. Thanks for pointing to the right direction! – anu May 30 '14 at 22:52
  • The link I gave is very close to what you might want. The only modification you should need to do is create multiple instances and partition the file for each instance – etherous May 30 '14 at 22:59
  • It appears that this file is effectively working as a linked list. So the thread you tell to work on lines 4 million - 5 million has to actually read the first 4 million lines also. I do not believe that you are going to be able to achieve high levels of concurrency unless you can get "indexed" access into the file. Then you can open the file using either a FileChannel or a RandomAccessFile and jump specifically to the portion of the file that thread should work on. – Brett Okken May 30 '14 at 23:48

2 Answers


Here is a similar post: Concurrent reading of a File (java preffered)

The most important question here is: what is the bottleneck in your case?

If the bottleneck is your disk IO, then there isn't much you can do on the software side. Parallelizing the computation will only make things worse, because reading the file from different parts simultaneously will degrade disk performance.

If the bottleneck is processing power, and you have multiple CPU cores, then you can take advantage of starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit for the number of open files). You could separate the work into tasks and run them in parallel.

See the linked post for an example that reads a single file in parallel with FileInputStream, which should be significantly faster than using BufferedReader according to these benchmarks: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly#FileReaderandBufferedReader
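
If CPU really is the limiting factor, here is a minimal sketch of the idea (not the code from the question: the class name ParallelCsvRead, the method processRange, the placeholder path, and the per-line "analysis" are all illustrative, and it assumes a single-byte text encoding). Each task opens its own RandomAccessFile, seeks to the start of its byte range, finishes the line that straddles the boundary, and then processes lines until it passes the end of its range; main merges the per-task maps afterwards.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCsvRead {
  public static void main(String[] args) throws Exception {
    String path = "path/to/file.csv";                // placeholder path
    int tasks = Runtime.getRuntime().availableProcessors();
    long length = new File(path).length();
    long chunk = length / tasks;

    ExecutorService pool = Executors.newFixedThreadPool(tasks);
    List<Future<Map<Integer, Integer>>> results = new ArrayList<>();
    for (int i = 0; i < tasks; i++) {
      long start = i * chunk;
      long end = (i == tasks - 1) ? length : start + chunk;
      results.add(pool.submit(() -> processRange(path, start, end)));
    }

    // merge the per-task maps into one result
    Map<Integer, Integer> total = new HashMap<>();
    for (Future<Map<Integer, Integer>> f : results) {
      f.get().forEach((k, v) -> total.merge(k, v, Integer::sum));
    }
    pool.shutdown();
    System.out.println(total.size() + " distinct keys");
  }

  // Processes every line whose first byte falls inside [start, end).
  static Map<Integer, Integer> processRange(String path, long start, long end) throws IOException {
    Map<Integer, Integer> counts = new HashMap<>();
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      raf.seek(start == 0 ? 0 : start - 1);
      if (start > 0) {
        raf.readLine();                              // finish the line that straddles the boundary
      }
      String line;
      while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
        int key = line.length();                     // placeholder per-line "analysis"
        counts.merge(key, 1, Integer::sum);
      }
    }
    return counts;
  }
}

Note that RandomAccessFile.readLine() reads one byte at a time, so for real throughput you would want to wrap each range in a buffered stream; the point of the sketch is only the byte-range partitioning, which avoids every thread having to re-read the file from line 1.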

bstar55

One issue I see is that when a thread is asked to read, for example, lines 80,000,000 through 90,000,000, it still has to read in the first 80,000,000 lines (and ignore them).

Maybe try java.io.RandomAccessFile.

In order to do this, you need all of the lines to be the same number of bytes. If you cannot adjust the structure of your file, then this would not be an option. But if you can, it should allow for greater concurrency.
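
For illustration, here is a minimal sketch of that idea (the class name, RECORD_LENGTH, and readRecords are hypothetical). Because every record occupies exactly the same number of bytes, a thread can compute the byte offset of its first record and seek straight to it, with no need to read the earlier lines at all:

import java.io.IOException;
import java.io.RandomAccessFile;

class FixedLengthRecordReader {
  static final int RECORD_LENGTH = 64;   // hypothetical fixed size of one line, terminator included

  // Reads records firstRecord (inclusive) to lastRecord (exclusive) from the file.
  static void readRecords(String path, long firstRecord, long lastRecord) throws IOException {
    byte[] buffer = new byte[RECORD_LENGTH];
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      raf.seek(firstRecord * RECORD_LENGTH);   // jump straight to this thread's portion
      for (long r = firstRecord; r < lastRecord; r++) {
        raf.readFully(buffer);
        String line = new String(buffer).trim();   // strip padding / line terminator
        // do stuff with this record
      }
    }
  }
}

Each of the 50 threads in the question could then call something like readRecords with its own disjoint record range and build its own HashMap, exactly as in the original design, but without re-reading the start of the file.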

Thorn