I have a massive 25GB CSV file. I know that there are ~500 Million records in the file.
I want to do some basic analysis with the data. Nothing too fancy.
I don't want to use Hadoop/Pig, not yet atleast.
I have written a java program to do my analysis concurrently. Here is what I am doing.
class MainClass {
public static void main(String[] args) {
long start = 1;
long increment = 10000000;
OpenFileAndDoStuff a = new OpenFileAndDoStuff[50];
for(int i=0;i<50;i++) {
a[i] = new OpenFileAndDoStuff("path/to/50GB/file.csv",start,start+increment-1);
a[i].start();
start += increment;
}
for(OpenFileAndDoStuff obj : a) {
obj.join();
}
//do aggregation
}
}
class OpenFileAndDoStuff extends Thread {
volatile HashMap<Integer, Integer> stuff = new HashMap<>();
BufferedReader _br;
long _end;
OpenFileAndDoStuff(String filename, long startline, long endline) throws IOException, FileNotFoundException {
_br = new BufferedReader(new FileReader(filename));
long counter=0;
//move the bufferedReader pointer to the startline specified
while(counter++ < start)
_br.readLine();
this._end = end;
}
void doStuff() {
//read from buffered reader until end of file or until the specified endline is reached and do stuff
}
public void run() {
doStuff();
}
public HashMap<Integer, Integer> getStuff() {
return stuff;
}
}
I thought doing this I could open 50 bufferedReaders, all reading 10 million lines chucks in parallel and once all of them are done doing their stuff, I'd aggregate them.
But, the problem I face is that even though I ask 50 threads to start, only two start at a time and can read from the file at a time.
Is there a way I can make all 50 of them open the file and read form it at the same time ? Why am I limited to only two readers at a time ?
The file is on a windows 8 machine and java is also on the same machine.
Any ideas ?