I have to analyze several log files, which involves retrieving the timestamp, URL, etc. from each line. I am using multithreading for this: each thread accesses a different log file and does the analysis. The program for doing it:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;

public class checkMultithreadedThroughput{
    public static void main(String args[]){
        ArrayList<String> fileNames = new ArrayList<>();
        fileNames.add("log1");
        fileNames.add("log2");
        fileNames.add("log3");
        fileNames.add("log4");
        fileNames.add("log5");
        fileNames.add("log6");
        fileNames.add("log7");
        fileNames.add("log8");
        fileNames.add("log9");
        // one worker thread per log file
        Thread[] threads = new Thread[fileNames.size()];
        try{
            for(int i=0; i<fileNames.size(); i++){
                threads[i] = new MultithreadedThroughput(fileNames.get(i));
                threads[i].start();
            }
        }catch(Exception e){
            e.printStackTrace();
        }
    }
}

class MultithreadedThroughput extends Thread{
    String filename = null;

    MultithreadedThroughput(String filename){
        this.filename = filename;
    }

    public void run(){
        calculateThroughput();
    }

    public void calculateThroughput(){
        String line = null;
        // try-with-resources so the reader is always closed
        try(BufferedReader br = new BufferedReader(new FileReader(filename))){
            while((line = br.readLine())!=null){
                //do the analysis on line
            }
        }catch(Exception e){
            e.printStackTrace();
        }
    }
}
Now in the MultithreadedThroughput class, which extends Thread, I am reading the file using a BufferedReader. The whole process takes around 15 minutes (each file is large, around 2 GB). I want to optimize the program so that it takes less time.
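As an aside, main above returns as soon as the threads are started; if the 15 minutes is meant as the wall-clock time until every file is analysed, the workers would have to be joined before measuring. A rough sketch of what I mean (runAndWait is just a made-up helper name, using the MultithreadedThroughput class from above):

import java.util.List;

class WaitForWorkers {
    // Starts one worker per file and blocks until all of them have finished,
    // so the measured time covers the whole analysis.
    static void runAndWait(List<String> fileNames) throws InterruptedException {
        Thread[] threads = new Thread[fileNames.size()];
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new MultithreadedThroughput(fileNames.get(i));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();   // wait for this worker to complete
        }
        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - startTime));
    }
}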
The solution I thought of: instead of starting a thread per log file, I would take one large log file at a time, split it into chunks (the number of chunks equal to the number of processors) and start a thread on each chunk, as in the sketch below. The other idea is to keep the same program as before but, instead of reading one line at a time, read multiple lines at a time and do the analysis on them. However, I don't know how to implement either of them. Please explain the solution.
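To make the first idea concrete, this is roughly what I imagine for the splitting step (just a sketch of my own, class and method names made up): compute one byte range per processor and slide each boundary forward to the next newline so that no log line is cut in half.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class LogChunker {
    // Computes [start, end) byte ranges, roughly one per processor, each
    // extended forward so that it ends just past a newline.
    static List<long[]> splitOnLineBoundaries(String filename) throws IOException {
        int chunks = Runtime.getRuntime().availableProcessors();
        List<long[]> ranges = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(filename, "r")) {
            long size = raf.length();
            long chunkSize = Math.max(1, size / chunks);
            long start = 0;
            while (start < size) {
                long end = Math.min(start + chunkSize, size);
                if (end < size) {
                    raf.seek(end);
                    int b;
                    // slide the boundary forward until just past the next '\n'
                    while ((b = raf.read()) != -1 && b != '\n') {
                        end++;
                    }
                    end++;                      // step past the newline (or EOF)
                    end = Math.min(end, size);
                }
                ranges.add(new long[]{start, end});
                start = end;
            }
        }
        return ranges;
    }
}

Each worker would then open its own stream, skip to its start offset, and stop once its reading position reaches end, so the chunks never overlap.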
In the calculateThroughput method I have to estimate the throughput of a URL per one-hour interval. If I break the files up based on the number of processors, a split may fall in the middle of an interval. For example, one interval runs from 06:00:00 to 07:00:00, and there are 24 such intervals (one day) in each log file. So if I split a large log file, the split may land inside an interval, and I don't know how to compute the throughput for that interval. That's the problem I am facing with splitting the file.
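The only way around this I can picture (again just a sketch, with hypothetical names like record and mergeAll) is to let each worker build its own partial count per hour interval and combine the partial maps after all workers finish, so an interval split across two chunks simply receives contributions from both partial maps:

import java.util.HashMap;
import java.util.Map;

class IntervalMerge {
    // Each worker keeps its own map: hour of day (0-23) -> request count,
    // so no locking is needed while the chunks are being read.
    static void record(Map<Integer, Long> partial, int hourOfDay) {
        partial.merge(hourOfDay, 1L, Long::sum);
    }

    // After all workers have finished (join()), combine their partial maps.
    // An hour interval that was split across two chunks gets counts from
    // both partial maps, so nothing is lost at the boundary.
    static Map<Integer, Long> mergeAll(Iterable<Map<Integer, Long>> partials) {
        Map<Integer, Long> total = new HashMap<>();
        for (Map<Integer, Long> p : partials) {
            p.forEach((hour, count) -> total.merge(hour, count, Long::sum));
        }
        return total;
    }
}

Is this merge-after-join approach the right way to handle an interval that straddles a chunk boundary, or is there a better way?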