I have to analyze different log files, which involves retrieving the time-stamp, URL, etc. I am using multithreading for this: each thread accesses a different log file and does the task. The program for doing it:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class CheckMultithreadedThroughput {

    public static void main(String[] args) {
        ArrayList<String> fileNames = new ArrayList<>();
        fileNames.add("log1");
        fileNames.add("log2");
        fileNames.add("log3");
        fileNames.add("log4");
        fileNames.add("log5");
        fileNames.add("log6");
        fileNames.add("log7");
        fileNames.add("log8");
        fileNames.add("log9");

        Thread[] threads = new Thread[fileNames.size()];

        for (int i = 0; i < fileNames.size(); i++) {
            threads[i] = new MultithreadedThroughput(fileNames.get(i));
            threads[i].start();
        }
    }
}

class MultithreadedThroughput extends Thread {

    private final String filename;

    MultithreadedThroughput(String filename) {
        this.filename = filename;
    }

    @Override
    public void run() {
        calculateThroughput();
    }

    public void calculateThroughput() {
        // try-with-resources closes the reader even if the analysis throws
        try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = br.readLine()) != null) {
                // do the analysis on line
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Now, in the MultithreadedThroughput class, which extends Thread, I am reading the file using a BufferedReader. The whole process takes around 15 minutes (the files are big, around 2 GB each). I want to optimize the program so that it takes less time.

The solution I thought of: instead of starting threads on all the log files at once, take one large log file at a time, split it into chunks (the number of chunks equal to the number of processors) and start threads on those chunks; OR keep the same program as before but, instead of reading one line at a time, read multiple lines at a time and do the analysis on them. But I don't know how to implement either of them. Please explain the solution. Something like the sketch below is what I imagine for the splitting.
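Here is an untested sketch of the splitting idea (chunkOffsets is just a made-up helper; each interior boundary is pushed forward to the next newline so that no chunk starts mid-line):

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkBoundaries {

    // Divide the file into one byte range per core, then move each interior
    // boundary forward to the byte after the next '\n' so that no chunk
    // starts in the middle of a line.
    public static long[] chunkOffsets(String filename, int chunks) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(filename, "r")) {
            long length = raf.length();
            long[] offsets = new long[chunks + 1];
            offsets[chunks] = length;
            for (int i = 1; i < chunks; i++) {
                long pos = i * (length / chunks);
                raf.seek(pos);
                while (pos < length && raf.read() != '\n') {
                    pos++;
                }
                offsets[i] = Math.min(pos + 1, length);
            }
            return offsets; // chunk i is the byte range offsets[i]..offsets[i+1]
        }
    }
}

Each thread would then read only its own byte range, but this still leaves the interval problem described below.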

In the calculateThroughput method I have to estimate the throughput of a URL per one-hour interval. Suppose an interval starts at 06:00:00 and runs until 07:00:00; there will be 24 such intervals (one day) in each log file. If I break the files up by the number of processors, a split may fall in the middle of an interval, and then how will I calculate the throughput for that broken interval? That's the problem I am facing with splitting the file.

SachinSarawgi

2 Answers


I would not try to split a single file for multiple threads. This will create overhead and can't be better than processing several files in parallel.

Create the BufferedReader with a substantial buffer size, e.g. 64k or bigger. The optimum is system-dependent - you'll have to experiment. Added later (prompted by a comment from the OP): the buffer size does not affect the application logic - data is read line by line, and the step from one hour into the next must be handled anyway by carrying the line over into the next batch.
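A minimal sketch of what I mean, assuming a helper hourOf that you would replace with a parser for your actual timestamp format:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class HourlyPass {

    public static void process(String filename) throws IOException {
        // 64 KiB buffer; readLine() hides the buffering entirely, so the
        // per-hour grouping below works no matter where a refill happens.
        try (BufferedReader br = new BufferedReader(new FileReader(filename), 64 * 1024)) {
            int currentHour = -1;
            long count = 0;
            String line;
            while ((line = br.readLine()) != null) {
                int hour = hourOf(line);
                if (hour != currentHour) {
                    if (currentHour != -1) {
                        System.out.println("hour " + currentHour + ": " + count + " requests");
                    }
                    currentHour = hour;
                    count = 0;
                }
                count++;
            }
            if (currentHour != -1) {
                System.out.println("hour " + currentHour + ": " + count + " requests");
            }
        }
    }

    // Hypothetical parser for a timestamp like "8/4/2015:06.00.00" - adapt
    // it to the real log format.
    static int hourOf(String line) {
        String ts = line.split("\\s+")[0];        // first field is the timestamp
        String time = ts.split(":")[1];           // "06.00.00"
        return Integer.parseInt(time.split("\\.")[0]);
    }
}

The hour counter carries across buffer refills automatically, so no interval is ever lost.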

There is no point in reading several lines at a time - readLine just fetches a line from the buffer.

Very likely you are losing time in the analysis.
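One way to check this (a rough sketch; the file name is assumed): time a pass that only reads lines and does nothing else, and compare it against the full run. If the read-only pass finishes in a fraction of the 15 minutes, the analysis is the bottleneck, not the I/O.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadOnlyTiming {

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long lines = 0;
        // Same reading pattern as the real program, but with the analysis removed.
        try (BufferedReader br = new BufferedReader(new FileReader("log1"), 64 * 1024)) {
            while (br.readLine() != null) {
                lines++;
            }
        }
        System.out.printf("%d lines, read-only pass: %.1f s%n",
                lines, (System.nanoTime() - start) / 1e9);
    }
}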

laune
  • I have to analyze the log file in per-hour intervals. If I take 64 KB at first and then do the analysis, it may happen that the chunk contains some of the second interval's data; when I take the next 64 KB, that earlier data will no longer be present, so I will lose it for the second interval. – SachinSarawgi Apr 08 '15 at 06:35
  • This cannot happen if the file is closed after writing. Your original post analyzes 10 files in parallel - they can't all be open. – laune Apr 08 '15 at 06:42
  • Suppose in one thread I open one log file. That log file contains 24 hours of data, starting at 8/4/2015 06:00:00 and running until 9/4/2015 05:59:59. Now I take 64 KB of data from that file. It may happen that this covers the data from 06:00:00 until 07:30:00 (taking one case). I have to analyze the data per hour interval, so I get the throughput from 06:00:00 to 07:00:00, but the data from 07:00:00 to 07:30:00 is left over. When I read the next 64 KB, how will I calculate the throughput from 07:00:00 until 08:00:00, given that the earlier data is lost? – SachinSarawgi Apr 08 '15 at 06:48
  • You have this problem irrespective of the buffer size. It must and can be handled while reading. - A better approach would be to log in smaller instalments, i.e., use a new log file for every hour. This would also reduce the size per file and (depending on the file system) could improve the access time for reading. – laune Apr 08 '15 at 07:04
  • Sorry, but I don't have control over the log files; I am just given a log file from somewhere, and it contains one day of data. – SachinSarawgi Apr 08 '15 at 07:07
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/74694/discussion-between-vickyarora-and-laune). – SachinSarawgi Apr 08 '15 at 07:12
  • Added the result of our discussion to my answer. – laune Apr 08 '15 at 07:55

I don't think you can do the job faster: more threads do not help if your processor doesn't have enough cores.
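If anything, cap the number of worker threads at the core count instead of starting one thread per file. A minimal sketch, reusing the question's MultithreadedThroughput class (the pool sizing is the only change):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CoreBoundPool {

    public static void main(String[] args) {
        List<String> fileNames = Arrays.asList("log1", "log2", "log3", "log4",
                "log5", "log6", "log7", "log8", "log9");

        // One worker per core; the nine files are queued and processed as
        // cores become free, instead of nine threads competing at once.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        for (String name : fileNames) {
            // Call the analysis directly; the pool supplies the threads.
            pool.submit(() -> new MultithreadedThroughput(name).calculateThroughput());
        }
        pool.shutdown();
    }
}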

user4759923