
I have an Apache access log file that already contains some data and is continuously growing. I want to analyze that data using the Spark Streaming API.

Spark is new to me. I created a program that uses the jssc.textFileStream(directory) function to get the log data, but it does not work as per my requirement.

Please suggest some approaches to analyze that log file using Spark.

Here is my code.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    SparkConf conf = new SparkConf()
            .setMaster("spark://192.168.1.9:7077")
            .setAppName("log streaming")
            .setSparkHome("/usr/local/spark")
            .setJars(new String[] { "target/sparkstreamingdemo-0.0.1.jar" });
    // 5-second batch interval
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));
    // Watches the directory for newly created files; files that already exist are ignored
    JavaDStream<String> filerdd = jssc.textFileStream("/home/user/logs");
    filerdd.print();
    jssc.start();
    jssc.awaitTermination();

This code does not return any data from existing files. It only works when I create a new file, but when I append to that new file, the program again does not pick up the updated data.

Kaushal
  • Could you please provide your code? Otherwise it is extremely difficult to identify the root of the problem. – Mikel Urkia Feb 16 '15 at 13:03
  • Hi @MikelUrkia, I added my code here. Please give some guidelines for analyzing log files. – Kaushal Feb 17 '15 at 06:13
  • Spark will process your new files but will not be able to see that you have updated one of the already-processed files. Spark processes each file only once and that is it. You can change the name of the file if you want to process it again. – ahars Feb 18 '15 at 13:31
  • Is there any way to process continuously appended files, like log files? – Kaushal Feb 19 '15 at 06:24

1 Answer


If the file is modified in real time, you can use Tailer from Apache Commons IO. Here is the simplest sample:

    import java.io.File;
    import java.util.concurrent.Executor;

    import org.apache.commons.io.input.Tailer;
    import org.apache.commons.io.input.TailerListener;
    import org.apache.commons.io.input.TailerListenerAdapter;

    public void readLogs(File f, long delay) {
        TailerListener listener = new MyTailerListener();
        Tailer tailer = new Tailer(f, listener, delay);

        // simplistic single-threaded executor, for demo purposes only
        Executor executor = new Executor() {
            public void execute(Runnable command) {
                command.run();
            }
        };
        executor.execute(tailer);
    }

    public class MyTailerListener extends TailerListenerAdapter {
        @Override
        public void handle(String line) {
            // Called for every new line appended to the tailed file
            System.out.println(line);
        }
    }

The code above may be used as a log reader for Apache Flume and applied as a source. You then need to configure a Flume sink to redirect the collected logs to a Spark stream and use Spark to analyze the data from the Flume stream (http://spark.apache.org/docs/latest/streaming-flume-integration.html).
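
For reference, here is a rough sketch of the Spark side of that Flume integration, closely following the linked guide. It assumes the spark-streaming-flume artifact is on the classpath; the host, port, and batch interval below are placeholders to adapt to your own Flume Avro sink configuration:

    import java.nio.charset.StandardCharsets;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.flume.FlumeUtils;
    import org.apache.spark.streaming.flume.SparkFlumeEvent;

    SparkConf conf = new SparkConf().setAppName("flume log streaming");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(5000));

    // Receives events pushed by a Flume Avro sink; host and port must match that sink's config
    JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
            FlumeUtils.createStream(jssc, "localhost", 41414);

    // Decode each Flume event body back into a log line for further analysis
    JavaDStream<String> logLines = flumeStream.map(
            event -> new String(event.event().getBody().array(), StandardCharsets.UTF_8));
    logLines.print();

    jssc.start();
    jssc.awaitTermination();

From there the logLines stream can be filtered, parsed, and aggregated like any other DStream.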

More details about the Flume setup can be found in this post: real time log processing using apache spark streaming

Victor Bashurov