
I am new to Scala.

My requirement is to read the input line by line, split each line on particular delimiters, and extract the values into their respective columns in a different file.

Below is my input sample data:

ABC Log

Aug 10 14:36:52 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2410|DeploymentTask|High  eventId=34 externalId=23
Aug 10 15:45:56 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2890|DeploymentTask|Medium eventId=888 externalId=7788
Aug 10 16:40:59 127.0.0.1 CEF:0|NV|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2990|DeploymentTask|Low eventId=989 externalId=0004


XYZ Log

Aug 15 14:32:15 142.101.36.118 cef[10612]: CEF:0|fire|cc|3.5.1|FireEye Acquisition Started
Aug 16 16:45:10 142.101.36.189 cef[10612]: CEF:0|cold|dd|3.5.4|FireEye Acquisition Started
Aug 18 19:50:20 142.101.36.190 cef[10612]: CEF:0|fire|ee|3.5.6|FireEye Acquisition Started

In the above data I need to read the first part under the 'ABC Log' heading, extract the values from each line, and put them under the respective columns. The first few column names are hardcoded, and the last columns I need to extract by splitting on "=", i.e. eventId=34 externalId=23 => col = eventId, value = 34 and col = externalId, value = 23.

Column names 

date time ip_address col1 col2 col3 col4 col5

I want output like the one below:

This is for the first part, 'ABC Log', which should go into one file; the same applies for the rest.

date    time      ip_address  col1   col2    col3                  col4                    col5  col6            col7
Aug 10  14:36:52  127.0.0.1   CEF:0  McAfee  ePolicy Orchestrator  IFSSLCRT0.5.0.5/epo4.0  2410  DeploymentTask  High
Aug 10  15:45:56  127.0.0.1   CEF:0  McAfee  ePolicy Orchestrator  IFSSLCRT0.5.0.5/epo4.0  2890  DeploymentTask  Medium

Below is the code I have been trying:

package AV_POC_Parsing
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.log4j.Logger

// For implicit conversions like converting RDDs to DataFrames

//import org.apache.spark.implicits._

//import spark.implicits._


object LogParser {

  def main(args: Array[String]): Unit = {

  // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // Read the text file into a Spark RDD
    val textFile = sc.textFile("input.txt")


    val splitRdd = textFile.map(line => line.split(" "))
    // RDD[Array[String]]


    // printing values
    splitRdd.foreach { x => x.foreach { y => println(y) } }

    // how to store the split values in different columns and write them to a file

  }
}

How do I split on two delimiters in Scala?

Thanks

niranjan

1 Answer


Maybe this helps you.

import org.apache.spark.{SparkConf, SparkContext}

object DataFilter {

  def main(args: Array[String]): Unit = {

    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // Read the text file into a Spark RDD
    val textFile = sc.textFile("input.txt")

    val splitRdd = textFile.map { s =>
      // split on both delimiters at once: a space or a "|"
      val a = s.split("[ |]")
      // re-attach the two-token date ("Aug 10") and keep the last 10 fields of the line
      val date = Array(a(0) + " " + a(1))
      (date ++ a.takeRight(10)).mkString("\t")
    }
    // RDD[String]: one tab-separated line per record


    // printing values
    splitRdd.foreach(println)

    // how to store the split values in different columns and write them to a file
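
    // One possible way to write the result out (a sketch; "output/abc_log" is an
    // assumed example output path, not given in the question):
    splitRdd.saveAsTextFile("output/abc_log")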
  }
}
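
The question also asks how to get the trailing eventId=34 externalId=23 fields into their own columns. Below is a rough sketch of that second split on "=", building on the same "[ |]" split as above; the filter on "eventId=" (used to pick out only the 'ABC Log' style lines) and the commented-out output path are assumptions based on the sample data, not something the question fixes.

import org.apache.spark.{SparkConf, SparkContext}

object KeyValueColumns {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    val textFile = sc.textFile("input.txt")

    // Keep only the 'ABC Log' style lines; here they are recognised by their
    // trailing key=value fields (an assumption based on the sample data).
    val abcLines = textFile.filter(_.contains("eventId="))

    val rows = abcLines.map { line =>
      // First split: on a space or a "|", as in the snippet above.
      val tokens = line.split("[ |]+")
      // Second split: every token that contains "=" is broken into (key, value),
      // so eventId=34 becomes ("eventId", "34").
      val (fixed, keyValue) = tokens.partition(t => !t.contains("="))
      val pairs = keyValue.map { kv =>
        val Array(k, v) = kv.split("=", 2)
        k -> v
      }
      (fixed, pairs)
    }

    // Print "col = ... value = ..." for the dynamic columns, then the whole row
    // as tab-separated values.
    rows.foreach { case (fixed, pairs) =>
      pairs.foreach { case (k, v) => println(s"col = $k value = $v") }
      println((fixed ++ pairs.map(_._2)).mkString("\t"))
    }

    // The same RDD could then be written out per section, e.g.
    // rows.map { case (fixed, pairs) => (fixed ++ pairs.map(_._2)).mkString("\t") }
    //   .saveAsTextFile("output/abc_log")   // assumed example path
  }
}

The second split on "=" is done per token, so it does not matter how many key=value fields a line carries.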
ashburshui