I am new to Scala. My requirement is to read a file line by line, split each line on particular delimiters, and write the extracted values into their respective columns in a different file.
Below is my sample input data:
ABC Log
Aug 10 14:36:52 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2410|DeploymentTask|High eventId=34 externalId=23
Aug 10 15:45:56 127.0.0.1 CEF:0|McAfee|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2890|DeploymentTask|Medium eventId=888 externalId=7788
Aug 10 16:40:59 127.0.0.1 CEF:0|NV|ePolicy Orchestrator|IFSSLCRT0.5.0.5/epo4.0|2990|DeploymentTask|Low eventId=989 externalId=0004
XYZ Log
Aug 15 14:32:15 142.101.36.118 cef[10612]: CEF:0|fire|cc|3.5.1|FireEye Acquisition Started
Aug 16 16:45:10 142.101.36.189 cef[10612]: CEF:0|cold|dd|3.5.4|FireEye Acquisition Started
Aug 18 19:50:20 142.101.36.190 cef[10612]: CEF:0|fire|ee|3.5.6|FireEye Acquisition Started
In the above data I need to read the first part, under the 'ABC Log' heading, extract the values from each line, and put each value under its respective column. The first few column names are hardcoded; the last columns I need to derive by splitting on "=", i.e. eventId=34 externalId=23 => col = eventId, value = 34 and col = externalId, value = 23.
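For the "=" part, this is roughly the split I have in mind for the trailing pairs (parseKeyValues is just a name I made up for illustration):

// Illustrative helper: "eventId=34 externalId=23" -> (eventId,34), (externalId,23)
def parseKeyValues(s: String): Seq[(String, String)] =
  s.split("\\s+").toSeq.collect {
    case kv if kv.contains("=") =>
      val Array(key, value) = kv.split("=", 2)
      (key, value)
  }

// parseKeyValues("eventId=34 externalId=23")
// => Seq((eventId,34), (externalId,23))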
Column names:
date time ip_address col1 col2 col3 col4 col5 col6 col7
I want output like below. This is for the first part, 'ABC Log'; it goes into one file, and the same applies for the remaining parts (see the sectioning sketch after the sample output):
date time ip_address col1 col2 col3 col4 col5 col6 col7
Aug 10 14:36:52 127.0.0.1 CEF:0 McAfee ePolicy Orchestrator IFSSLCRT0.5.0.5/epo4.0 2410 DeploymentTask High
Aug 10 15:45:56 127.0.0.1 CEF:0 McAfee ePolicy Orchestrator IFSSLCRT0.5.0.5/epo4.0 2890 DeploymentTask Medium
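Since each heading's lines go to a separate file, I assume I first have to group the lines by heading. Here is a plain-Scala sketch of what I mean, assuming a heading is any line ending in "Log" (my reading of the sample input):

import scala.io.Source

val lines = Source.fromFile("input.txt").getLines().toList

// Fold over the lines, tracking the current heading and the grouped sections
val sections: Map[String, List[String]] =
  lines.foldLeft((Option.empty[String], Map.empty[String, List[String]])) {
    case ((_, acc), line) if line.trim.endsWith("Log") =>
      (Some(line.trim), acc + (line.trim -> Nil))   // start a new section
    case ((Some(h), acc), line) =>
      (Some(h), acc.updated(h, acc(h) :+ line))     // append to current section
    case (state, _) =>
      state                                         // ignore lines before any heading
  }._2

// sections("ABC Log") should then hold the three McAfee lines.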
Below is the code I have been trying:
package AV_POC_Parsing

import org.apache.spark.{SparkConf, SparkContext}

// Renamed from "object scala", which shadows the scala package
object AvLogParsing {
  def main(args: Array[String]): Unit = {
    // Create a Spark context with a local master for testing
    val sc = new SparkContext(
      new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    // Read the text file into an RDD[String]
    val textFile = sc.textFile("input.txt")

    // Split each line on spaces: RDD[Array[String]]
    val splitRdd = textFile.map(line => line.split(" "))

    // Print the split values (runs on the executors, so only useful locally)
    splitRdd.foreach(x => x.foreach(println))

    // How do I store the split values in separate columns and write them to a file?
  }
}
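To make the question concrete, this is the direction I am imagining for the 'ABC Log' part. The tab-separated output and the filter that skips heading lines are my own assumptions:

import org.apache.spark.{SparkConf, SparkContext}

object AvLogSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("AV_Log_Processing").setMaster("local[*]"))

    sc.textFile("input.txt")
      // Keep only data lines, not headings (a real run would first
      // section the file per heading, as sketched above)
      .filter(_.contains("CEF:0"))
      .map { line =>
        // First delimiter: spaces. The first four tokens are date (two tokens),
        // time and IP; the fifth piece is the rest of the line, kept intact.
        val Array(month, day, time, ip, rest) = line.split(" ", 5)
        // Second delimiter: '|'. split takes a regex, so '|' must be escaped.
        val cefFields = rest.split("\\|")
        (Seq(s"$month $day", time, ip) ++ cefFields).mkString("\t")
      }
      .saveAsTextFile("abc_output")   // writes one part-file per partition

    sc.stop()
  }
}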
How do I split on two delimiters in Scala? Is the two-step approach in the sketch above (first on spaces, then on an escaped "|") the right way?
Thanks