
I am trying to read a log file using Spark Core (RDD), and I do not have the spark-csv API to process it smoothly, so I have to read it as a text file and later tweak it to form a DF. I am done up to that point, but now all the data is coming in a single column, when ideally it should create 30+ columns.

Sample data (only a few columns are shown here; the file has more than 30):

ROUTE_NAME,THREADID,REQUESTID,TRANSTATUS,FailureReason,ServiceStartTime,ServiceEndTime
TCPIP,5,F20011,null,FATAL-23,24Jul2017 20:00:11.918,24Jul2017 20:00:20.090

What I have tried so far:

val Fcore = sc.textFile("/home/data/instrumentationLog.log")

val FcoreZip = Fcore.zipWithIndex() // added an index to remove the header from the data

val FcoreData = FcoreZip.filter(s => s._2 > 0).map(_._1) // header removed (the index is the second tuple element); keep only the line text

val FcoreDF = FcoreData.toDF() // formed a DF

Up to this point the complete data is in the DF, but it comes in a single column. Kindly guide me on how to split it into multiple columns for further processing.
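A minimal sketch of the splitting step (not from the post itself), assuming FcoreData is the header-less RDD[String] built above: split each line on commas and name the columns in toDF. The column names come from the sample header; note that the tuple-based toDF route caps at 22 columns, so for the full 30+ columns see the schema-based sketch after the comment thread below.

val splitDF = FcoreData
  .map(_.split(",", -1)) // -1 keeps trailing empty fields
  .map(a => (a(0), a(1), a(2), a(3), a(4), a(5), a(6)))
  .toDF("ROUTE_NAME", "THREADID", "REQUESTID", "TRANSTATUS",
        "FailureReason", "ServiceStartTime", "ServiceEndTime")

splitDF.show()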

  • The input looks like a CSV file, isn't it? Why don't you read it using Spark SQL? – Ramesh Maharjan Aug 01 '18 at 08:20
  • See https://stackoverflow.com/questions/43508054/spark-sql-how-to-read-a-tsv-or-csv-file-into-dataframe-and-apply-a-custom-sche – thebluephantom Aug 01 '18 at 08:25
  • I am using Spark 1.6 and cannot use spark-csv from Databricks or any other external jars here.. and it should be achieved using RDD only... this is an interview question – user1708054 Aug 01 '18 at 08:33
  • Then you have some studying to do. And tell them to move to 2.x – thebluephantom Aug 01 '18 at 09:10
  • You can use schemaRDD approach – thebluephantom Aug 01 '18 at 09:16
  • @thebluephantom migration is not the solution... in the interview they want us to solve the problem with the current setup.. like here the scenario is Spark 1.6 without any third-party jar... so plain RDD is supposed to be used. :) That's the issue; otherwise it is quite easy with the SQL format. – user1708054 Aug 01 '18 at 14:14
  • Yes, life is hard, but I think you can use the schemaRDD approach as I suggested – thebluephantom Aug 01 '18 at 14:16
  • @RameshMaharjan ... please share any link if this has already been asked.. I searched for the exact requirement but couldn't find it.... it's easier to understand something already written than to ask a new question here :) – user1708054 Aug 01 '18 at 14:17
  • @thebluephantom thanks for the inputs, checking on the schemaRDD part .. will update if it gets resolved – user1708054 Aug 01 '18 at 14:19
  • Are you reading logfile or logfiles? I see one file, but could you not have more files to process? – thebluephantom Aug 01 '18 at 15:37
  • I have an answer for you but cannot place it here as it is a duplicate. – thebluephantom Aug 01 '18 at 15:45
  • It is a bit simpler than the other one – thebluephantom Aug 01 '18 at 15:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177278/discussion-between-user1708054-and-thebluephantom). – user1708054 Aug 02 '18 at 11:45
  • Hi all, below is something I was trying .. hope it helps someone:

    val filesRDD = sc.textFile("/home/user/instrumentationLog.log", 1)
    val FcoreZip = filesRDD.zipWithIndex().filter(s => s._2 > 0)
    val FcoreCol1 = FcoreZip.map(s => s._1)
    val linesRDD = FcoreCol1.map(line => line.trim.split(","))
      .map(entries => (entries(0), entries(1).toInt, entries(2), entries(3)))
    val df = linesRDD.toDF("cola", "colb", "colc", "cold")
    df.show

    Thanks @thebluephantom... I want to upvote your suggestions but am able to do it only once. Kindly let me know how to do that. – user1708054 Aug 02 '18 at 15:52
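For reference, a minimal sketch of the schemaRDD-style approach thebluephantom suggests above, which scales past the 22-field tuple limit to the full 30+ columns. This is not code from the thread; it assumes a Spark 1.6 sqlContext is in scope and that every field can be read as a string.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Derive the schema from the header line itself, so all 30+ columns are covered.
val lines = sc.textFile("/home/data/instrumentationLog.log")
val header = lines.first()
val schema = StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))

// Drop the header row, split each remaining line, and build Rows (-1 keeps trailing empty fields).
val rows = lines.zipWithIndex().filter(_._2 > 0).map(_._1)
  .map(line => Row.fromSeq(line.split(",", -1)))

val df = sqlContext.createDataFrame(rows, schema)
df.show()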

0 Answers