
I am trying to read a log file using Spark Core (RDD), and I do not have the spark-csv API to process it smoothly, so I have to read it as a text file and later tweak it to form a DF. I am done up to that point, but now all the data is coming in a single column, when ideally it should create 30+ columns.

Sample data (only a few columns are shown here; the file has more than 30):

ROUTE_NAME,THREADID,REQUESTID,TRANSTATUS,FailureReason,ServiceStartTime,ServiceEndTime
TCPIP,5,F20011,null,FATAL-23,24Jul2017 20:00:11.918,24Jul2017 20:00:20.090

What I have tried so far:

val Fcore = sc.textFile("/home/data/instrumentationLog.log")

val FcoreZip = Fcore.zipWithIndex() // added an index to remove the header from the data

val FcoreData = FcoreZip.filter(s => s._2 > 0).map(_._1) // header removed (the index is the second tuple element); keep only the line text

val FcoreDF = FcoreData.toDF() // formed a DF

Up to this point the complete data is in the DF, but it comes in a single column. Kindly guide me on how to split it into multiple columns for further processing.
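A minimal sketch of the splitting step (not from the post itself), assuming FcoreData is the header-less RDD[String] built above: split each line on commas and name the columns in toDF. The column names come from the sample header; note that the tuple-based toDF route caps at 22 columns, so for the full 30+ columns see the schema-based sketch after the comment thread below.

val splitDF = FcoreData
  .map(_.split(",", -1)) // -1 keeps trailing empty fields
  .map(a => (a(0), a(1), a(2), a(3), a(4), a(5), a(6)))
  .toDF("ROUTE_NAME", "THREADID", "REQUESTID", "TRANSTATUS",
        "FailureReason", "ServiceStartTime", "ServiceEndTime")

splitDF.show()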

  • The input looks like a CSV file, isn't it? Why don't you read it using Spark SQL? – Ramesh Maharjan Aug 01 '18 at 08:20
  • See https://stackoverflow.com/questions/43508054/spark-sql-how-to-read-a-tsv-or-csv-file-into-dataframe-and-apply-a-custom-sche – thebluephantom Aug 01 '18 at 08:25
  • I am using Spark 1.6 and cannot use spark-csv from Databricks or any other external jars here.. and it should be achieved using RDD only... this is an interview question – user1708054 Aug 01 '18 at 08:33
  • Then you have some studying to do. And tell them to move to 2.x – thebluephantom Aug 01 '18 at 09:10
  • You can use schemaRDD approach – thebluephantom Aug 01 '18 at 09:16
  • @thebluephantom migration is not the solution... in the interview they want us to solve the problem with the current setup.. like here the scenario is Spark 1.6 without any third-party jar... so plain RDD is supposed to be used. :) That's the issue; otherwise it is quite easy with the SQL format. – user1708054 Aug 01 '18 at 14:14
  • Yes, life is hard, but I think you can use the schemaRDD approach as I suggested – thebluephantom Aug 01 '18 at 14:16
  • @RameshMaharjan ... please share any link if this has already been asked.. I searched for the exact requirement but couldn't find it.... it's easier to understand something already written than to ask a new question here :) – user1708054 Aug 01 '18 at 14:17
  • @thebluephantom thanks for the inputs, checking on the schemaRDD part .. will update if it gets resolved – user1708054 Aug 01 '18 at 14:19
  • Are you reading logfile or logfiles? I see one file, but could you not have more files to process? – thebluephantom Aug 01 '18 at 15:37
  • I have an answer for you but cannot place it here as it is a duplicate. – thebluephantom Aug 01 '18 at 15:45
  • It is a bit simpler than the other one – thebluephantom Aug 01 '18 at 15:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/177278/discussion-between-user1708054-and-thebluephantom). – user1708054 Aug 02 '18 at 11:45
  • Hi all, below is something I was trying .. hope it helps someone:

    val filesRDD = sc.textFile("/home/user/instrumentationLog.log", 1)
    val FcoreZip = filesRDD.zipWithIndex().filter(s => s._2 > 0)
    val FcoreCol1 = FcoreZip.map(s => s._1)
    val linesRDD = FcoreCol1.map(line => line.trim.split(","))
      .map(entries => (entries(0), entries(1).toInt, entries(2), entries(3)))
    val df = linesRDD.toDF("cola", "colb", "colc", "cold")
    df.show

    Thanks @thebluephantom... I want to upvote your suggestions but am able to do it only once. Kindly let me know how to do that. – user1708054 Aug 02 '18 at 15:52
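For reference, a minimal sketch of the schemaRDD-style approach thebluephantom suggests above, which scales past the 22-field tuple limit to the full 30+ columns. This is not code from the thread; it assumes a Spark 1.6 sqlContext is in scope and that every field can be read as a string.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Derive the schema from the header line itself, so all 30+ columns are covered.
val lines = sc.textFile("/home/data/instrumentationLog.log")
val header = lines.first()
val schema = StructType(header.split(",").map(name => StructField(name, StringType, nullable = true)))

// Drop the header row, split each remaining line, and build Rows (-1 keeps trailing empty fields).
val rows = lines.zipWithIndex().filter(_._2 > 0).map(_._1)
  .map(line => Row.fromSeq(line.split(",", -1)))

val df = sqlContext.createDataFrame(rows, schema)
df.show()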

0 Answers