I have an input file (CSV) containing up to 20 columns. I have to filter the input file based on the number of columns: if a row contains 20 columns it is considered good data, otherwise bad data.
Input File:
123456,"ID_SYS",12,"Status_code","feedback","HIGH","D",""," ",""," ","","
",9999," ",2013-05-02,9999-12-31,"N",1,2
I am reading the file as an RDD, splitting each line on "," and checking whether the row contains 20 columns:
val rdd = SparkConfig.spark.sparkContext.textFile(CommonUtils.loadConf.getString("conf.inputFile"))
val splitRDD = rdd.map(line => Row.fromSeq(line.split(",")))
val goodRDD = splitRDD.filter(arr => arr.size == 20)
I have to convert goodRDD into a DataFrame/Dataset to apply some transformations. I tried the code below:
val rowRdd = splitRDD.map {
  case Array(c1, c2, c3, ..., c20) => Row(c1.toInt, c2, ...)
  case _ => badCount += 1
}
val ds = SparkConfig.spark.sqlContext.createDataFrame(rowRdd, inputFileSchema)
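As an aside, I am not sure the badCount += 1 above even works, since the map closure runs on the executors rather than on the driver. For now I would count the bad rows with a separate pass over splitRDD, e.g.:

// counts the rows that did not split into exactly 20 fields
val badCount = splitRDD.filter(_.size != 20).count()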
Since I have 20 columns, do I really have to write out all 20 columns in the pattern match? I would like to know the best way to get to the right solution.
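To avoid spelling out all 20 columns, something along the lines of the sketch below is what I have in mind: keep every field as a string, filter on the field count, and then cast each column using the types already declared in inputFileSchema. It reuses rdd, inputFileSchema and SparkConfig.spark from above and assumes each line really does split cleanly on "," (so it is only a rough sketch, not something I have tested). Is this the right direction, or is there a cleaner way?

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkConfig.spark

// same column names as inputFileSchema, but every field typed as String for the first pass
val stringSchema = StructType(inputFileSchema.fields.map(f => StructField(f.name, StringType, nullable = true)))

// split(_, -1) keeps trailing empty fields; keep only rows with exactly 20 fields
val goodRowRdd = rdd
  .map(_.split(",", -1))
  .filter(_.length == 20)
  .map(fields => Row.fromSeq(fields))

val stringDf = spark.createDataFrame(goodRowRdd, stringSchema)

// cast each column to the type declared in inputFileSchema, no 20-way pattern match needed
val ds = inputFileSchema.fields.foldLeft(stringDf) { (df, field) =>
  df.withColumn(field.name, df(field.name).cast(field.dataType))
}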