
I'm very new to Spark and Scala (about two hours new). I'm trying to play with a CSV data file, but I can't because I'm not sure how to deal with the header row. I've searched the internet for a way to load it or skip it, but I don't really know how. I'm pasting the code I'm using; please help me.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object TaxiCaseOne {

case class NycTaxiData(Vendor_Id: String, PickUpdate: String, Droptime: String, PassengerCount: Int, Distance: Double, PickupLong: String, PickupLat: String, RateCode: Int, Flag: String, DropLong: String, DropLat: String, PaymentMode: String, Fare: Double, SurCharge: Double, Tax: Double, TripAmount: Double, Tolls: Double, TotalAmount: Double)

def mapper(line: String): NycTaxiData = {
  val fields = line.split(',')

  NycTaxiData(fields(0), fields(1), fields(2), fields(3).toInt, fields(4).toDouble,
    fields(5), fields(6), fields(7).toInt, fields(8), fields(9), fields(10), fields(11),
    fields(12).toDouble, fields(13).toDouble, fields(14).toDouble, fields(15).toDouble,
    fields(16).toDouble, fields(17).toDouble)
}

def main(args: Array[String]) {

// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
 // Use new SparkSession interface in Spark 2.0
val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
  .getOrCreate()
val lines = spark.sparkContext.textFile("../nyc.csv")

val data = lines.map(mapper)

// Infer the schema, and register the DataSet as a table.
import spark.implicits._
val schemaData = data.toDS

schemaData.printSchema()

schemaData.createOrReplaceTempView("data")

// SQL can be run over DataFrames that have been registered as a table
val vendor = spark.sql("SELECT * FROM data WHERE Vendor_Id == 'CMT'")

val results = vendor.collect()

results.foreach(println)

spark.stop()
  }
}
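For reference, if you do stay on the `textFile`/RDD route, one common way to skip the header row is `mapPartitionsWithIndex`, which lets you drop the first line of the first partition. A sketch, reusing `spark` and `mapper` from the code above, and assuming the header occupies only the first line of the file:

```scala
// Sketch: skip the header row when reading with textFile.
// Assumes the header is the first line of partition 0.
val lines = spark.sparkContext.textFile("../nyc.csv")
val noHeader = lines.mapPartitionsWithIndex { (idx, it) =>
  if (idx == 0) it.drop(1) else it
}
val data = noHeader.map(mapper)
```

This works because `textFile` reads the file in order, so the header always lands at the start of partition 0.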
  • You should load it as `csv` and not as `textFile`...documentation: https://stackoverflow.com/questions/29704333/spark-load-csv-file-as-dataframe/39533431#39533431 – UninformedUser Jul 23 '17 at 19:37

1 Answer


If you have a CSV file, you should use Spark's CSV reader (the spark-csv functionality is built into spark-sql as of Spark 2.0) rather than `textFile`:

val spark = SparkSession
  .builder
  .appName("SparkSQL")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
  .getOrCreate()

val df = spark.read
  .option("header", "true") // treat the first line as the header instead of data
  .csv("../nyc.csv")
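Once you have `df`, the SQL from the question can be run against it. A sketch (the column name `Vendor_Id` assumes the CSV header row matches the case class fields):

```scala
import spark.implicits._

// Register the DataFrame as a temp view so the original SQL query still works
df.createOrReplaceTempView("data")
val vendorSql = spark.sql("SELECT * FROM data WHERE Vendor_Id = 'CMT'")

// Equivalent DataFrame-API form, no temp view needed
val vendorDf = df.filter($"Vendor_Id" === "CMT")
vendorDf.show()
```

Both forms produce the same plan; the DataFrame API just saves you the temp-view registration.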

You need the spark-core and spark-sql dependencies for this to work.
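In sbt form that might look like the following (version numbers are illustrative; use whichever Spark 2.x release you target — since Spark 2.0 the CSV reader ships inside spark-sql, so no separate spark-csv artifact is needed):

```scala
// build.sbt -- versions are illustrative
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)
```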

Hope this helps!

koiralo