
I am trying to run my code with spark-submit using the command below.

spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project_2.11-1.0.jar

My sbt file has the following dependencies:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"

libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.5.2"

libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.2.0"

My code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer 
import org.apache.spark.sql.SQLContext

object SampleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Sample App").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc) 

    import sqlContext._ 
    import sqlContext.implicits._

    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/root/input/Account.csv", "header" -> "true"))

    val column_names = df.columns
    val row_count = df.count
    val column_count = column_names.length

    var pKeys = ArrayBuffer[String]()

    for ( i <- column_names){
         if (row_count == df.groupBy(i).count.count){
             pKeys += df.groupBy(i).count.columns(0)
         }
     }

    pKeys.foreach(print)
  }
}

The error:

16/03/11 04:47:37 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
    at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
    at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1253)

My Spark version is 1.4.1 and Scala is 2.11.7.

(I am following this link: http://www.nodalpoint.com/development-and-deployment-of-spark-applications-with-scala-eclipse-and-sbt-part-1-installation-configuration/)

I have tried the following versions of spark-csv:

spark-csv_2.10 1.2.0
1.4.0 
1.3.1
1.3.0
1.2.0
1.1.0
1.0.3
1.0.2
1.0.1
1.0.0

etc.

Please help!

Venkataramana
  • For starters, your dependencies are messed up. The SQL version should match the core version. – zero323 Mar 11 '16 at 13:26
  • @zero323 thank you, I will try matching them. But it's not able to load the data :( – Venkataramana Mar 11 '16 at 13:29
  • Next, the `SQLContext.load` method has been deprecated in 1.4.1. Use the `DataFrameReader` methods instead. – zero323 Mar 11 '16 at 13:51
  • Also, have you built Spark with Scala 2.11? – zero323 Mar 11 '16 at 13:51
  • I used this command: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -DskipTests clean package. I have now changed the SQL version to 1.4.1 and will work on DataFrameReader. Thanks for the suggestions! – Venkataramana Mar 11 '16 at 13:56
  • There are examples of how to use the reader in the `spark-csv` README. Finally, add `--packages com.databricks:spark-csv_...` to the `spark-submit` call, replacing `...` with the Scala version and package version. – zero323 Mar 11 '16 at 14:03

5 Answers


Since you are running the job in local mode, add the external jar paths using the --jars option:

spark-submit --class "SampleApp" --master local[2] --jars file:[path-of-spark-csv_2.11.jar],file:[path-of-other-dependency-jar] target/scala-2.11/sample-project_2.11-1.0.jar

e.g.

spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar

Another thing you can do is create a fat jar. In SBT you can follow proper-way-to-make-a-spark-fat-jar-using-sbt, and in Maven refer to create-a-fat-jar-file-maven-assembly-plugin.

Note: mark the scope of the Spark jars (i.e. spark-core, spark-streaming, spark-sql, etc.) as provided, otherwise the fat jar will become too fat to deploy.
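
For the SBT route, a minimal build.sbt sketch could look like the following (this is only a sketch assuming the sbt-assembly plugin is set up; the versions mirror the question and are illustrative):

// build.sbt -- sketch only; assumes the sbt-assembly plugin is configured
name := "sample-project"

version := "1.0"

scalaVersion := "2.11.7"

// Spark itself is supplied by spark-submit, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.4.1" % "provided"

// spark-csv is not bundled with Spark 1.4, so leave it in compile scope so that it
// (and its commons-csv dependency) ends up inside the assembly jar
libraryDependencies += "com.databricks" %% "spark-csv" % "1.2.0"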

Mahendra
  • this worked, thanks a lot, but now I am getting java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat – Venkataramana Mar 14 '16 at 07:45
  • The command used is: spark-submit --class "SampleApp" --master local[2] --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,/root/Downloads/jars/commons.csv-1.2.jar,/root/Downloads/jars/spark-sql_2.11-1.4.1.jar target/scala-2.11/my-proj_2.11-1.0.jar – Venkataramana Mar 14 '16 at 08:04
  • `spark-csv_2.11.jar` depends on `org.apache.commons » commons-csv`, so you also have to add `commons-csv.jar` too. To solve this dependency issue I have updated the answer. – Mahendra Mar 14 '16 at 08:04
  • Try the updated command: `spark-submit --class "SampleApp" --master local[2] --jars file://root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file://root/Downloads/jars/commons.csv-1.2.jar,file://root/Downloads/jars/spark-sql_2.11-1.4.1.jar target/scala-2.11/my-proj_2.11-1.0.jar`; added an extra `/` for the file protocol. – Mahendra Mar 14 '16 at 08:09
  • After several trials the below command worked; file://root... is not able to read the location of the jar, please update the answer with the below syntax. Thanks a lot again! Command used: spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar – Venkataramana Mar 14 '16 at 10:21

A better solution is to use the --packages option, like below:

spark-submit --class "SampleApp" --master local[2] --packages com.databricks:spark-csv_2.10:1.5.0 target/scala-2.11/sample-project_2.11-1.0.jar

Make sure that the --packages option precedes the application jar.

deepdive

You have added the spark-csv library to your sbt config, which means you can compile your code against it,

but that still doesn't mean it is present at runtime (spark-sql and spark-core are there by default).

So try using the --jars option of spark-submit to add the spark-csv jar to the runtime classpath, or build a fat jar (I'm not sure how you are doing it with sbt).
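
For reference, a rough sketch of the sbt-assembly route (the plugin version below is an assumption, not something from the original post):

// project/plugins.sbt -- sketch; pick an sbt-assembly release that matches your sbt version
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

With the Spark dependencies marked as provided in build.sbt, running sbt assembly produces a fat jar (by default named something like target/scala-2.11/sample-project-assembly-1.0.jar) that already contains spark-csv and commons-csv, so it can be passed to spark-submit without any extra --jars.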

Igor Berman
  • He compiled it into his jar, so it should be there (note that he didn't use `provided`, which would indeed omit it) – Daniel Zolnai Mar 11 '16 at 15:46
  • @DanielZolnai, I might be wrong, but you need a special plugin for this: sbt-assembly, sbt-proguard, or sbt-onejar. By default, with a simple 'sbt package', 3rd-party jars will not be assembled into the jar. – Igor Berman Mar 11 '16 at 16:09
  • You might be right, I assumed he was using assembly. – Daniel Zolnai Mar 11 '16 at 16:18
  • Thanks a lot for the solution, now I am getting java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat, any idea on this? – Venkataramana Mar 14 '16 at 07:46
  • I have used: spark-submit --class "SampleApp" --master local[2] --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,/root/Downloads/jars/commons.csv-1.2.jar,/root/Downloads/jars/spark-sql_2.11-1.4.1.jar target/scala-2.11/my-proj_2.11-1.0.jar – Venkataramana Mar 14 '16 at 08:04

You are using the Spark 1.3 syntax for loading a CSV file into a DataFrame.

If you check the repository here, you should use the following syntax on Spark 1.4 and higher:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")
Daniel Zolnai

I was looking for an option where I could skip the --packages option and provide the dependency directly in the assembly jar. The reason I faced this exception was that I used sqlContext.read.format("csv"), which means Spark has to already know the csv data format. Instead, specify where the csv format lives by using sqlContext.read.format("com.databricks.spark.csv"), so that it knows where to look for it and does not throw an exception.
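
In other words, something like the following (a minimal sketch; the path is just the one from the question):

// With spark-csv bundled in the assembly jar on Spark 1.x, name the data source explicitly
val df = sqlContext.read
  .format("com.databricks.spark.csv") // fully qualified source, not just "csv"
  .option("header", "true")
  .load("/root/input/Account.csv")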

Rahul Midha