I am trying to load the iris CSV dataset in Spark with the following code:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("Java Logistic Regression")
        .config("spark.master", "local")
        .getOrCreate();

Dataset<Row> training = spark.read().format("csv").option("header", "true").load("iris.csv");
training.show();
But I keep getting a ClassNotFoundException for the csv data source. This is what the error looks like:
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.bolzano.classify.Logistic_Regression.main(Logistic_Regression.java:107)
Caused by: java.lang.ClassNotFoundException: csv.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:634)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:634)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
... 3 more
I thought that Spark 2.4.4 has a built-in CSV reader, so I did not look up how to use the Databricks CSV reader.
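For reference, this is the built-in shorthand I expected to be equivalent to the format("csv") call above; I assume it resolves to the same data source, so it would presumably fail with the same error:

// Shorthand for the built-in CSV source. In Spark 2.x this should resolve to
// org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, the same source
// that the format("csv") call above fails to find.
Dataset<Row> iris = spark.read()
        .option("header", "true")
        .csv("iris.csv");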
Here are my Maven dependencies (I am using the Shade plugin to create an uber-jar):
<dependencies>
    <dependency>
        <groupId>net.imagej</groupId>
        <artifactId>ij</artifactId>
        <version>1.51n</version>
    </dependency>
    <dependency>
        <groupId>sc.fiji</groupId>
        <artifactId>Trainable_Segmentation</artifactId>
        <version>3.2.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.12</artifactId>
        <version>2.4.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>2.4.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>2.4.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>2.4.4</version>
    </dependency>
</dependencies>
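One suspicion I have is that the Shade plugin is involved: Spark registers its built-in sources in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, and if the uber-jar does not merge those service files, the csv entry could be lost. The following is only a sketch of the transformer configuration I believe would merge them (the version number and layout are illustrative, not my actual pom), but I have not confirmed that this is the cause:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <!-- Merges META-INF/services files from all jars instead of keeping
                         only one, so the DataSourceRegister entries for the built-in
                         sources (including csv) would survive shading -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>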