
I am trying to load the iris CSV dataset in Spark using the following code:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
                .builder()
                .appName("Java Logistic Regression")
                .config("spark.master", "local")
                .getOrCreate();

Dataset<Row> training = spark.read().format("csv").option("header", "true").load("iris.csv");

training.show();

But I keep getting a ClassNotFoundException for the csv data source. This is what the error looks like:

INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:660)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at com.bolzano.classify.Logistic_Regression.main(Logistic_Regression.java:107)
Caused by: java.lang.ClassNotFoundException: csv.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:634)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:634)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
        ... 3 more

I thought Spark 2.4.4 has a built-in CSV reader (it has shipped as part of spark-sql since Spark 2.x), so I did not look up how to use the Databricks CSV reader.

Here are my Maven dependencies (I am using the shade plugin to create an uber-jar; a note on the shade configuration follows the list):

    <dependencies>
        <dependency>
            <groupId>net.imagej</groupId>
            <artifactId>ij</artifactId>
            <version>1.51n</version>
        </dependency>
        <dependency>
            <groupId>sc.fiji</groupId>
            <artifactId>Trainable_Segmentation</artifactId>
            <version>3.2.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>2.4.4</version>
        </dependency>
    </dependencies>
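
Update: one thing I have not ruled out is the shade plugin itself. By default, maven-shade-plugin keeps only one copy of each META-INF/services file when merging jars, including org.apache.spark.sql.sources.DataSourceRegister, which is the file through which Spark's built-in csv source gets registered. Below is an untested sketch of a ServicesResourceTransformer configuration that merges those files instead of letting one jar's copy win (the plugin version is a placeholder, not from my actual pom):

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <!-- placeholder version, not from my actual pom -->
                <version>3.2.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <!-- merges META-INF/services files (including
                                     org.apache.spark.sql.sources.DataSourceRegister)
                                     instead of overwriting them -->
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>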
  • Even though https://stackoverflow.com/a/29705881/7109162 says that spark-csv is a part of spark-core, that doesn't seem to be the case for spark-core_2.12 v2.4.4. `com.databricks:spark-csv_2.11:1.5.0` seems to be a required dependency now – XtremeBaumer Dec 11 '19 at 11:32
  • @XtremeBaumer Oh that sucks, so after adding that dependency should I use it the same way, or is there a different way of importing CSV files? – Vinay Bharadhwaj Dec 11 '19 at 11:34
  • As far as I can tell, the code stays the same, but you just gotta try and see what happens – XtremeBaumer Dec 11 '19 at 11:41
  • @XtremeBaumer No, I am not able to get it working that way. I had to create an SQLContext, but I was not able to run logistic regression on it because the fit method seems to expect a "features" column, while my data only has the header column names. This seems like a very poor implementation of ML. But thanks a lot for the comment – Vinay Bharadhwaj Dec 11 '19 at 13:41
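
Update: on the "features" problem from the comments — from what I can tell, Spark ML estimators expect a single vector column named "features" and a numeric "label" column, rather than the raw header-named CSV columns. Here is an untested sketch of assembling those columns; the column names (sepal_length, sepal_width, petal_length, petal_width, species) are my assumption about the iris.csv header:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

// Re-read with schema inference so the numeric columns come in as doubles
// instead of strings (assumes a header row with the usual iris column names).
Dataset<Row> typed = spark.read()
                .format("csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("iris.csv");

// Collapse the four measurement columns into the single "features" vector
// column that Spark ML estimators look for by default.
VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"sepal_length", "sepal_width", "petal_length", "petal_width"})
                .setOutputCol("features");

// Turn the string species column into the numeric "label" column.
StringIndexer indexer = new StringIndexer()
                .setInputCol("species")
                .setOutputCol("label");

Dataset<Row> assembled = assembler.transform(typed);
Dataset<Row> prepared = indexer.fit(assembled).transform(assembled);

// fit() should now find the "features" and "label" columns it expects.
new LogisticRegression().fit(prepared);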
