
I have a CSV file stored in an HDFS instance running locally on Windows (hdfs://localhost:54310), under the path /tmp/home/. I would like to load this file from HDFS into a Spark DataFrame. So I tried this

val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()

and then

val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._

spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
  .show()
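
For the record, Spark 2.0 also ships a built-in CSV source, so the same read can be written without the Databricks package. An equivalent sketch (just an alternative spelling, not a fix):

// Same read, using Spark 2.0's built-in CSV reader
spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)
  .show()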

But it fails at runtime with the exception below. Stack trace:

Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)

C:/test/sampleApp/ is the path where my sample project lives, but I have specified the HDFS path. Note that the path in the exception is the local spark-warehouse directory, not the HDFS path I passed to load().

Additionally, this works perfectly fine with a plain RDD:

val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) // prints the first row of the CSV file
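
To rule out connectivity problems, the file can also be checked directly through the Hadoop FileSystem API, bypassing Spark SQL entirely. A small diagnostic sketch, assuming the same sc as above:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Ask HDFS directly whether the file exists
val fs = FileSystem.get(new URI("hdfs://localhost:54310"), sc.hadoopConfiguration)
println(fs.exists(new Path("/tmp/home/mycsv.csv"))) // should print true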

I found and tried this as well, but no luck :(

Am I missing something? Why is Spark looking at my local file system and not at HDFS?

I am using Spark 2.0 on Hadoop HDFS 2.7.2 with Scala 2.11.

EDIT: One additional piece of information: I tried downgrading to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.


1 Answer


Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised:

https://issues.apache.org/jira/browse/SPARK-15899
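
Until the fix lands, a workaround reported on that ticket is to set spark.sql.warehouse.dir to a well-formed absolute URI when building the session. A minimal sketch, assuming a writable local directory (the warehouse path, master, and app name below are just examples):

import org.apache.spark.sql.SparkSession

// Point the catalog at an absolute, well-formed warehouse URI so that
// SessionCatalog.makeQualifiedPath does not fail on Windows-style paths
val spark = SparkSession.builder
  .master("local[*]") // example master
  .appName("csv-from-hdfs") // example app name
  .config("spark.sql.warehouse.dir", "file:///C:/test/sampleApp/spark-warehouse")
  .getOrCreate()

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs://localhost:54310/tmp/home/mycsv.csv")
  .show()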
