4

I am trying to create a Spark application in Scala that reads a .csv file located in the src/main/resources directory and saves it to the local HDFS instance. Everything works fine when I run it locally, but when I bundle it as a .jar file and deploy it on a server, something goes wrong.

This is my code, which is located in src/main/scala; my data file is at src/main/resources/dataset.csv:

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(getClass.getResource("dataset.csv").toString())

When I build a jar by calling sbt package and deploy it to my server, however, I receive the following error:

Exception in thread "main" java.lang.IllegalArgumentException: 
java.net.URISyntaxException: 
Relative path in absolute URI: jar:file:/root/./myapp_2.11-0.1.jar!/dataset.csv

How can I correctly link to my file?

hY8vVpf3tyR57Xib
  • Can you just store the csv file in HDFS and read it from your Spark job and then write it back out? This seems like a better design, to separate the data from the app that processes it. – MBillau Mar 07 '19 at 17:01
  • Possible duplicate of [How do I use Java getResource() to get a resource from a parent directory?](https://stackoverflow.com/questions/14389731/how-do-i-use-java-getresource-to-get-a-resource-from-a-parent-directory) – abiratsis Mar 07 '19 at 23:18

3 Answers

5

Use getPath() on the URL object returned from getResource to get an absolute path:

getClass.getResource("data.csv").getPath()

This yields a path like:

/upload-data-scala-project/target/scala-2.11/classes/data.csv

Using toString will give you a string representation of the URL like:

file:/upload-data-scala-project/target/scala-2.11/classes/data.csv

which has no leading slash, and is thus interpreted as a relative path.
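
To make the difference between the two string forms concrete, here is a small sketch that uses only java.net from the JDK. The object name UrlForms is made up for illustration; the two URLs are the ones quoted in this answer and in the question's error message. Note that for a jar: URL, getPath still returns a string that points inside the archive, so it does not name a regular file on disk, which may be related to the problems reported in the comments below.

```scala
import java.net.{URI, URL}

// Hypothetical helper object, for illustration only.
object UrlForms {
  // A resource in a plain classes directory, as seen in a local run.
  val dirUrl: URL =
    URI.create("file:/upload-data-scala-project/target/scala-2.11/classes/data.csv").toURL

  // A resource packaged inside a jar (the URL from the question's error message).
  val jarUrl: URL =
    URI.create("jar:file:/root/./myapp_2.11-0.1.jar!/dataset.csv").toURL

  def main(args: Array[String]): Unit = {
    println(dirUrl.toString) // full URL, including the "file:" scheme
    println(dirUrl.getPath)  // path component only
    println(jarUrl.toString) // full URL, including the "jar:" scheme
    println(jarUrl.getPath)  // still contains the "!/" jar separator
  }
}
```

Running UrlForms.main(Array.empty) prints the four strings so they can be compared side by side.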

minikomi
  • I receive "21/02/24 10:53:27 ERROR yarn.Client: Application diagnostics message: User class threw exception: java.lang.NullPointerException" when executed in Yarn. – Andre Vieira de Lima Feb 24 '21 at 13:57
  • @AndreVieiradeLima I'm not sure why the above isn't working, but I got the same error you did. I changed it to `getClass.getClassLoader.getResource(filename).getPath` and it came through. – davidshere Jul 25 '21 at 03:31
  • @davidshere I solved the error by adding a leading slash: `getClass.getResource(s"/$filename").getPath` – Jeremy Aug 21 '22 at 17:10
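
The two fixes in these comments have the same effect: Class.getResource resolves a name without a leading slash relative to the class's package, while a leading slash (or going through the class loader) resolves it from the classpath root, and a failed lookup returns null, which would explain the NullPointerException. A minimal sketch of a safer lookup follows, assuming it is acceptable to copy the resource to a temporary file first. ResourceFiles, open and copyToTemp are hypothetical names, not part of Spark or of the answers here.

```scala
import java.io.{FileNotFoundException, InputStream}
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical helper, not part of Spark or of the original answers.
object ResourceFiles {
  // Looks the resource up from the classpath root.
  // Returns None instead of null when nothing is found.
  def open(name: String): Option[InputStream] = {
    val absolute = if (name.startsWith("/")) name else s"/$name"
    Option(getClass.getResourceAsStream(absolute))
  }

  // Copies the resource into a temporary file and returns that file's path.
  // This also works when the resource lives inside a jar.
  def copyToTemp(name: String): Path = {
    val in = open(name).getOrElse(
      throw new FileNotFoundException(s"Resource not found on classpath: $name"))
    try {
      val target = Files.createTempFile("resource-", ".tmp")
      Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
      target
    } finally in.close()
  }
}
```

With a session named spark, something like `spark.read.option("header", "true").csv(ResourceFiles.copyToTemp("dataset.csv").toUri.toString)` should then load the file when the driver and executors share a file system, for example in local mode. On a cluster the temporary file exists only on the driver machine, so putting the file into HDFS first, as suggested in the comments on the question, is likely the more robust option.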
0

When you reference a path under your resources folder and deploy the code to a cluster, the resources folder ends up in a location determined by the path configured in your deployment setup. Accordingly, you can refer to the file by the complete path of the resources folder.

Erick
-4

From the error message, it looks like Spark is expecting an absolute path and you are giving it a relative path to the file. I always provide an absolute path to the file (hdfs:// if the file is in HDFS or file:// if the file is local). Sample code below.

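import org.apache.spark.sql.SparkSession
// Set the master explicitly with .master (the underlying property name is spark.master)
val spark = SparkSession.builder.appName("My spark app").master("yarn").getOrCreate()
import spark.implicits._
// Fully qualified URI: hdfs:// for files in HDFS, file:// for local files
val df = spark.read.json("hdfs:///user/amalprakash32203955/data/people.json")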
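
For the local-file case mentioned above, a fully qualified file: URI can be derived from an ordinary path with the JDK alone. This is only a sketch; QualifiedPaths and localUri are hypothetical names, and the path in the usage note is a placeholder.

```scala
import java.nio.file.Paths

// Hypothetical helper, for illustration only.
object QualifiedPaths {
  // Turns a local path (relative or absolute) into an absolute file: URI string.
  def localUri(path: String): String =
    Paths.get(path).toAbsolutePath.normalize.toUri.toString
}
```

For example, `spark.read.json(QualifiedPaths.localUri("data/people.json"))` passes Spark an absolute URI regardless of the working directory the job was started from.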