
I have the below code running on a cluster:

def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("SparkData").getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    import spark.implicits._
    import spark.sql
    //----------Write Logic Here--------------------------
    // Read the CSV file
    val df = spark.read.format("csv").load("books.csv") // Here I want to accept a parameter
    df.show()
    spark.stop()
}

I want to pass different files to spark.read.format using the spark-submit command. The files are on my Linux box. I used this:

csv_file="/usr/usr1/Test.csv"

spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files  myprop.properties,${csv_file} \
  abc.jar
 

However, the program just looks for the file under the root folder of the HDFS cluster and throws a file-not-found exception. Can anyone please help me use the file from the file path I mention? I want my Spark program to read the file from the path I specify, not from the root.

I tried:

def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("SparkData").getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    import spark.implicits._
    import spark.sql
    val filepath = args(0)
    //----------Write Logic Here--------------------------
    // Read the CSV file
    val df = spark.read.format("csv").load(filepath) // The file path comes from the command-line argument
    df.show()
    spark.stop()
}

I used the below command to submit, which doesn't work:

csv_file="/usr/usr1/Test.csv"

spark2-submit \
--num-executors 30 \
--driver-memory 12g \
--executor-memory 14g \
--executor-cores 4 \
--class driver_class \
--name TTTTTT \
--master yarn \
--deploy-mode cluster \
--files  myprop.properties \
  abc.jar  ${csv_file}

But the program is not picking up the file. Can anyone please help?

VnS
3 Answers


The URL format for local files should be: csv_file="file:///usr/usr1/Test.csv".

Note that the local files must also be accessible at the same path on all worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
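For illustration, a minimal sketch of what the load call would look like with the explicit file:// scheme (the path is the one from the question and only an example; as noted above, it has to exist at that location on the relevant nodes):

// Read a CSV that lives on the local file system,
// using an explicit file:// URL instead of the default file system.
val df = spark.read.format("csv").load("file:///usr/usr1/Test.csv")
df.show()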

过过招
  • Using my code the file is being read, but Spark uploads the file from the Unix box path to the HDFS location `user/usr1/.sparkStaging/application_XXXXX/`. However, spark.read.format looks for the file at path `user/usr1/`, and this is the root folder – VnS Sep 29 '21 at 07:23
  • I don't think that's true: adding "file://" should not be required (it's the default), and the CSV file does not need to be on all workers, only the master – Juh_ Sep 29 '21 at 07:27
  • Ah, maybe you meant on all workers because you don't know where YARN will start the driver? – Juh_ Sep 29 '21 at 07:28
  • Yes. In cluster mode, any node may become the driver or an executor. If the file is located on HDFS, give its HDFS absolute path. – 过过招 Sep 29 '21 at 07:47

I don't have a cluster at hand right now, so I cannot test this. However:

  • You submit the code to YARN, so it will deploy the Spark driver on one of the cluster's nodes, but you don't know which one.
  • When reading a path that starts with "file://" or has no scheme, Spark will look for the file on the file system of the node the driver is running on.
  • As you've seen, using spark-submit --files will copy the file into the starting folder of the Spark driver (so on the master node). That path is kind of arbitrary, and you should not try to infer it.

But maybe it would work to pass just the file name as the argument to spark.read and let the Spark driver look for it in its starting folder (I didn't check):

spark-submit \
  ... \
  --files ...,/path/to/your/file.csv \
  abc.jar file.csv

=> The proper/standard way to do it is: first copy your file(s) to HDFS, or to another distributed file system the Spark cluster has access to. Then you can pass the HDFS file path to the Spark app. Something like (again, not tested):

   hdfs dfs -put /path/to/your/file.csv /user/your/data
   spark-submit ... abc.jar hdfs:///user/your/data/file.csv

For info, if you don't know: to use the hdfs command, you need the HDFS client installed on your machine (the actual hdfs command), with the suitable configuration pointing to the HDFS cluster. There is also usually some security configuration to do on the cluster so the client can communicate with it. But that is another issue that depends on where HDFS is running (local, AWS, ...).
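If it helps, here is a small, untested sketch (the path is illustrative) of checking from inside the Spark app that the HDFS path actually exists, using the Hadoop FileSystem API:

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration of the running Spark session,
// so this points at the same HDFS cluster Spark reads from.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val target = new Path("hdfs:///user/your/data/file.csv")
println(s"Exists on HDFS: ${fs.exists(target)}")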

Juh_
  • I tried both. Neither works – VnS Sep 29 '21 at 08:17
  • I might have made a mistake in the code, but the proposed solution (the 2nd one) is the way to go. When using HDFS, check that the file is actually properly copied there. Then, in the Spark app, log the argument passed to spark.read. Finally, what error is Spark outputting? "File not found"? – Juh_ Sep 29 '21 at 08:59
  • Also, I don't work with CSV anymore. Maybe you should give only the path, without the file name, to spark.read. But then, be sure there is nothing else in that HDFS directory. – Juh_ Sep 29 '21 at 09:00
  • Also, check that Spark has proper access to the same HDFS cluster you put your file in. You can list HDFS content from the Spark app (and log it). See for example this: https://stackoverflow.com/q/33394884/1206998 – Juh_ Sep 29 '21 at 09:09
  • I already implemented the second solution, which is working fine. However, I was looking for an option I could use with spark-submit – VnS Sep 29 '21 at 10:01
  • I understand, but I would still recommend sticking to the HDFS solution. Maybe, if you want to ship some data with your code, you could package it with the resources. Hmm, but I don't know if Spark can read a dataset inside a deployed jar... Probably not – Juh_ Sep 30 '21 at 12:32

Replace ${csv_file} at the end of your spark-submit command with `basename ${csv_file}` (note the backticks, which make the shell substitute the command's output):

spark2-submit \
  ... \
  --files myprop.properties,${csv_file} \
  abc.jar `basename ${csv_file}`

basename strips the directory part from the full path, leaving only the file name:

$ basename /usr/usr1/foo.csv
foo.csv

That way, Spark will copy the file to the staging directory, and the driver program should be able to access it by its relative path, i.e. just the file name. If the cluster is configured to stage on HDFS, the executors will also have access to the file.
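If you end up reading the file in the driver (as mentioned in the comments below), a rough, untested sketch of one way to do it: read the staged file with plain Scala I/O on the driver and let Spark parse the lines (requires Spark 2.2+ for csv(Dataset[String]); the header option is only an example):

import scala.io.Source
import spark.implicits._

// args(0) is the bare file name produced by basename and staged by --files
// into the driver's working directory.
val lines = Source.fromFile(args(0)).getLines().toSeq.toDS()
val df = spark.read.option("header", "true").csv(lines)
df.show()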

Hristo Iliev
  • It's not working. Spark is still looking for the file in its root folder, although the file is being copied from the local Unix box to the HDFS location `user/usr1/.sparkStaging/application_XXXXX/`. However, `spark.read.format` looks for the file at path `user/usr1/`, and this is the root folder – VnS Sep 29 '21 at 08:42
  • @VnS It works for me, but I'm reading the file in the driver. Looks like the best and most portable solution is to upload the file to HDFS and pass the full HDFS path. – Hristo Iliev Sep 29 '21 at 15:10