If you're using spark-submit to run the application in cluster mode, it accepts a --files flag that is used to ship files from the driver node down to the workers. I believe the reason you were able to run in local mode is that the driver and worker run on the same machine, whereas in cluster mode the driver and workers are likely on separate machines, so Spark needs to know which files to send over to the worker nodes. The following flags are available, as described in the book Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (a combined example follows the list):
--master
Indicates the cluster manager to connect to. The options for this flag are described in Table 7-1 of the book.
--deploy-mode
Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode spark-submit will run your driver on the same machine where spark-submit is itself being invoked. In cluster mode, the driver will be shipped to execute on a worker node in the cluster. The default is client mode.
--class
The “main” class of your application if you’re running a Java or Scala program.
--name
A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars
A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files
A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files
A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory
The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory
The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
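To put the flags together, below is a minimal sketch of a cluster-mode application that reads a file shipped with --files. The spark-submit command in the comment, the cluster URL, class name, and file names are placeholders of my own, not taken from your question:

    // Submitted with something like (all paths/URLs are placeholders):
    //   spark-submit --master spark://master:7077 --deploy-mode cluster \
    //     --class com.example.FilesExample --name files-example \
    //     --files /local/path/lookup.txt files-example.jar
    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    object FilesExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("files-example"))
        // SparkFiles.get resolves the local copy of a file distributed via
        // --files; it works on the driver and inside executor tasks alike.
        val localPath = SparkFiles.get("lookup.txt")
        val lookup = scala.io.Source.fromFile(localPath).getLines().toSet
        println(s"loaded ${lookup.size} lookup entries")
        sc.stop()
      }
    }

Note that --files places the file in each node's working directory, which is why it is looked up by bare file name rather than by its original path.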
Update
I assumed that Kiran has a Hadoop setup (as he mentioned elsewhere) and was not able to make the program read from HDFS programmatically. If that is not the case, please ignore this answer.
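In case reading from HDFS programmatically is the missing piece, here is a minimal sketch; the namenode host, port, and path are placeholders that depend on your cluster's configuration:

    import org.apache.spark.{SparkConf, SparkContext}

    object HdfsReadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("hdfs-read"))
        // A fully qualified hdfs:// URI; if fs.defaultFS is configured in
        // core-site.xml, a bare path like "/user/kiran/input.txt" also works.
        val lines = sc.textFile("hdfs://namenode:8020/user/kiran/input.txt")
        println(lines.count())
        sc.stop()
      }
    }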