
I installed the package spark-2.0.2-bin-without-hadoop.tgz on a local DEV box, but failed to run it, as shown below:

$ ./bin/spark-shell
NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream

$ ./sbin/start-master.sh
NoClassDefFoundError: org/slf4j/Logger

Did I misinterpret the statement below that Spark can run without Hadoop?

"Do I need Hadoop to run Spark? No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode."

sof

2 Answers


For the first issue concerning FSDataInputStream, as noted in this Stack Overflow response https://stackoverflow.com/a/31331528,

the "without Hadoop" is a bit misleading in that this build of Spark is not tied to a specific build of Hadoop as opposed to not running without it. To run Spark using the "without Hadoop" version, you should bind it to your own Hadoop distribution.

For the second issue concerning the missing SLF4J classes, as noted in this Stack Overflow response https://stackoverflow.com/a/39277696, you can add the SLF4J jar to Spark's classpath yourself, or, if you already have a Hadoop distribution installed, you should already have it available.

That said, you can download Apache Spark pre-built with Hadoop and still not use Hadoop itself. That package contains all the necessary jars, and you can tell Spark to read from the local file system, e.g. by using a file:// URI when accessing your data (instead of HDFS).
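
For example (a sketch; the data path is hypothetical), with the pre-built-with-Hadoop package you can read local files directly from spark-shell:

$ ./bin/spark-shell --master local[*]
scala> val lines = sc.textFile("file:///tmp/sample.txt")   // local filesystem, no HDFS involved
scala> lines.count()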

Denny Lee
  • I realized later, it wasn't an exact quote :) feel free to rollback – OneCricketeer Nov 23 '16 at 05:12
  • Nah - its close enough to warrant a quote, eh?! :) – Denny Lee Nov 23 '16 at 05:13
  • The `Spark` runtime needs to load the dependent `Hadoop` classes first, even though it doesn't use them and only accesses the local filesystem. Is that architecturally somewhat short-sighted, or merely a minor implementation smell? – sof Nov 23 '16 at 17:03
  • Neither - the reason to continue including the Hadoop jars as a dependency is that the majority of Spark users use cloud file storage (which has dependencies on Hadoop) or HDFS itself as their persistent storage layer. – Denny Lee Nov 23 '16 at 17:56

Yes. From the Spark downloads page, as of today, the following package types are available for Spark 3.1.1:

  1. Pre-built for Apache Hadoop 2.7

This version of Spark (spark-3.1.1-bin-hadoop2.7.tgz) runs with Hadoop 2.7.

  2. Pre-built for Apache Hadoop 3.2 and later

This version of Spark (spark-3.1.1-bin-hadoop3.2.tgz) runs with Hadoop 3.2 and later.

  3. Pre-built with user-provided Apache Hadoop

This version of Spark (spark-3.1.1-bin-without-hadoop.tgz) runs with any user-provided version of Hadoop.

From the name of the last package (spark-3.1.1-bin-without-hadoop.tgz), it appears that we will need Hadoop for this package (i.e., 3.) and not for the other two (i.e., 1. and 2.). However, the naming is ambiguous: we need Hadoop only if we want to support HDFS and YARN. In standalone mode, Spark can run in a truly distributed setting (or with all its daemons on a single machine) without Hadoop.

For 1. and 2., you can run Spark without a Hadoop installation, because the core Hadoop libraries come bundled with the Spark pre-built binary, so spark-shell works without throwing any exceptions; for 3., Spark will not work unless a Hadoop installation is provided (as 3. ships without the Hadoop runtime).

In essence,

  • we will need to install Hadoop separately in all three cases (1., 2., and 3.) if we want to support HDFS and YARN
  • if we don't want to install Hadoop, we can use Spark pre-built with Hadoop and run it in standalone mode (see the sketch after this list)
  • if we want to use an arbitrary version of Hadoop with Spark, then 3. should be used together with a separate Hadoop installation
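
A minimal standalone-mode sketch, assuming the pre-built-with-Hadoop package (e.g. spark-3.1.1-bin-hadoop3.2.tgz) unpacked on a single machine; the exact master URL is printed in the master log and on its web UI (port 8080 by default):

$ ./sbin/start-master.sh                                # start the standalone master daemon
$ ./sbin/start-worker.sh spark://$(hostname):7077       # start a worker and register it with the master
$ ./bin/spark-shell --master spark://$(hostname):7077   # run the shell against the standalone cluster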

For more information, refer to this excerpt from the docs:

There are two variants of Spark binary distributions you can download. One is pre-built with a certain version of Apache Hadoop; this Spark distribution contains built-in Hadoop runtime, so we call it with-hadoop Spark distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution doesn’t contain a built-in Hadoop runtime, it’s smaller, but users have to provide a Hadoop installation separately. We call this variant no-hadoop Spark distribution. For with-hadoop Spark distribution, since it contains a built-in Hadoop runtime already, by default, when a job is submitted to Hadoop Yarn cluster, to prevent jar conflict, it will not populate Yarn’s classpath into Spark ...
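
As a sketch of overriding that behaviour per submission (the application class and jar are hypothetical; the spark.yarn.populateHadoopClasspath property comes from the Spark-on-YARN configuration docs):

$ ./bin/spark-submit \
    --master yarn \
    --conf spark.yarn.populateHadoopClasspath=true \
    --class com.example.MyApp \
    myapp.jar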

sherminator35