13

I'm very new to the concepts of Big Data and related areas, so sorry if I've made some mistakes or typos.

I would like to understand Apache Spark and use it only on my computer, in a development / test environment. As Hadoop includes HDFS (Hadoop Distributed File System) and other software that only matters for distributed systems, can I discard that? If so, where can I download a version of Spark that doesn't need Hadoop? Here I can find only Hadoop-dependent versions.

What do I need:

  • Run all of Spark's features without problems, but on a single computer (my home computer).
  • Everything that I build on my computer with Spark should run on a future cluster without problems.

Is there any reason to use Hadoop or any other distributed file system for Spark if I will run it on my computer only for testing purposes?

Note that "Can apache spark run without hadoop?" is a different question from mine, because I do want run Spark in a development environment.


2 Answers

14

Yes, you can install Spark without Hadoop. Go through the Spark official documentation: http://spark.apache.org/docs/latest/spark-standalone.html

Rough steps:

  1. Download a precompiled Spark package, or download the Spark source and build it locally
  2. Extract the tar archive
  3. Set the required environment variables
  4. Run the start scripts (see the example commands after this list).
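
A rough sketch of those steps on Linux, assuming the spark-2.2.0-bin-hadoop2.7 package from the link below has already been downloaded; the local[*] master and the standalone start scripts are standard Spark, and the paths are just examples:

# 1-2. Extract the downloaded tarball (file name from the link below)
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz

# 3. Point SPARK_HOME at the extracted directory and put its bin/ on the PATH
export SPARK_HOME="$PWD/spark-2.2.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"

# 4a. Either start a single-machine standalone master and worker ...
"$SPARK_HOME/sbin/start-master.sh"
"$SPARK_HOME/sbin/start-slave.sh" spark://$(hostname):7077

# 4b. ... or skip the cluster entirely and use local mode for development
"$SPARK_HOME/bin/spark-shell" --master 'local[*]'

In local mode nothing talks to HDFS, so plain local file paths work for testing, and the same code can later be pointed at a real cluster by changing only the master URL.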

Spark (without Hadoop) - available on the Spark download page. URL: https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz

If this URL does not work, then try to get it from the Spark download page.

  • Can you please be more specific about the "required environment variable"? I assume it's HADOOP_HOME_DIR, and I would like to know how to set it. I have successfully developed on Windows by downloading HadoopUtils and having HADOOP_HOME_DIR point there, but how should I set it on Linux? I am working on one Linux server where Hadoop is not installed. There is a Hadoop installation on another server. How should I set HADOOP_HOME_DIR? – radumanolescu May 30 '19 at 13:55
  • But it is a contradiction: "spark-2.2.0-bin-hadoop2.7.tgz" is **bin-hadoop2** and there is also a **bin-without-hadoop.tgz** option, so something is wrong here. – Peter Krauss Sep 10 '19 at 19:08
0

This is not a proper answer to the original question. Sorry, it is my fault.


If someone wants to run Spark using the "without Hadoop" distribution tar.gz,

there is an environment variable to set. This spark-env.sh worked for me:

#!/bin/sh
# Add the jars from a separately installed Hadoop to Spark's classpath
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
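
For reference, a minimal sketch of how that file is typically put in place, assuming Hadoop is already installed separately (so the hadoop command is on the PATH) and the standard $SPARK_HOME/conf layout that ships with Spark:

# spark-env.sh lives in Spark's conf directory; a template ships with Spark
cp "$SPARK_HOME/conf/spark-env.sh.template" "$SPARK_HOME/conf/spark-env.sh"

# Append the classpath export so the "without Hadoop" build can find Hadoop's jars
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> "$SPARK_HOME/conf/spark-env.sh"

# Verify by starting a local-mode shell
"$SPARK_HOME/bin/spark-shell" --master 'local[*]'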
  • So in other words, Spark actually requires Hadoop to run, and Hadoop can be installed either separately or downloaded bundled with Spark, right? – Yar Dec 29 '22 at 14:12
  • Yes, Spark actually requires the Hadoop libraries to run; Spark has a dependency on the Hadoop libraries. Yes, the Hadoop libraries can be installed separately. Yes, the Hadoop libraries are bundled in the "Spark with Hadoop" version. And yes, Spark can run without a Hadoop cluster. – ruseel Mar 28 '23 at 05:38