I'm not sure what version to download.
That choice will also be guided by the existing code you're using, the features you require, and your tolerance for bugs.
I want to run a Spark cluster in standalone mode using AWS instances.
Have you considered simply running Apache Spark on Amazon EMR? See also "How can I run Spark on a cluster?" in Spark's FAQ, and its pointer to Spark's EC2 scripts.
This means I need to access S3, but not HDFS.
One does not imply the other. You can run a Spark cluster on EC2 instances perfectly fine and never touch S3. While many examples use S3 access through the Hadoop library's out-of-the-box S3 "fs" drivers, note that there are now three different access methods (s3, s3n, and s3a); configure whichever fits your setup.
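For instance, here is a minimal sketch of reading from S3 over the newer s3a client. It assumes the hadoop-aws module is on the classpath; the bucket name, path, and credentials are placeholders (on EC2, an IAM instance role is usually preferable to literal keys):

    import org.apache.spark.{SparkConf, SparkContext}

    object S3ReadExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("S3ReadExample"))

        // Credentials for the s3a client; these keys are placeholders.
        sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
        sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

        // "my-bucket" and the path are hypothetical.
        val lines = sc.textFile("s3a://my-bucket/path/to/data.txt")
        println(lines.count())

        sc.stop()
      }
    }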
However, your choice of libraries to load will depend on where your data is. Spark can access any filesystem supported by Hadoop, and there are several to choose from.
Is your data even in files? Depending on your application and where your data lives, you may only need a DataFrame over SQL, Cassandra, or another source!
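As a sketch of the no-files case: assuming the DataStax spark-cassandra-connector is on the classpath (and a keyspace and table that are made-up names here), a Cassandra table can be loaded directly into a DataFrame:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object CassandraReadExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("CassandraReadExample")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Load a Cassandra table straight into a DataFrame -- no filesystem involved.
        // "my_keyspace" and "my_table" are hypothetical.
        val df = sqlContext.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
          .load()

        df.show(10)
        sc.stop()
      }
    }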
However, whenever I start it up, I get this message:

    WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Is this a problem? Do I need Hadoop?
Not a problem. The warning just means Spark is falling back to a non-optimal, pure-Java implementation of some Hadoop routines. Others have asked this question, too.
In general, it sounds like you don't have any application needs right now, so you don't have any dependencies. Dependencies are what would drive different configurations such as access to S3, HDFS, etc.
I can run it in local mode, such as the wordcount example.
So, you're good?
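For reference, the whole word count example in local mode is only a few lines. This sketch assumes a local input file (input.txt is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // "local[*]" runs Spark in-process on all cores -- no cluster required.
        val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
        val sc = new SparkContext(conf)

        val counts = sc.textFile("input.txt")   // placeholder input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }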
UPDATE
I've edited the original post.
My data will be coming from a Kafka stream. ... My understanding is that ... my persistent workers need to perform checkpoint().
Yes, the Direct Kafka approach is available as of Spark 1.3 and, per that article, uses checkpoints. These require a "fault-tolerant, reliable file system (e.g., HDFS, S3, etc.)". See the Spark Streaming + Kafka Integration Guide for your version for specific caveats.
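A minimal sketch of the direct approach with checkpoint recovery, for the Spark 1.x streaming API; the broker, topic, and checkpoint location (an S3 path here) are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaDirectExample {
      // Made-up checkpoint location on a fault-tolerant store.
      val checkpointDir = "s3a://my-bucket/spark-checkpoints"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("KafkaDirectExample")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)

        // Broker list and topic are placeholders.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val topics = Set("events")

        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

        // Each record is a (key, value) pair; count values per batch.
        stream.map(_._2).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, recover from the checkpoint instead of starting fresh.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }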
So why [do I see the Hadoop warning message]?
The Spark download ships with only a limited set of Hadoop client libraries. A fully-configured Hadoop installation additionally includes platform-specific native binaries for certain packages, and these get used if available. To use them, point Spark at that installation (its classpath and native library path); otherwise the loader falls back to the less performant built-in Java versions.
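As a sketch, the relevant conf/spark-env.sh lines might look like this, assuming a Hadoop installation under /opt/hadoop (adjust paths to your machines):

    # Hypothetical install location -- adjust for your cluster.
    export HADOOP_HOME=/opt/hadoop
    # Native libraries (the ones the warning is about) load via the library path:
    export LD_LIBRARY_PATH="$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH"
    # And the Hadoop client jars via Spark's distribution classpath:
    export SPARK_DIST_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"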
Depending on your configuration, you may be able to take advantage of a fully-configured Hadoop or HDFS installation. You mention wanting to use your existing, persistent EC2 instances rather than something new. There's a tradeoff between S3 and HDFS: S3 is a separate resource (more cost), but it survives when your instances are offline, so you can take compute down and keep the persisted storage. On the other hand, S3 may suffer from higher latency than HDFS (you already have the machines, so why not run a filesystem over them?), and it does not behave like a filesystem in all cases. Microsoft describes a similar tradeoff for choosing Azure storage vs. HDFS when using HDInsight, for example.