
I am currently running a Java application that uses Spark.

Everything works fine, except at the initialization of the SparkContext. At that moment, Spark tries to discover Hadoop on my system and throws an error, as I don't have Hadoop installed AND I DON'T WANT to install it:

2018-06-20 10:00:27.496 ERROR 4432 --- [           main] org.apache.hadoop.util.Shell             : Failed to locate the winutils binary in the hadoop binary path

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

Here is my SparkConf:

SparkConf cfg = new SparkConf();

cfg.setAppName("ScalaPython")
        .setMaster("local")
        .set("spark.executor.instances", "2");

return cfg;
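
For completeness, the context is then created from this configuration, roughly like this (a simplified sketch, not the exact wiring of my application; the error above appears at that point):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkBootstrap {
    public static void main(String[] args) {
        SparkConf cfg = new SparkConf()
                .setAppName("ScalaPython")
                .setMaster("local")
                .set("spark.executor.instances", "2");

        // The Hadoop/winutils lookup (and the stack trace above) is triggered
        // here, during context initialization, not while building the SparkConf.
        try (JavaSparkContext sc = new JavaSparkContext(cfg)) {
            System.out.println("Spark version: " + sc.version());
        }
    }
}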

My Spark dependencies:

<!-- Spark dependencies -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>

<dependency>
    <groupId>org.datasyslab</groupId>
    <artifactId>geospark_2.3</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.datasyslab</groupId>
    <artifactId>geospark-sql_2.3</artifactId>
    <version>1.1.0</version>
</dependency>

So, is there a way to disable Hadoop discovery programmatically (i.e. give the SparkConf a specific property), given that this error doesn't block Spark context creation (I can still use Spark functionality)?

N.B. It's for testing purposes.

Thanks for your answers!

Wykiki
  • Possible duplicate of [java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. spark Eclipse on windows 7](https://stackoverflow.com/questions/35652665/java-io-ioexception-could-not-locate-executable-null-bin-winutils-exe-in-the-ha) – philantrovert Jun 20 '18 at 09:19
  • I want to disable the Hadoop discovery, not find a way to trick Spark. – Wykiki Jun 21 '18 at 08:30

3 Answers


So the final "trick" I've used is a mix of sandev and Vipul answers.

Create a 'fake' winutils in your project root :

mkdir <java_project_root>/bin
touch <java_project_root>/bin/winutils.exe

Then, in your Spark configuration, provide the 'fake' HADOOP_HOME:

public SparkConf sparkConfiguration() {
    SparkConf cfg = new SparkConf();
    File hadoopStubHomeDir = new File(".");

    System.setProperty("hadoop.home.dir", hadoopStubHomeDir.getAbsolutePath());
    cfg.setAppName("ScalaPython")
            .setMaster("local")
            .set("spark.executor.instances", "2");

    return cfg;
}

Still, it's only a 'trick' to satisfy Hadoop discovery; it doesn't actually turn it off.
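
For a pure test setup, both steps can also be done from code, so nothing has to live outside the project. A rough sketch of that idea (assuming JUnit 5; the class and method names are only illustrative):

import java.io.File;
import java.io.IOException;
import org.junit.jupiter.api.BeforeAll;

public class SparkTestBase {

    @BeforeAll
    static void createWinutilsStub() throws IOException {
        // Create an empty bin/winutils.exe under the project root, then point
        // hadoop.home.dir at that root, so Hadoop's Shell lookup finds a file
        // and stops printing the stack trace. The stub itself does nothing.
        File hadoopHome = new File(".").getAbsoluteFile();
        File winutils = new File(hadoopHome, "bin/winutils.exe");
        winutils.getParentFile().mkdirs();   // no-op if bin/ already exists
        winutils.createNewFile();            // no-op if the stub already exists

        System.setProperty("hadoop.home.dir", hadoopHome.getAbsolutePath());
    }
}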

Wykiki

Spark just needs winutils, not a full Hadoop installation. Create a folder, for example C:\hadoop\bin, and put winutils.exe in it (C:\hadoop\bin\winutils.exe). Then define the environment variable HADOOP_HOME = C:\hadoop and append C:\hadoop\bin to the PATH variable. After that you can use Spark functionality.
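
If you go this route, you can sanity-check the layout from Java before creating the context; a small sketch (the paths are only examples):

import java.io.File;

public class WinutilsCheck {
    public static void main(String[] args) {
        String hadoopHome = System.getenv("HADOOP_HOME"); // expected: C:\hadoop
        if (hadoopHome == null) {
            System.out.println("HADOOP_HOME is not set");
            return;
        }
        // Hadoop looks for %HADOOP_HOME%\bin\winutils.exe, as the error message shows.
        File winutils = new File(hadoopHome, "bin\\winutils.exe");
        System.out.println(winutils + " exists: " + winutils.isFile());
    }
}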

sandevfares
  • So I've put a blank file as winutils.exe and it worked; no more ugly stack trace shown. But I'm looking for a *programmatic* solution, for example a property to give to Spark to disable Hadoop discovery. – Wykiki Jun 20 '18 at 09:50

It's not that Spark wants Hadoop to be installed; it just wants that particular file.

First, you have to run the code with spark-submit; are you doing that? Please stick to that as a first approach, since it yields the fewest library-related issues. Once you've done that, you can add this to your pom file to be able to run it directly from the IDE (I use IntelliJ, but it should work in Eclipse as well):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.5</version>
</dependency>

Second, if it still doesn't work:

  1. Download the winutils file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe.

  2. Create a new directory named bin inside some_other_directory and put winutils.exe in it.

  3. In your code, add this line before creating the context (a fuller sketch follows this list).

    System.setProperty("hadoop.home.dir", "full path to some_other_directory");
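
Putting the steps together: the property has to be set before the first SparkContext is created, as step 3 says. A rough sketch of that ordering (the directory path is a placeholder; replace it with your own):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class WinutilsBootstrap {
    public static void main(String[] args) {
        // Placeholder: replace with the full path to some_other_directory,
        // i.e. the directory that contains bin\winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\some_other_directory");

        SparkConf cfg = new SparkConf()
                .setAppName("ScalaPython")
                .setMaster("local");

        // The property above must already be set when this constructor runs.
        JavaSparkContext sc = new JavaSparkContext(cfg);
        System.out.println("Spark version: " + sc.version());
        sc.stop();
    }
}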

Pro tip: switch to Scala. Not that it's necessary, but that's where Spark feels most at home, and it wouldn't take you more than a day or two to get the basic programs running just right.

Vipul Rajan
  • If I understand your *First* option correctly, I must have a running Spark cluster, which is not the case here. The *Second* option works, as already noted in sandev's answer, but I have to configure things outside of the Java project, and I don't want that. As for Scala, I would be "getting rid" of Java that way, and it's currently not an option. – Wykiki Jun 20 '18 at 11:18