spark-shell error : No FileSystem for scheme: wasb

Question

We have an HDInsight cluster running in Azure, but it doesn't allow spinning up an edge/gateway node at cluster creation time. So I am creating the edge/gateway node myself by installing the following:

echo 'deb http://private-repo-1.hortonworks.com/HDP/ubuntu14/2.x/updates/2.4.2.0 HDP main' >> /etc/apt/sources.list.d/HDP.list
echo 'deb http://private-repo-1.hortonworks.com/HDP-UTILS-1.1.0.20/repos/ubuntu14 HDP-UTILS main'  >> /etc/apt/sources.list.d/HDP.list
echo 'deb [arch=amd64] https://apt-mo.trafficmanager.net/repos/azurecore/ trusty main' >> /etc/apt/sources.list.d/azure-public-trusty.list
gpg --keyserver pgp.mit.edu --recv-keys B9733A7A07513CAD
gpg -a --export 07513CAD | apt-key add -
gpg --keyserver pgp.mit.edu --recv-keys B02C46DF417A0893
gpg -a --export 417A0893 | apt-key add -
apt-get -y install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
apt-get -y install hadoop hadoop-hdfs hadoop-yarn hadoop-mapreduce hadoop-client openssl libhdfs0 liblzo2-2 liblzo2-dev hadoop-lzo phoenix hive hive-hcatalog tez mysql-connector-java* oozie oozie-client sqoop flume flume-agent spark spark-python spark-worker spark-yarn-shuffle

Then I copied the following directories: /usr/lib/python2.7/dist-packages/hdinsight_common/, /usr/share/java/, /usr/lib/hdinsight-datalake/, /etc/spark/conf/ and /etc/hadoop/conf/

But when I run spark-shell I get the following error:

java.io.IOException: No FileSystem for scheme: wasb

Here is the full stack trace: https://gist.github.com/anonymous/ebb6c9d71865c9c8e125aadbbdd6a5bc

I am not sure which package/jar is missing here.

Does anyone have a clue what I am doing wrong?

Thanks

I am looking for a solution to similar issues. Possible assist here: http://stackoverflow.com/questions/32264020/unable-to-connect-with-azure-blob-storage-with-local-hadoop — aaronsteers, Aug 23 '16 at 01:00

NicolasKittsteiner · Answer 1 · 2017-05-14T02:28:24.267

Another way to set up Azure Storage (wasb and wasbs URLs) in spark-shell:

Copy the azure-storage and hadoop-azure jars into the ./jars directory of the Spark installation.
Run spark-shell with the parameter --jars [a comma-separated list of paths to those jars]. Example:
```
$ bin/spark-shell --master "local[*]" --jars jars/hadoop-azure-2.7.0.jar,jars/azure-storage-2.0.0.jar
```

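Set the following on the Spark context's Hadoop configuration:

// Note: a commenter below reports that with Spark 2.3.1 / Hadoop 2.7.3 the "fs.azure" key
// did not work and "fs.wasbs.impl" had to be set to the same class instead.
sc.hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc.hadoopConfiguration.set("fs.azure.account.key.my_account.blob.core.windows.net", "my_key")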

Run a simple query:


sc.textFile("wasb://my_container@my_account_host/myfile.txt").count()

Enjoy :)

With these settings you can easily set up a Spark application, passing the parameters to the 'hadoopConfiguration' of the current Spark context.
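
The comma-separated --jars value from the steps above can also be assembled from whichever jar versions are actually present, rather than typed by hand. This is only a minimal sketch, assuming bash; `build_jars_arg` is a hypothetical helper name and not something Spark provides:

```shell
# Sketch: build the comma-separated value for spark-shell's --jars option.
# build_jars_arg is a hypothetical helper (an assumption, not part of Spark).
# Usage: build_jars_arg <directory> <pattern> [<pattern> ...]
build_jars_arg() {
  local dir pattern jar result
  dir="$1"
  shift
  result=""
  for pattern in "$@"; do
    # Let the shell expand the pattern to exact file names.
    for jar in "$dir"/$pattern; do
      # Skip patterns that matched nothing (the glob stays literal).
      [ -e "$jar" ] || continue
      # Join with commas, without a leading comma.
      result="${result:+$result,}$jar"
    done
  done
  printf '%s\n' "$result"
}

# Example invocation (needs a Spark installation, so it is left commented out):
#   bin/spark-shell --master "local[*]" \
#     --jars "$(build_jars_arg jars 'hadoop-azure-*.jar' 'azure-storage-*.jar')"
```

The patterns are passed in quotes so that they are expanded inside the helper, relative to the given directory, and not by the calling shell.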

edited May 14 '17 at 02:28

answered Jan 09 '17 at 20:12


My bad. I have to stop using Mac Notes to save code snippets :) – NicolasKittsteiner May 14 '17 at 02:30
Yep, much better now :) And a very good solution too, +1 from me. – Philip P. May 15 '17 at 14:50

`hadoopConfiguration.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")` did not work for me (Spark 2.3.1, Hadoop 2.7.3). I had to set `hadoopConfiguration.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")` instead. Now it's OK. – noleto Oct 30 '18 at 12:28
@noleto Thank you for writing that comment! – akki Jun 26 '19 at 02:13

CatNinja · Answer 2 · 2016-08-25T18:22:09.990

Hai Ning from Microsoft has written an excellent blog post on how to set up wasb on an Apache Hadoop installation.

Here is the summary:

1. Add hadoop-azure-*.jar and azure-storage-*.jar to the Hadoop classpath

1.1 Find the jars in your local installation. On an HDInsight cluster they are in the /usr/hdp/current/hadoop-client folder.

1.2 Update the HADOOP_CLASSPATH variable in hadoop-env.sh. Use the exact jar names, because the Java classpath doesn't support partial wildcards such as hadoop-azure-*.jar.

2. Update core-site.xml

<property>
        <name>fs.AbstractFileSystem.wasb.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasb</value>
</property>

<property>
        <name>fs.azure.account.key.my_blob_account_name.blob.core.windows.net</name> 
        <value>my_blob_account_key</value> 
</property>

<!-- optionally set the default file system to a container --> 
<property>
        <name>fs.defaultFS</name>          
        <value>wasb://my_container_name@my_blob_account_name.blob.core.windows.net</value>
</property>

See exact steps here: https://github.com/hning86/articles/blob/master/hadoopAndWasb.md
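
Step 1.2 above could be sketched roughly as follows: the shell expands the jar patterns, so only exact file names end up in HADOOP_CLASSPATH. This assumes bash and the HDInsight folder mentioned in step 1.1; `append_exact_jars` is a hypothetical helper name, not something Hadoop ships:

```shell
# Sketch for step 1.2: append the exact hadoop-azure and azure-storage jar
# paths to HADOOP_CLASSPATH. append_exact_jars is a hypothetical helper
# (an assumption for illustration, not part of Hadoop).
# Usage: append_exact_jars <directory containing the jars>
append_exact_jars() {
  local dir jar
  dir="$1"
  # The shell expands the globs here; Java itself would not.
  for jar in "$dir"/hadoop-azure-*.jar "$dir"/azure-storage-*.jar; do
    # Skip a pattern that matched nothing.
    [ -e "$jar" ] || continue
    # Append with ':' as separator, keeping any existing entries.
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$jar"
  done
  export HADOOP_CLASSPATH
}

# In hadoop-env.sh one might then call (left commented out here):
#   append_exact_jars /usr/hdp/current/hadoop-client
```

If the folder holds several versions of the same jar, all of them are appended, so it may be worth checking the result with `echo "$HADOOP_CLASSPATH"`.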

Thanks for the suggestion, but for my specific use case I can't use a client deployed through the cluster deployment. — roy, Jul 07 '16 at 21:05
