I'm currently working on connecting PySpark to an Azure blob and am having trouble getting the two connected and running. I have installed both required jar files (hadoop-azure-3.2.0-javadoc.jar and azure-storage-8.3.0-javadoc.jar) and set them in my SparkConf via SparkConf().setAll(). Once I start the session I run:
spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.acctname.blob.core.windows.net", "key")
sdf = spark.read.parquet("wasbs://container@acctname.blob.core.windows.net/")
but it always returns
java.io.IOException: No FileSystem for scheme: wasbs
Any thoughts?
I've already worked through the following:
https://github.com/Azure/mmlspark/issues/456
PySpark java.io.IOException: No FileSystem for scheme: https
spark-shell error : No FileSystem for scheme: wasb
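From those threads I also wondered whether the wasbs scheme needs its implementation class registered explicitly, so the error would go away once Hadoop knows which FileSystem class backs wasbs. This is a guess on my part, not something I've confirmed (the $Secure suffix is the SSL variant of NativeAzureFileSystem):

```python
# Guess: explicitly map the wasb/wasbs schemes to their FileSystem classes
# before reading. Run against an existing `spark` session as above.
spark._jsc.hadoopConfiguration().set(
    "fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set(
    "fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure")
```
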
import findspark
findspark.init('dir/spark/spark-2.4.0-bin-hadoop2.7')
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
conf = SparkConf().setAll([
    (u'spark.submit.pyFiles', u'/dir/.ivy2/jars/hadoop-azure-3.2.0-javadoc.jar,/dir/.ivy2/jars/azure-storage-8.3.0-javadoc.jar,/dir/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar,/dir/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar,/dir/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar,/dir/.ivy2/jars/joda-time_joda-time-2.3.jar,/dir/.ivy2/jars/org.joda_joda-convert-1.2.jar,/dir/.ivy2/jars/org.scala-lang_scala-reflect-2.11.12.jar,/dir/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar'),
    (u'spark.jars', u'file:///dir/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar,file:///dir/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar,file:///dir/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar,file:///dir/.ivy2/jars/joda-time_joda-time-2.3.jar,file:///dir/.ivy2/jars/org.joda_joda-convert-1.2.jar,file:///dir/.ivy2/jars/org.scala-lang_scala-reflect-2.11.12.jar,file:///dir/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar'),
    (u'spark.app.id', u'local-1553969107475'),
    (u'spark.driver.port', u'38809'),
    (u'spark.executor.id', u'driver'),
    (u'spark.app.name', u'PySparkShell'),
    (u'spark.driver.host', u'test-VM'),
    (u'spark.sql.catalogImplementation', u'hive'),
    (u'spark.rdd.compress', u'True'),
    (u'spark.serializer.objectStreamReset', u'100'),
    (u'spark.master', u'local[*]'),
    (u'spark.submit.deployMode', u'client'),
    (u'spark.repl.local.jars', u'file:///dir/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar,file:///dir/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar,file:///dir/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar,file:///dir/.ivy2/jars/joda-time_joda-time-2.3.jar,file:///dir/.ivy2/jars/org.joda_joda-convert-1.2.jar,file:///dir/.ivy2/jars/org.scala-lang_scala-reflect-2.11.12.jar,file:///dir/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar'),
    (u'spark.files', u'file:///dir/.ivy2/jars/com.twitter_jsr166e-1.1.0.jar,file:///dir/.ivy2/jars/io.netty_netty-all-4.0.33.Final.jar,file:///dir/.ivy2/jars/commons-beanutils_commons-beanutils-1.9.3.jar,file:///dir/.ivy2/jars/joda-time_joda-time-2.3.jar,file:///dir/.ivy2/jars/org.joda_joda-convert-1.2.jar,file:///dir/.ivy2/jars/org.scala-lang_scala-reflect-2.11.12.jar,file:///dir/.ivy2/jars/commons-collections_commons-collections-3.2.2.jar,file:///dir/.ivy2/jars/azure-storage-8.3.0-javadoc.jar,file:///dir/.ivy2/jars/hadoop-azure-3.2.0-javadoc.jar'),
    (u'spark.ui.showConsoleProgress', u'true')])
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure.account.key.acctname.blob.core.windows.net", "key")
sdf = spark.read.parquet("wasbs://container@acctname.blob.core.windows.net/")
Returns
java.io.IOException: No FileSystem for scheme: wasbs
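One thing I'm not sure how to verify is whether these jars even contain the filesystem classes, since "No FileSystem for scheme: wasbs" suggests the class is never found. Here's a plain-Python sanity check I put together (the path in the comment is just my jar location) to see whether a jar ships any compiled classes at all:

```python
import zipfile

def jar_has_classes(jar_path):
    """Return True if the jar archive contains any compiled .class files.

    A jar that holds only documentation (HTML pages) would return False.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return any(name.endswith(".class") for name in jar.namelist())

# Substitute the real jar location, e.g.:
# jar_has_classes("/dir/.ivy2/jars/hadoop-azure-3.2.0-javadoc.jar")
```

If this returns False for the hadoop-azure jar, that would presumably explain why the wasbs FileSystem class can't be loaded.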