
I have set up Databricks Connect so that I can develop locally and get the IntelliJ goodies while leveraging the power of a big Spark cluster on Azure Databricks.
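
Roughly, my local entry point looks like this (a minimal sketch; the storage account and container names below are placeholders, not my real ones):

    import org.apache.spark.sql.SparkSession

    object Main {
      def main(args: Array[String]): Unit = {
        // Local SparkSession as described in the Databricks Connect tutorial;
        // the databricks-connect jars are expected to route execution to the remote cluster.
        val spark = SparkSession.builder()
          .master("local")
          .getOrCreate()

        // Placeholder abfss path - the real one points at my Azure Data Lake Gen2 account.
        val df = spark.read.csv("abfss://container@storageaccount.dfs.core.windows.net/blah.csv")
        df.show()
      }
    }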

When I want to read from or write to Azure Data Lake with `spark.read.csv("abfss://blah.csv")` I get the following:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: abfss
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:547)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:355)
    at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:618)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:467)

From this I had the impression that it wouldn't be a problem to reference Azure Data Lake locally, since the code is executed remotely. Apparently I was mistaken.

Does anyone have a solution to this problem?

  • It looks like you are not running Databricks Connect and are just executing pyspark locally. Do you have pyspark installed as well? – simon_dmorias Feb 28 '20 at 21:12
  • I am running Scala code actually. I don't have Spark installed locally. However, I do specify that I am starting a local SparkSession, as specified in the tutorial. Is that wrong? Also, what does it mean to run Databricks Connect? It's just a set of jars that are imported into the project, right? – zaxme Mar 01 '20 at 07:42
  • It's a Python library that must be installed and configured first. You can then import its jars instead of the usual Spark libraries. https://docs.databricks.com/dev-tools/databricks-connect.html – simon_dmorias Mar 01 '20 at 07:51
  • Yes. I have done everything as per this very tutorial. At least that’s what I think. What is your suggestion? – zaxme Mar 01 '20 at 12:28
  • @simon_dmorias if I execute the `databricks-connect test` command, I am able to see the output in the Spark UI on databricks. This means I am capable of connecting to the cluster. The question is how to bypass the parsing of the `abfss` path. I hope that it is possible. – zaxme Mar 02 '20 at 09:10

1 Answer


The reason for the problem was that I wanted to have the Spark sources available while still executing the workloads on Databricks. Unfortunately the databricks-connect jars don't contain sources, so I have to import the Spark sources into the project manually. And here is the rub - exactly like it says in the docs:

... If this is not possible, make sure that the JARs you add are at the front of the classpath. In particular, they must be ahead of any other installed version of Spark (otherwise you will either use one of those other Spark versions and run locally ...

I did just that.

[Screenshot: the databricks-connect jars moved to the top of the module dependencies in IntelliJ]
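
For sbt builds, one way to link against the databricks-connect jars instead of the usual org.apache.spark dependencies is a single line in build.sbt (a sketch, not what the screenshot shows; the path is a placeholder for the directory printed by `databricks-connect get-jar-dir`):

    // build.sbt (sketch): pick up the databricks-connect jars as unmanaged
    // dependencies instead of pulling the usual org.apache.spark artifacts.
    // Replace the placeholder with the directory printed by `databricks-connect get-jar-dir`.
    unmanagedBase := new java.io.File("/path/to/databricks-connect/jars")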

Now I am able to bake my cake and eat it!

The only problem is that if I add new dependencies I have to do this reordering once again.

  • Could you provide the list of the jars needed and where to download them? – Nabarun Dey Oct 06 '20 at 09:20
  • @NabarunDey If you open the docs link from above you will see instructions on how to get the jars needed. In short, you need to download the databricks-connect CLI. The version you download has to align with your Databricks cluster runtime, and it gives you a set of jars that can be accessed as per the link. You can find the command for adding them to your classpath in the same link. – zaxme Oct 07 '20 at 19:14