
I have a requirement to load various files (of different types) into a Spark DataFrame. Are all these file formats supported by Databricks? If yes, where can I get the list of options supported for each file format?

delimited
csv
parquet
avro
excel
json

Thanks

Jacek Laskowski
Garipaso

2 Answers


I don't know exactly what Databricks offers out of the box (pre-installed), but you can do some reverse engineering using the org.apache.spark.sql.execution.datasources.DataSource object, which is (quoting the scaladoc):

The main class responsible for representing a pluggable Data Source in Spark SQL

Data sources usually register themselves using the DataSourceRegister interface (and use shortName to provide their alias):

Data sources should implement this trait so that they can register an alias to their data source.

Reading further in the scaladoc of DataSourceRegister, you'll find that:

This allows users to give the data source alias as the format type over the fully qualified class name.

So, YMMV.

Unless you find an authoritative answer on Databricks, you may want to (follow DataSource.lookupDataSource and) use Java's ServiceLoader.load method to find all registered implementations of the DataSourceRegister interface.

// start a Spark application with an external module that registers its own data source
$ ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0-SNAPSHOT

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

val formats = ServiceLoader.load(classOf[DataSourceRegister])
formats.asScala.map(_.shortName).foreach(println)
orc
hive
libsvm
csv
jdbc
json
parquet
text
console
socket
kafka
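The same ServiceLoader mechanism can be demonstrated without Spark at all. As a minimal, Spark-free sketch, the service interface below is the JDK-standard java.nio.file.spi.FileSystemProvider (not Spark's DataSourceRegister); the JDK's own zip filesystem provider registers itself exactly the same way:

```scala
import java.util.ServiceLoader
import java.nio.file.spi.FileSystemProvider
import scala.jdk.CollectionConverters._

object ServiceLoaderDemo {
  // ServiceLoader scans META-INF/services entries (and module
  // `provides` declarations) for registered implementations --
  // the same lookup Spark performs for DataSourceRegister.
  val providers: List[FileSystemProvider] =
    ServiceLoader.load(classOf[FileSystemProvider]).asScala.toList

  def main(args: Array[String]): Unit =
    // On a standard JDK this prints at least "jar" (the zipfs provider).
    providers.map(_.getScheme).foreach(println)
}
```

Swap FileSystemProvider for DataSourceRegister inside spark-shell and you get the listing above.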

Where can I get the list of options supported for each file format?

That's not possible, as there is no common API (like the one in Spark MLlib) for defining options. Every format does this on its own... unfortunately, so your best bet is to read the documentation or (more authoritatively) the source code of each format.
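To make the "no common API" point concrete, here is a sketch of how options are passed: plain string key/value pairs that each source interprets on its own. The option names below (header, inferSchema, multiLine) are documented for the built-in CSV and JSON sources; spark is the SparkSession available in spark-shell, and the file paths are placeholders:

```scala
// Options are untyped string pairs; nothing checks at compile time
// that a given option exists for a given format.
val csv = spark.read
  .format("csv")
  .option("header", "true")       // CSV-specific: first line is the header
  .option("inferSchema", "true")  // CSV-specific: scan the data to guess types
  .load("people.csv")

val json = spark.read
  .format("json")
  .option("multiLine", "true")    // JSON-specific: a record may span lines
  .load("people.json")
```

Misspelled or unsupported options are typically silently ignored, which is another reason to check each source's documentation (or its options class in the Spark source tree, e.g. CSVOptions).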


All these formats except Excel are supported by Spark itself (delimited files are handled by the csv source via its sep, a.k.a. delimiter, option). For Excel files you can use the third-party spark-excel library.
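As a hedged sketch of the spark-excel route (exact package coordinates and option names depend on the library version; check its README, since e.g. the header option has been renamed across releases):

```scala
// Start spark-shell with the third-party package on the classpath, e.g.:
// $ ./bin/spark-shell --packages com.crealytics:spark-excel_2.11:<version>

val excel = spark.read
  .format("com.crealytics.spark.excel")  // fully qualified name; newer versions also register an "excel" alias
  .option("useHeader", "true")           // "header" in newer releases
  .load("report.xlsx")
```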