How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

Question

Recently I wanted to do Spark Machine Learning Lab from Spark Summit 2016. Training video is here and exported notebook is available here.

The dataset used in the lab can be downloaded from UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The format is xlsx file with five sheets.

To use the data in the lab I needed to read all the sheets form the Excel file and to concatenate them into one Spark DataFrame. During the training they are using Databricks Notebook but I was using IntelliJ IDEA with Scala and evaluating the code in the console.

The first step was to save all the Excel sheets into separate xlsx files named sheet1.xlxs, sheet2.xlsx etc. and put them into sheets directory.

How to read all the Excel files and concatenate them into one Apache Spark DataFrame?

score 5 · Answer 1 · answered Mar 12 '17 at 14:38

For this I have used spark-excel package. It can be added to build.sbt file as : libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"

The code to execute in IntelliJ IDEA Scala Console was:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SparkSession, DataFrame}
import java.io.File

val conf = new SparkConf().setAppName("Excel to DataFrame").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val spark = SparkSession.builder().getOrCreate()

// Function to read xlsx file using spark-excel. 
// This code format with "trailing dots" can be sent to IJ Scala Console as a block.
def readExcel(file: String): DataFrame = spark.read.
  format("com.crealytics.spark.excel").
  option("location", file).
  option("useHeader", "true").
  option("treatEmptyValuesAsNulls", "true").
  option("inferSchema", "true").
  option("addColorColumns", "False").
  load()

val dir = new File("./data/CCPP/sheets")
val excelFiles = dir.listFiles.sorted.map(f => f.toString)  // Array[String]

val dfs = excelFiles.map(f => readExcel(f))  // Array[DataFrame]
val ppdf = dfs.reduce(_.union(_))  // DataFrame 

ppdf.count()  // res3: Long = 47840
ppdf.show(5)

Console output:

+-----+-----+-------+-----+------+
|   AT|    V|     AP|   RH|    PE|
+-----+-----+-------+-----+------+
|14.96|41.76|1024.07|73.17|463.26|
|25.18|62.96|1020.04|59.08|444.37|
| 5.11| 39.4|1012.16|92.14|488.56|
|20.86|57.32|1010.24|76.64|446.48|
|10.82| 37.5|1009.23|96.62| 473.9|
+-----+-----+-------+-----+------+
only showing top 5 rows

Not dure if it processes all tabs of an excel of not. Does it? — thebluephantom, Aug 11 '18 at 04:46

score 0 · Answer 2 · answered Dec 08 '17 at 19:58

We need spark-excel library for this, can be obtained from

https://github.com/crealytics/spark-excel#scala-api

clone the git project from above github link and build using "sbt package"
Using Spark 2 to run the spark-shell

spark-shell --driver-class-path ./spark-excel_2.11-0.8.3.jar --master=yarn-client

Import the necessary

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sqlContext = new SQLContext(sc)

Set excel doc path

val document = "path to excel doc"

Execute the below function for creating dataframe out of it

val dataDF = sqlContext.read
                          .format("com.crealytics.spark.excel")
                          .option("sheetName", "Sheet Name")
                          .option("useHeader", "true")
                          .option("treatEmptyValuesAsNulls", "false")
                          .option("inferSchema", "false")
                          .option("location", document)
                          .option("addColorColumns", "false")
                          .load(document)

That's all! now you can perform the Dataframe operation on the dataDF object.

Actually, I've to store a spark dataframe in a excel file format with few column as a read only nature? Can you guide me regarding the same? — kanishk kashyap, Apr 28 '22 at 19:37

score 0 · Answer 3 · edited Feb 22 '20 at 17:28

Hope this Spark Scala code might help.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if(path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark)

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "/mnt/{path to your file directory}"
val globp = "[^_]*"

val files = listFiles(root, globp)
val paths=files.toVector

Loop the vector to read multiple files:

for (path <- paths) {
     print(path.toString)

     val df= spark.read.
                   format("com.crealytics.spark.excel").
                   option("useHeader", "true").
                   option("treatEmptyValuesAsNulls", "false").
                   option("inferSchema", "false"). 
                   option("addColorColumns", "false").
                   load(path.toString)
}

How to read multiple Excel files and concatenate them into one Apache Spark DataFrame?

3 Answers3

Linked