Recently I wanted to work through the Spark Machine Learning Lab from Spark Summit 2016. The training video is here and the exported notebook is available here.
The dataset used in the lab can be downloaded from the UCI Machine Learning Repository. It contains a set of readings from various sensors in a gas-fired power generation plant. The data comes as a single xlsx file with five sheets.
To use the data in the lab, I needed to read all the sheets from the Excel file and concatenate them into one Spark DataFrame. The training uses a Databricks notebook, but I was working in IntelliJ IDEA with Scala and evaluating the code in the console.
The first step was to save each Excel sheet into a separate xlsx file named sheet1.xlsx, sheet2.xlsx, etc. and put them all into the sheets directory.
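Before involving Spark at all, it can help to check from the Scala console that the files ended up where expected. The snippet below is only a minimal sketch using the Java standard library: the helper name listSheetFiles is made up for illustration, and it assumes the sheets directory is relative to the working directory of the console.

```scala
import java.io.File

// Hypothetical helper (not part of the lab): returns the .xlsx files
// found directly in `dir`, sorted by file name.
def listSheetFiles(dir: File): Seq[File] =
  Option(dir.listFiles())          // listFiles() returns null if `dir` does not exist
    .map(_.toSeq)
    .getOrElse(Seq.empty)
    .filter(f => f.isFile && f.getName.toLowerCase.endsWith(".xlsx"))
    .sortBy(_.getName)

// Assumption: "sheets" is resolved against the current working directory.
val sheetFiles = listSheetFiles(new File("sheets"))
sheetFiles.foreach(f => println(f.getPath))
```

Sorting by name keeps the order predictable, although with ten or more sheets a plain string sort would place sheet10.xlsx before sheet2.xlsx.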
How to read all the Excel files and concatenate them into one Apache Spark DataFrame?