I am looking for a way to construct a DataFrame from an Excel file in Spark using Scala. I referred to the SO post below and tried the operation on an Excel sheet attached here.
How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?
Unfortunately, the modified code below didn't read all the columns in the Excel sheet.
val df = spark.read.format("com.crealytics.spark.excel")
.option("sheetName", "Sheet1") // Required
.option("useHeader", "false") // Required
.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
.option("inferSchema", "true") // Optional, default: false
.option("addColorColumns", "false") // Optional, default: false
.option("startColumn", 0) // Optional, default: 0
.option("endColumn", 99) // Optional, default: Int.MaxValue
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
.option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
.option("path", excelFile)
//.schema(customSchema)
.load()
+---+---+--------------+---+---+
|_c0|_c1| _c2|_c3|_c4|
+---+---+--------------+---+---+
| | |Test Profile 1| A|123|
| | |Test Profile 2| B| |
| | |Test Profile 3| C| |
| | |Test Profile 4| D| |
| | |Test Profile 5| E| |
| | |Test Profile 6| F| |
+---+---+--------------+---+---+
Am I missing anything here?
My objective is to read all the data from a sheet where values are randomly distributed, and then extract specific values from it. Some of the cells can be blank.
I can do it in Scala using Apache POI: get the required values, convert them into CSV, and then load that into a DataFrame.
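For reference, the POI fallback I mentioned looks roughly like this. It's only a sketch: it assumes the poi-ooxml dependency is on the classpath, and the file path and sheet name are placeholders.

```scala
import java.io.FileInputStream
import org.apache.poi.ss.usermodel.{DataFormatter, WorkbookFactory}
import scala.collection.JavaConverters._

object PoiReader {
  def main(args: Array[String]): Unit = {
    val fmt = new DataFormatter() // renders cells as the strings Excel displays
    val in  = new FileInputStream("profiles.xlsx") // hypothetical path
    try {
      val workbook = WorkbookFactory.create(in)
      val sheet    = workbook.getSheet("Sheet1")
      // Walk every physical row/cell. Note POI skips cells that were never
      // written, so blank cells must be handled explicitly if positions matter.
      for (row <- sheet.iterator().asScala) {
        val line = row.cellIterator().asScala
          .map(cell => fmt.formatCellValue(cell))
          .mkString(",")
        println(line) // one CSV line per row, ready to write out and reload
      }
      workbook.close()
    } finally in.close()
  }
}
```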
However, I am looking for a way to parse the Excel sheet directly into a DataFrame using Scala, iterate through the DataFrame rows, and apply conditions to get the required rows/columns.
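In other words, given the column names Spark assigned above (_c0 … _c4), I'd like to end up writing something along these lines. This is a sketch: the sample rows stand in for the Excel-backed df from the snippet above, and the filter values are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("excel-filter")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Stand-in for the Excel-backed df, using two of the rows shown in the output.
val df = Seq(
  ("", "", "Test Profile 1", "A", "123"),
  ("", "", "Test Profile 2", "B", "")
).toDF("_c0", "_c1", "_c2", "_c3", "_c4")

// Keep only the rows matching a condition, then project the columns of
// interest. "A" and the non-empty check are placeholder conditions.
val required = df
  .filter(col("_c3") === "A" && col("_c4") =!= "")
  .select("_c2", "_c4")

required.show()
```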
p.s. Sorry, I didn't know how to attach an Excel file from my local machine.
Thanks!