
I am looking for a way to construct a DataFrame from an Excel file in Spark using Scala. I referred to the SO post below and tried the same operation on the attached Excel sheet.

Excel sheet sample

How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?

Unfortunately, the modified code below didn't read all the columns in the Excel sheet.

val df = spark.read.format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1") // Required
      .option("useHeader", "false") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("startColumn", 0) // Optional, default: 0
      .option("endColumn", 99) // Optional, default: Int.MaxValue
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
      .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
      .option("path", excelFile)
      //.schema(customSchema)
      .load()

+---+---+--------------+---+---+
|_c0|_c1|           _c2|_c3|_c4|
+---+---+--------------+---+---+
|   |   |Test Profile 1|  A|123|
|   |   |Test Profile 2|  B|   |
|   |   |Test Profile 3|  C|   |
|   |   |Test Profile 4|  D|   |
|   |   |Test Profile 5|  E|   |
|   |   |Test Profile 6|  F|   |
+---+---+--------------+---+---+

Am I missing anything here?

My objective is to get all the data from a sheet in which the values are randomly distributed, and then pull specific values out of it. Some of the cells can be blank.

I can do this in Scala using Apache POI: get the required values, convert them into a CSV, and then load that into a DataFrame.
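For reference, that workaround looks roughly like this (a simplified sketch of my POI code, not the full version):

import java.io.FileInputStream
import org.apache.poi.xssf.usermodel.XSSFWorkbook
import scala.collection.JavaConverters._

val workbook = new XSSFWorkbook(new FileInputStream(excelFile))
val sheet = workbook.getSheetAt(0)
for (row <- sheet.iterator().asScala; cell <- row.cellIterator().asScala) {
  // collect the cell values I need here, write them out as CSV,
  // and then read the CSV back with spark.read.csv
  println(s"(${cell.getRowIndex}, ${cell.getColumnIndex}) -> $cell")
}
workbook.close()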

However, I am looking for a way to parse the Excel sheet directly into a DataFrame using Scala, iterate through its rows, and apply conditions to get the required rows/columns.

P.S. Sorry, I didn't know how to attach an Excel file from my local machine.

Thanks!

Dwarrior

1 Answer


If you study the source code of the crealytics spark-excel library, you will find that the number of columns is determined by the first row that contains values. The first such row in your Excel file has five columns, so the last column, which holds values in later rows but not in that first row, is dropped.

The solution is to define a custom schema and pass it to the framework:

import org.apache.spark.sql.types._

val customSchema = StructType(Seq(
  StructField("col0", StringType, true),
  StructField("col1", StringType, true),
  StructField("col2", StringType, true),
  StructField("col3", StringType, true),
  StructField("col4", IntegerType, true),
  StructField("col5", IntegerType, true)
))
val df = spark.read.format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1") // Required
  .option("useHeader", "false") // Required
  .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
  .option("inferSchema", "true") // Optional, default: false
  .option("addColorColumns", "false") // Optional, default: false
  .option("startColumn", 0) // Optional, default: 0
  .option("endColumn", 99) // Optional, default: Int.MaxValue
  .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
  .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
  .option("path", excelFile)
  .schema(customSchema)
  .load()

and you should get the following DataFrame:

+----+----+--------------+----+----+----+
|col0|col1|col2          |col3|col4|col5|
+----+----+--------------+----+----+----+
|null|null|Test Profile 1|A   |123 |null|
|null|null|Test Profile 2|B   |null|null|
|null|null|Test Profile 3|C   |null|345 |
|null|null|Test Profile 4|D   |null|null|
|null|null|Test Profile 5|E   |null|null|
|null|null|Test Profile 6|F   |null|null|
+----+----+--------------+----+----+----+
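Once the DataFrame is loaded with the explicit schema, extracting specific values is ordinary DataFrame code. A small sketch, using the column names from customSchema above:

import org.apache.spark.sql.functions.col

// e.g. fetch the letter code stored next to "Test Profile 3"
val code = df
  .filter(col("col2") === "Test Profile 3")
  .select("col3")
  .head()
  .getString(0)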

I hope the answer is helpful.

Ramesh Maharjan
  • Hi Ramesh, it does give me a direction to proceed. So, if the scenario is like this: cell C1 has the value "profiles" and all the profiles listed above sit underneath it, should I use zipWithIndex to treat the first row as the header and all rows underneath it as the actual data to work with? Is that correct, or can I use the methods defined in the source code you linked? I am still going through the source code to understand whether I can leverage its functions for iterating through cells and creating new cells. – Dwarrior Jun 11 '18 at 15:02
  • What I can suggest is to use the process I mentioned, and then later you can filter out the row with that "profiles" word and use that row for renaming the columns (see the sketch after these comments). – Ramesh Maharjan Jun 11 '18 at 15:09
  • @RameshMaharjan I'm working on a task where I have to store a Spark DataFrame in an Excel file format with a few columns as read-only. Do you have a better approach for this? – kanishk kashyap Apr 28 '22 at 20:35
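For what it's worth, a rough sketch of the approach Ramesh suggests in the comments might look like this (untested; it assumes the header row is marked by the literal "profiles" in col2 and reuses the column names from the answer's schema):

import org.apache.spark.sql.functions.col

// Sketch: treat the row whose col2 cell reads "profiles" as the header,
// use its cell values as the new column names, and keep the rest as data.
val headerRow = df.filter(col("col2") === "profiles").head()
val newNames = headerRow.toSeq.zipWithIndex.map {
  case (value, i) => Option(value).map(_.toString).getOrElse(s"col$i")
}
val renamed = df
  .filter(col("col2").isNull || col("col2") =!= "profiles")
  .toDF(newNames: _*)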