How to read an Excel (.xlsx) file into a PySpark DataFrame

Question

I have an Excel (.xlsx) file in the data lake and need to read it into a PySpark DataFrame. I do not want to use the pandas library.

I installed the crealytics spark-excel library on my Databricks cluster and tried the code below:

dbutils.fs.cp('/path/to/excel/file', '/FileStore/tables/', True)

# Spark takes the DBFS path without the /dbfs prefix
path = '/FileStore/tables/myfile1.xlsx'

excel_df = (spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(path))

I'm getting the following error:

java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.byteArray(I)[B

Am I missing something, or is there another approach I can try that does not use pandas? I also need to read multiple sheets from the Excel file. Please suggest.

https://stackoverflow.com/questions/44196741/how-to-construct-dataframe-from-a-excel-xls-xlsx-file-in-scala-spark — sandeep rawat, Dec 07 '21 at 13:48

score 2 · Answer 1 · edited Dec 08 '21 at 08:31


I was getting the same error. The problem turned out to be the package version: after I installed the newer version 0.13.8 built for Scala 2.12, it works.

path="/mnt/replacemountpointname/path/filename.xlsx"
df = spark.read.format("com.crealytics.spark.excel").options(header='True', inferSchema='True').load(path)

Link for ref: https://www.youtube.com/watch?v=ib8Zch_4744
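The question also asks about reading multiple sheets, which the snippet above does not cover. As a minimal sketch, assuming the installed spark-excel version supports the `dataAddress` option (it selects a sheet and starting cell, for example `'Sheet1'!A1`), each sheet can be loaded into its own DataFrame. The helper names `sheet_address` and `read_sheets` and the sheet names in the usage note are hypothetical and not part of the library.

```python
def sheet_address(sheet_name, top_left_cell="A1"):
    """Build a spark-excel dataAddress string such as 'Sheet1'!A1."""
    # Single quotes around the sheet name allow names that contain spaces.
    return f"'{sheet_name}'!{top_left_cell}"


def read_sheets(spark, path, sheet_names):
    """Read each named sheet into its own DataFrame, keyed by sheet name.

    Assumes the spark-excel package is installed on the cluster and that
    `spark` is an active SparkSession (predefined in Databricks notebooks).
    """
    frames = {}
    for name in sheet_names:
        frames[name] = (
            spark.read.format("com.crealytics.spark.excel")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("dataAddress", sheet_address(name))
            .load(path)
        )
    return frames
```

In a notebook this might be called as `dfs = read_sheets(spark, path, ["Sheet1", "Sheet2"])`, with the list replaced by the workbook's real sheet names. If the sheets share the same columns, the resulting DataFrames could then be combined with `unionByName`.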

edited Dec 08 '21 at 08:31 by RiveN

answered Dec 07 '21 at 21:12 by Mohammed Ehtesham
