
I'm trying to use PySpark on AWS EMR to read an Excel file that resides in S3. To do this, I downloaded the spark-excel jars spark-excel_2.11-0.12.4.jar and spark-excel_2.12-0.13.5.jar and placed them in an S3 bucket.

scenario 1:
===========
df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .load("s3://bucket/abc.xlsx")

spark-submit --jars s3://Bucket/spark-excel_2.11-0.12.4.jar test.py

Error:
Caused by: java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils

scenario 2:
===========
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3://bucket/abc.xlsx")

spark-submit --jars s3://Bucket/spark-excel_2.12-0.13.5.jar test.py

Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o79.load.
: java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)

Can someone please assist me in fixing this issue? I appreciate your help!

N9909
2 Answers


You can read the Excel file directly. Indeed, this is better practice than going through pandas, since then the benefit of Spark would be lost.

You can run the same code sample as defined above, just adding the needed package to the configuration of your SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()
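Unlike --jars, which ships only the exact jar files you list, spark.jars.packages resolves the Maven coordinate together with its transitive dependencies (such as commons-collections4, the class missing in your first error). The same resolution can be done at submit time with the --packages flag, for example:

spark-submit --packages com.crealytics:spark-excel_2.11:0.12.2 test.py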

Then, you can read your Excel file.

df = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "NameOfYourExcelSheet") \
    .load("your_file")
Amardeep Flora
  • I have tested after adding the package to the SparkSession config("spark.jars.packages","com.crealytics:spark-excel_2.11-0.12.4"), but I still get the same error: Caused by: java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils. Any thoughts? – N9909 Feb 15 '21 at 02:06

Probably the Excel document can fit into the driver's memory. In that case, I would read the file using pandas directly from S3 and then create a Spark DataFrame from it. Try something like:

import pandas as pd
pdf = pd.read_excel("s3://bucket/abc.xlsx")
pdf
   col1 col2
0   1.2    a
1   2.3    b

sdf = spark.createDataFrame(pdf)
sdf.show()

+----+----+
|col1|col2|
+----+----+
| 1.2|   a|
| 2.3|   b|
+----+----+
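Note that pandas reads s3:// URLs through the s3fs package, and parsing .xlsx files requires an Excel engine such as openpyxl (or xlrd on older pandas versions), so both need to be installed on the driver:

pip install s3fs openpyxl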

EDIT:

This suggested approach is recommended only when you have a very small number of files that you want to join with other dataframes. If your big dataset comes from xlsx files, I recommend you follow the com.crealytics.spark.excel solution. However, I would be more worried about using the Excel format in the first place.
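For that few-files case, a minimal sketch could look like this (the bucket paths are hypothetical, and spark is assumed to be an existing SparkSession):

import pandas as pd
from functools import reduce

# hypothetical list of small Excel files to combine
paths = ["s3://bucket/a.xlsx", "s3://bucket/b.xlsx"]

# read each file on the driver with pandas and convert it to Spark
sdfs = [spark.createDataFrame(pd.read_excel(p)) for p in paths]

# union everything into a single DataFrame for further joins
sdf = reduce(lambda a, b: a.unionByName(b), sdfs)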

Emer
  • Is there any package required to access an S3 bucket while using pandas from EMR? – N9909 Feb 13 '21 at 00:18
  • I don't have an EMR running to test the code. In theory, pandas uses boto behind the scenes, and it is installed on EMRs. Make sure the EMR has an access role that can read from S3. A starting point is this post: https://stackoverflow.com/questions/37703634/how-to-import-a-text-file-on-aws-s3-into-pandas-without-writing-to-disk – Emer Feb 13 '21 at 00:39
  • Thanks for the information. I'm also looking for another way in case we get larger files in the future; somehow those jars are not working at all. – N9909 Feb 13 '21 at 01:50