
I'm trying to read an Excel file with Spark using Jupyter in VS Code, with Java version 1.8.0_311 (Oracle Corporation) and Scala version 2.12.15.

Here is the code:

# import necessary libraries
import pandas as pd
from pyspark.sql.types import StructType

# entry point for Spark functionality
from pyspark import SparkContext, SparkConf, SQLContext

configure = SparkConf().setAppName("name").setMaster("local")
sc = SparkContext(conf=configure)
sql = SQLContext(sc)

# entry point for Spark DataFrames
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

# reading the excel file
df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")

Unfortunately, it produces an error:

Py4JJavaError: An error occurred while calling o36.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: com.crealytics.spark.excel. Please find packages at
http://spark.apache.org/third-party-projects.html

2 Answers


Turns out winutils wasn't installed.


Check your classpath: you must have the jar containing com.crealytics.spark.excel on it.

With Spark, the architecture is a bit different from traditional applications. You may need the jar in several places: in your application, at the master level, and/or at the worker level. Ingestion (what you're doing) is done by the workers, so make sure they have this jar on their classpath.

  • I just checked the classpath entries in the Spark UI and nothing seems to match excel, but I'm pretty sure I added it at the creation of the Spark session: `.config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2")` – Renz Carillo Dec 24 '21 at 12:38
  • The Spark session “runs” in your master, but the worker does the ingestion. – jgp Dec 24 '21 at 12:42
  • Spark UI > Environment > Classpath Entries > all of the jars belong to the system classpath; where do I find the classpath for the workers specifically? – Renz Carillo Dec 24 '21 at 12:48
  • @jgp I doubt classpath can be an issue here because `spark.jars.packages` (if successful) "...Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths..." – mazaneicha Dec 24 '21 at 16:19