
I'm trying to read an Excel file with Spark using Jupyter in VS Code, with Java version 1.8.0_311 (Oracle Corporation) and Scala version 2.12.15.

Here is the code:

# import necessary libraries
import pandas as pd
from pyspark.sql.types import StructType

# entry point for Spark functionality
from pyspark import SparkContext, SparkConf, SQLContext

configure = SparkConf().setAppName("name").setMaster("local")
sc = SparkContext(conf=configure)
sql = SQLContext(sc)

# entry point for Spark DataFrames
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("pharmacy scraper") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2") \
    .getOrCreate()

# reading the excel file
df_generika = spark.read.format("com.crealytics.spark.excel") \
    .option("useHeader", "true") \
    .option("inferSchema", "true") \
    .option("dataAddress", "Sheet1") \
    .load("./../data/raw-data/generika.xlsx")

Unfortunately, it produces an error:

Py4JJavaError: An error occurred while calling o36.load.
: java.lang.ClassNotFoundException: 
Failed to find data source: com.crealytics.spark.excel. Please find packages at
http://spark.apache.org/third-party-projects.html

2 Answers


Turns out winutils wasn't installed.


Check your classpath: you must have the jar containing com.crealytics.spark.excel on it.

With Spark, the architecture is a bit different from traditional applications. You may need the jar in several places: in your application, at the master level, and/or at the worker level. Ingestion (what you're doing) is done by the workers, so make sure they have this jar on their classpath.

  • I just checked the classpath entries in the Spark UI and nothing seems to match excel, but I'm pretty sure I added it at the creation of the Spark session: `.config("spark.jars.packages", "com.crealytics:spark-excel_2.11:0.12.2")` – Renz Carillo Dec 24 '21 at 12:38
  • The Spark session “runs” in your master, but the worker does the ingestion. – jgp Dec 24 '21 at 12:42
  • Spark UI > Environment > Classpath Entries > all of the jars belong to the system classpath; where do I find the classpath for the workers specifically? – Renz Carillo Dec 24 '21 at 12:48
  • @jgp I doubt classpath can be an issue here because `spark.jars.packages` (if successful) "...Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths..." – mazaneicha Dec 24 '21 at 16:19