
I am trying to read a .xlsx file from a local path in PySpark.

I've written the code below:

from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Planning') \
      .enableHiveSupport() \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()

Error:

TypeError: 'DataFrameReader' object is not callable

  • Hi @OMG, read allows you to access a DataFrameReader, which enables loading parquet / csv / json / text / excel / ... files with specific methods – baitmbarek Jan 22 '20 at 08:08
  • @baitmbarek: shall i use .load.... please help – OMG Jan 22 '20 at 08:12
  • You can take a look at these suggestions first : https://datascience.stackexchange.com/questions/22736/is-there-any-way-to-read-xlsx-file-in-pysparkalso-want-to-read-strings-of-colum – baitmbarek Jan 22 '20 at 08:14
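As the comment above explains, `read` is a property that returns a `DataFrameReader`; you call a loading method on it (e.g. `.format(...).load(path)`) rather than calling `read` itself. The same TypeError can be reproduced in plain Python with hypothetical stand-in classes (not Spark's actual implementation, just an illustration of the pattern):

```python
class Reader:
    """Stand-in for pyspark's DataFrameReader: you load via a method."""
    def load(self, path):
        return f"loaded {path}"

class Session:
    @property
    def read(self):
        # Like SparkSession.read: an attribute returning a reader, not a method
        return Reader()

spark = Session()
print(spark.read.load("data.xlsx"))  # correct: call a method on the reader

try:
    spark.read("data.xlsx")          # wrong: the reader object is not callable
except TypeError as e:
    print(e)                         # 'Reader' object is not callable
```

The fix for the question's last line follows the same shape: call a method on `read`, don't call `read`.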

2 Answers


You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

# Read the sheet with pandas (read_excel has no inferSchema option;
# pandas infers column dtypes itself), then convert to a Spark DataFrame
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)

df.show()
Ghost
  • Thanks Amit, but getting error like : ImportError: Install xlrd >= 1.0.0 for Excel support – OMG Jan 22 '20 at 10:55
  • xlrd package is not installed. Just pip install xlrd, it will start working. – Ghost Jan 22 '20 at 10:59
  • inferSchema is not (or no longer, probably?) a supported argument. (`TypeError: read_excel() got an unexpected keyword argument 'inferSchema'`) – Sander Vanden Hautte Apr 13 '21 at 10:08
  • Is there a way to reading an Excel file direct to Spark without using pandas as an intermediate step? – Ramon Feb 08 '22 at 15:38
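For completeness, here is a self-contained round trip with pandas (file, sheet, and column names are illustrative; it assumes the openpyxl engine is installed, which also sidesteps the xlrd error mentioned in the comments):

```python
import pandas as pd

# Write a tiny sheet, then read it back the way the answer suggests.
df_out = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df_out.to_excel("demo.xlsx", sheet_name="Sheet1", index=False, engine="openpyxl")

pdf = pd.read_excel("demo.xlsx", sheet_name="Sheet1", engine="openpyxl")
print(pdf.shape)  # (2, 2)
```

`spark.createDataFrame(pdf)` then converts the result as in the answer. With modern pandas, passing `engine="openpyxl"` explicitly makes the .xlsx dependency unambiguous.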

You could use the crealytics spark-excel package.

You need to add it to Spark, either via Maven coordinates or when starting the Spark shell, as below.

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
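If you are working from PySpark rather than the Scala shell, the same `--packages` flag applies (a sketch, using the version from this answer):

```shell
$SPARK_HOME/bin/pyspark --packages com.crealytics:spark-excel_2.12:0.13.1
```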

For Databricks users: add it as a library by navigating to Cluster → 'clusterName' → Libraries → Install New, and provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.

df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!")
      .option("header", "true")
      .option("inferSchema", "true")
      # raw string so the backslashes in the Windows path aren't escapes
      .load(r"C:\P_DATA\tyco_93_A.xlsx"))

More options are documented on the GitHub page below.

https://github.com/crealytics/spark-excel

Deva
  • Version 0.14.0 was released in Aug 2021 and it's working. Versions 0.15.0, 0.15.1, 0.15.2, and 0.16.0 were also released for Spark 3, but these are not working, so stick with 0.14.0 – Pramod Kumar Sharma Feb 03 '22 at 22:35