
I am trying to read a .xlsx file from a local path in PySpark.

I've written the code below:

from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Planning') \
      .enableHiveSupport() \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()

Error:

TypeError: 'DataFrameReader' object is not callable

  • Hi @OMG, read allows you to access a DataFrameReader, which enables loading parquet / csv / json / text / excel / ... files with specific methods – baitmbarek Jan 22 '20 at 08:08
  • @baitmbarek: shall i use .load.... please help – OMG Jan 22 '20 at 08:12
  • You can take a look at these suggestions first : https://datascience.stackexchange.com/questions/22736/is-there-any-way-to-read-xlsx-file-in-pysparkalso-want-to-read-strings-of-colum – baitmbarek Jan 22 '20 at 08:14
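As the comment above explains, `read` is a property that returns a `DataFrameReader`; you call a loading method on it (e.g. `.format(...).load(path)`) rather than calling `read` itself. The same TypeError can be reproduced in plain Python with hypothetical stand-in classes (not Spark's actual implementation, just an illustration of the pattern):

```python
class Reader:
    """Stand-in for pyspark's DataFrameReader: you load via a method."""
    def load(self, path):
        return f"loaded {path}"

class Session:
    @property
    def read(self):
        # Like SparkSession.read: an attribute returning a reader, not a method
        return Reader()

spark = Session()
print(spark.read.load("data.xlsx"))  # correct: call a method on the reader

try:
    spark.read("data.xlsx")          # wrong: the reader object is not callable
except TypeError as e:
    print(e)                         # 'Reader' object is not callable
```

The fix for the question's last line follows the same shape: call a method on `read`, don't call `read`.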

2 Answers


You can use pandas to read the .xlsx file and then convert it to a Spark DataFrame.

from pyspark.sql import SparkSession
import pandas

spark = SparkSession.builder.appName("Test").getOrCreate()

# Read the sheet with pandas (read_excel has no inferSchema option;
# pandas infers column dtypes itself), then convert to a Spark DataFrame
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)

df.show()
Ghost
  • Thanks Amit, but getting error like : ImportError: Install xlrd >= 1.0.0 for Excel support – OMG Jan 22 '20 at 10:55
  • xlrd package is not installed. Just pip install xlrd, it will start working. – Ghost Jan 22 '20 at 10:59
  • inferSchema is not (or no longer, probably?) a supported argument. (`TypeError: read_excel() got an unexpected keyword argument 'inferSchema'`) – Sander Vanden Hautte Apr 13 '21 at 10:08
  • Is there a way to reading an Excel file direct to Spark without using pandas as an intermediate step? – Ramon Feb 08 '22 at 15:38
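For completeness, here is a self-contained round trip with pandas (file, sheet, and column names are illustrative; it assumes the openpyxl engine is installed, which also sidesteps the xlrd error mentioned in the comments):

```python
import pandas as pd

# Write a tiny sheet, then read it back the way the answer suggests.
df_out = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df_out.to_excel("demo.xlsx", sheet_name="Sheet1", index=False, engine="openpyxl")

pdf = pd.read_excel("demo.xlsx", sheet_name="Sheet1", engine="openpyxl")
print(pdf.shape)  # (2, 2)
```

`spark.createDataFrame(pdf)` then converts the result as in the answer. With modern pandas, passing `engine="openpyxl"` explicitly makes the .xlsx dependency unambiguous.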

You could use the crealytics spark-excel package.

You need to add it to Spark, either via Maven coordinates or when starting the Spark shell, as below.

$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
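If you are working from PySpark rather than the Scala shell, the same `--packages` flag applies (a sketch, using the version from this answer):

```shell
$SPARK_HOME/bin/pyspark --packages com.crealytics:spark-excel_2.12:0.13.1
```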

For Databricks users: add it as a library by navigating to Cluster → 'clusterName' → Libraries → Install New, and provide 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.

df = (spark.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'Sheet1'!")
      .option("header", "true")
      .option("inferSchema", "true")
      # raw string so the backslashes in the Windows path aren't escapes
      .load(r"C:\P_DATA\tyco_93_A.xlsx"))

More options are documented on the GitHub page below.

https://github.com/crealytics/spark-excel

Deva
  • Version 0.14.0 was released in Aug 2021 and it's working. Versions 0.15.0, 0.15.1, 0.15.2, and 0.16.0 were also released for Spark 3, but these are not working, so stick with 0.14.0 – Pramod Kumar Sharma Feb 03 '22 at 22:35