
I have a requirement wherein I need to read an Excel file (with the .xlsx extension) in Spark/Scala. I need to create a dataframe from the data read from the Excel file and apply/write SQL queries on top of it to do some analysis. The Excel file has column headers/titles like "time_spend_company (Years)", "average_monthly_hours (hours)" etc. which have spaces in them, and these spaces are causing me problems when applying any SQL queries on the loaded dataframe.

I am using the com.crealytics.spark.excel library to parse the Excel contents, and my code looks like below:

val empFile = "C:\\EmpDatasets.xlsx"

val employeesDF = sc.sqlContext.read
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

employeesDF.createOrReplaceTempView("EMP")

I want to apply some group by and other aggregate functions on these columns, and I am facing issues like the one below. My requirement is to apply a group by on the time_spend_company column and get a count of it.

val expLevel = sc.sqlContext.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
expLevel.show

I need help with:

  1. Is there a better way to load the Excel file and assign custom column names to it while creating a dataframe?
  2. How do I write SQL queries for column names which have spaces in them?

Note: I need to read it as an Excel file only; I can't convert it into CSV or any other file format.

Krishnan

4 Answers


For version 0.13.5 of spark-excel you will need a different set of parameters:

def readExcel(file: String): DataFrame = {
    sqlContext.read
      .format("com.crealytics.spark.excel")
      .option("dataAddress", "'sheet_name'!A1") // Optional, default: "A1"
      .option("header", "true") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, d[#All]efault None. If set, uses a streaming reader which can help with big files
      .load(file)
  }
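
A minimal usage sketch (assuming the asker's file path and, as in the helper above, a SQLContext named sqlContext in scope):

val employeesDF = readExcel("C:\\EmpDatasets.xlsx")
employeesDF.createOrReplaceTempView("EMP")
employeesDF.printSchema() // verify the parsed column names and types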

maven dependency:

<dependency>
  <groupId>com.crealytics</groupId>
  <artifactId>spark-excel_2.11</artifactId>
  <version>0.13.5</version>
</dependency>
  1. Spark has good support for working with CSVs. So if your Excel file has just one sheet, you can convert it to CSV, for example by opening EmpDatasets.xlsx in Excel and saving it as EmpDatasets.csv (simply renaming the file is not enough; the contents must actually be exported as CSV).

Once you have your file as a CSV, you can read it with spark.read.csv(pathToCSV) and supply many options, for example to read/skip the header, or to supply the schema of the dataset with spark.read.schema(schema).csv(pathToCSV).

Here the schema can be created manually using StructType, or extracted from a case class using the Spark SQL Encoders: Encoders.product[case_class_name].schema.
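
A minimal sketch of the case-class approach (the Employee class and its fields are made up for illustration):

import org.apache.spark.sql.Encoders

// Hypothetical case class mirroring the CSV layout
case class Employee(timeSpendCompany: Int, averageMonthlyHours: Int, leftCompany: Int)

val schema = Encoders.product[Employee].schema
val employeesDF = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("C:\\EmpDatasets.csv")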

  2. You can remove the spaces from the column names like this:

import org.apache.spark.sql.functions.col

// Select each original column and alias it to its name with spaces removed
val employeesDFColumns = employeesDF.columns.map(x => col(x).as(x.replaceAll(" ", "")))

Then apply these new column names to the dataframe:

val renamedEmployeesDF = employeesDF.select(employeesDFColumns: _*)
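
Note that replaceAll(" ", "") only strips spaces; the parentheses, as in time_spend_company(Years), remain and would still need backtick quoting in SQL. A minimal follow-up sketch (assuming a SparkSession named spark):

// Re-register the temp view so SQL queries see the renamed columns
renamedEmployeesDF.createOrReplaceTempView("EMP")
spark.sql("select `time_spend_company(Years)`, count(1) from EMP group by `time_spend_company(Years)`").show()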

Santoshi M
  • I need to read it as an Excel file only, I can't convert it into CSV or any other file format. Can you please suggest anything on reading/parsing Excel files? – Krishnan Nov 22 '17 at 20:08
  • The library you are using for Excel seems perfectly alright, and is one of the recommended ways to read Excel. What problems are you facing in using it? – Santoshi M Nov 22 '17 at 20:18
  • Please refer to my question: I need to assign custom column names to it and create a dataframe, and how do I write SQL queries for column names which have spaces in them? – Krishnan Nov 22 '17 at 21:57

Answer for your question 2: Instead of ' you need to use ` (a backtick) before the start and after the end of column names with spaces. Try the query below, it will work:

val expLevel = sc.sqlContext.sql("Select `time_spend_company (Years)` as `Years_spent_in_company`,count(1) from EMP where left_company = 1 group by `time_spend_company (Years)`")
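
As a side note, the DataFrame API needs no quoting at all, since column names are passed as plain strings. A sketch of the equivalent aggregation against the dataframe from the question:

import org.apache.spark.sql.functions.col

// Equivalent aggregation without SQL; the space in the column name is fine here
employeesDF.filter(col("left_company") === 1)
  .groupBy(col("time_spend_company (Years)").as("Years_spent_in_company"))
  .count()
  .show()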

Question 1: Loading Excel files using "com.crealytics.spark.excel" is fine; I am also using it. There can be different options too. For assigning different column names, you can use a StructType to define the schema and impose it while loading the data into the dataframe, e.g.:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val newSchema = StructType(
    List(StructField("a", IntegerType, nullable = true),
         StructField("b", IntegerType, nullable = true),
         StructField("c", IntegerType, nullable = true),
         StructField("d", IntegerType, nullable = true))
  )

val employeesDF = spark.read.schema(newSchema)
  .format("com.crealytics.spark.excel")
  .option("sheetName", "Sheet1")
  .option("useHeader", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("inferSchema", "false")
  .option("location", empFile)
  .option("addColorColumns", "False")
  .load()

The first four columns can now be accessed as a, b, c and d. Run the query below and it will work with the new column names.

sc.sqlContext.sql("select a,b,c,d from EMP").show()
Amit Kumar
  • I tried defining the schema as per your suggestion and loaded the dataset, but I am getting the exception like below:- Exception in thread "main" org.apache.spark.sql.AnalysisException: com.crealytics.spark.excel does not allow user-specified schemas.; at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:313) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146) at EmpPerformanceAnalysis$.main(EmpPerformanceAnalysis.scala:58) – Krishnan Nov 24 '17 at 01:16
  • val empSchema = StructType( List(StructField("satisfaction_level", FloatType, nullable = true), StructField("last_evaluation", FloatType, nullable = true), .. StructField("salary", StringType, nullable = true)) ) val employeesDF = sc.sqlContext.read.schema(empSchema) .format("com.crealytics.spark.excel") .option("sheetName", "Sheet1") .option("useHeader", "true") .option("treatEmptyValuesAsNulls", "false") .option("inferSchema", "false") .option("location", empFile) .option("addColorColumns", "False") .load() – Krishnan Nov 24 '17 at 01:25
  • @Krishnan I am running the same on Spark 2.1.0 and Scala 2.11. I am not getting the exception that you have mentioned. For the code you have posted, I am getting the exception "java.lang.IllegalArgumentException: Parameter "path" is missing in options.". Then I added the path option in place of the location option and it works fine. – Amit Kumar Nov 24 '17 at 04:13
  • This is what I am running: val employeesDF2 = spark.sqlContext.read.schema(empSchema).format("com.crealytics.spark.excel").option("sheetName", "Sheet1").option("useHeader", "true").option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false").option("location", "csvdatatest/Book1.xlsx").option("addColorColumns", "False").load("csvdatatest/Book1.xlsx") – Amit Kumar Nov 24 '17 at 04:14
  • printSchema result is: scala> employeesDF2.printSchema root |-- satisfaction_level: float (nullable = true) |-- last_evaluation: float (nullable = true) |-- salary: string (nullable = true) – Amit Kumar Nov 24 '17 at 04:17
  • This is the Excel library version: com.crealytics:spark-excel_2.11:0.9.8 – Amit Kumar Nov 24 '17 at 04:19
  • The version was the problem; after adding the latest lib it worked fine, thanks. – Krishnan Nov 27 '17 at 21:58

We can save PySpark data to an Excel file using the pandas library, which provides functionality to write data in the Excel format.

from pyspark.sql import SparkSession
import pandas as pd

# Create a Spark session
spark = SparkSession.builder.appName("PySpark to Excel").getOrCreate()

# Create a PySpark dataframe
df = spark.createDataFrame([(1, "John Doe", 30), (2, "Jane Doe", 35), (3, "Jim Brown", 40)], 
                           ["id", "name", "age"])

# Convert the PySpark dataframe to a pandas dataframe
pandas_df = df.toPandas()

# Write the pandas dataframe to an Excel file
pandas_df.to_excel("output.xlsx", index=False)

# Stop the Spark session
spark.stop()
Abhi