I have a requirement where-in I need to read the excel file (with .xlsx extension) in spark/scala. I need to create a dataframe with the data read from excel and apply/write sql queries on top it to do some analysis. The excel file has some column headers/titles like "time_spend_company (Years)", "average_monthly_hours (hours)" etc which as spaces in the headers itself, these spaces are causing me problems to apply any sql queries on the loaded dataframe.
I am using com.crealytics.spark.excel library to parse the excel contents, and my code looks like below
val empFile = "C:\\EmpDatasets.xlsx"
val employeesDF = sc.sqlContext.read
.format("com.crealytics.spark.excel")
.option("sheetName", "Sheet1")
.option("useHeader", "true")
.option("treatEmptyValuesAsNulls", "false")
.option("inferSchema", "false")
.option("location", empFile)
.option("addColorColumns", "False")
.load()
employeesDF.createOrReplaceTempView("EMP")
I want to apply some group by and other aggregate functions on these columns and I am facing issues with these columns like below, my requirement is to apply group by on time_spent_company column and get a count of it.
val expLevel = sc.sqlContext.sql("Select 'time_spend_company (Years)' as 'Years_spent_in_company',count(1) from EMP where left_company = 1 group by 'time_spend_company (Years)'")
expLevel.show
I need help on :-
- Is there any better way to load the excel and assign custom column names to it and create a dataframe?
- how to write sql queries for these column names which has spaces within in it?
Note: I need to read it as excel file only, I can't convert into csv or any other file formats.