
I have the following data in Excel format: [screenshot of the Excel sheet]

I want to read this into a DataFrame (Python/PySpark). The issue I am having is that the merged cells appear as "null" values, and even after using the code below, I cannot fill the first 5 merged columns back in as single values.

df = df.fillna(method='ffill')  # result must be assigned back; fillna returns a new DataFrame
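For example, what I am attempting is roughly the below (a sketch; it assumes the merged cells are limited to the first 5 columns):

    import pandas as pd

    # Read the sheet, skipping the report header rows
    df = pd.read_excel(rpath, sheet_name='Sheet1', skiprows=8)

    # Forward-fill only the first 5 (merged) columns, assigning the
    # result back in place
    df.iloc[:, :5] = df.iloc[:, :5].ffill()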

All I want is to skip the top 8 rows (the report header and descriptions), read the cells starting from row 9 (which holds the column headers), and also skip the last row, where it says it is to be ignored.

The resultant DataFrame I want should look like the below.

Also, sometimes the rows to skip could be more than 8, so I want to handle that dynamically and start from the heading row wherever it is, if that is possible, instead of hardcoding skiprows as below:

df_sheet_multi = pd.read_excel(rpath, sheet_name=None, skiprows=10)

[screenshot of the expected output DataFrame]
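One idea I had for finding the header row dynamically (a rough sketch; it assumes the first heading cell literally says 'description', which may not hold for every report):

    import pandas as pd

    # Read with no header so the heading row is visible as ordinary data
    raw = pd.read_excel(rpath, sheet_name='Sheet1', header=None)

    # Locate the first row whose first cell matches the known heading
    header_idx = raw.index[
        raw[0].astype(str).str.strip().str.lower() == 'description'
    ][0]

    # Re-read using that row as the header, and drop the trailing note row
    df = pd.read_excel(rpath, sheet_name='Sheet1', skiprows=header_idx).iloc[:-1]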

This is how I did it:

from openpyxl import load_workbook

wb = load_workbook(rpath, data_only=True)  # data_only=True gives cell values, not formulas
ws = wb['Sheet1']

sheet_schema = ['description', 'start_date', 'end_date', 'value1', 'value2', 'value3']
desc = []
for row in list(ws.rows)[10:]:  # data starts below the report header rows
    if row[0].value is not None:  # skip rows where the merged cell left a gap
        desc.append([row[0].value, row[3].value, row[4].value,
                     row[5].value, row[6].value, row[7].value])

df = spark.createDataFrame(data=desc, schema=sheet_schema)
df2 = df.na.drop(subset=["start_date", "end_date", "value1", "value2", "value3"])
display(df2)

One of the problems with this approach is that if there are hundreds of columns, creating the schema by hand is a big problem. I believe there should be a cleaner way of doing this?
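For example, I was thinking the schema could be built from the heading row instead of typed out, something like the below (just a sketch; it defaults every column to StringType, and assumes the headings sit on the row just above where my loop starts):

    from pyspark.sql.types import StructType, StructField, StringType

    # Take the column names from the heading row instead of hardcoding them
    header_row = list(ws.rows)[9]  # assumed 0-based index of the heading row
    columns = [cell.value for cell in header_row if cell.value is not None]

    # Default every field to StringType; individual columns can be cast later
    dyn_schema = StructType([StructField(name, StringType(), True) for name in columns])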

Pysparker

1 Answer


I tried to reproduce the same in my environment and got the below results.

You need to install the spark-excel library on your Databricks cluster to read the Excel file.

Follow this path: go to Clusters -> select your cluster -> Libraries -> Install new -> select Maven -> in Coordinates, enter com.crealytics:spark-excel_2.12:3.2.2_0.18.1


As you can see, the code below specifies a dataAddress option.

If you want to return all rows and columns, use option("dataAddress", "'Sheet1'!A3"), where A3 is the starting cell of the data in the xlsx file.

Or

To read only a specified range, use .option("dataAddress", "'Sheet1'!A3:F7"); it reads only the rows and columns in the specified range.

Sample code to read a specific cell range:

file_location = '/FileStore/sample_me.xlsx'  # file location

# Schema structure
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

s = StructType([
    StructField("description", StringType(), True),
    StructField("start_date", StringType(), True),
    StructField("end_date", StringType(), True),
    StructField("value1", IntegerType(), True),
    StructField("value2", IntegerType(), True),
    StructField("value3", IntegerType(), True),
])


sdf = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .schema(s) \
    .option("sheetName", "Sheet1") \
    .option("dataAddress", "'Sheet1'!A3:F7") \
    .load(file_location)

display(sdf)


Now you can check that it executed successfully and produced the output:

[screenshot of the output DataFrame]

Update:

If you have 2 Excel sheets, try this code:

sdf.write.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .option("mergeSchema", "true") \
    .option("dataAddress", "'Sheet1'!A3:F7") \
    .mode("append") \
    .save("/FileStore/filename.xlsx")
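To read the second sheet the same way, you could point dataAddress at the other sheet and union the two DataFrames (a sketch; the sheet name 'Sheet2' and the identical layout are assumptions):

    # Read the second sheet ('Sheet2' is assumed) over the same range
    sdf2 = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .schema(s) \
        .option("dataAddress", "'Sheet2'!A3:F7") \
        .load(file_location)

    # Stack the two sheets into one DataFrame
    combined = sdf.unionByName(sdf2)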


If you have multiple Excel sheets (more than 2), follow this SO thread.
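As a rough sketch of that approach (it assumes every sheet shares the same layout, and uses openpyxl only to list the sheet names, since that part is not handled by spark-excel itself):

    from functools import reduce
    from openpyxl import load_workbook

    # Enumerate sheet names with openpyxl (the /dbfs/ prefix exposes DBFS as a local path)
    sheet_names = load_workbook('/dbfs/FileStore/sample_me.xlsx', read_only=True).sheetnames

    # Read each sheet over the same cell range
    frames = [
        spark.read.format("com.crealytics.spark.excel")
             .option("header", "true")
             .schema(s)
             .option("dataAddress", f"'{name}'!A3:F7")
             .load(file_location)
        for name in sheet_names
    ]

    # Stack all sheets into one DataFrame
    combined = reduce(lambda a, b: a.unionByName(b), frames)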

B. B. Naga Sai Vamsi
  • Hi saiVamsi, I already know this method and am using it in several places where the header row is at a fixed location. My question is about the situation where you need to scrape the spreadsheet because the start location may vary; you need to find the row with the column headings and start reading below that, and in that case merge the columns accordingly. Also, in the case of 100s of columns, how do you define the schema dynamically? – Pysparker Dec 22 '22 at 20:43
  • Please check the update above. Regarding defining the schema dynamically for 100s of columns, refer to this [link](https://www.youtube.com/watch?v=utXSV8daQ9U) – B. B. Naga Sai Vamsi Dec 23 '22 at 04:56