I have the following data in Excel format:
I want to read this into a DataFrame (Python / PySpark). The issue I am having is that the merged cells appear as "null" values, and even after using the code below, I cannot fill the first 5 columns down into single values per row.
df = df.fillna(method='ffill')
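To illustrate, here is a fuller sketch of what I am attempting (assuming the merged cells span the first 5 columns, and noting that fillna returns a new DataFrame rather than modifying in place; newer pandas versions prefer DataFrame.ffill()):

import pandas as pd

# Merged cells come back as NaN everywhere except their top-left cell
df = pd.read_excel(rpath, sheet_name='Sheet1', skiprows=8)

# Forward-fill only the first 5 columns, where the merged blocks live,
# so each row carries the value of the merged cell it belongs to
df.iloc[:, :5] = df.iloc[:, :5].ffill()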
All I want is to skip the top 8 rows (the report header and descriptions), read the cells starting from row 9 (which has the column headers), and skip the last row as well, where it says it is to be ignored.
The resulting DataFrame I want should look like the one below.
Also, sometimes the number of rows to skip could be more than 8, so I want to handle that dynamically and start from the heading row (row 9 in this case), if that is possible, instead of hardcoding skiprows as below:
df_sheet_multi = pd.read_excel(rpath, sheet_name=None, skiprows=10)
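What I have in mind is something like the sketch below: scan the sheet once without headers, find the row whose first cell holds a known heading ('description' here, matching my schema, though the exact text in the sheet is an assumption), and use that index for skiprows, with skipfooter=1 to drop the final row that is to be ignored:

import pandas as pd

# Read once without headers so every row, including the report header, is visible
raw = pd.read_excel(rpath, sheet_name='Sheet1', header=None)

# Index of the first row whose first cell matches the known column heading
header_idx = raw.index[raw[0].astype(str).str.strip().str.lower() == 'description'][0]

# Re-read using that row as the header; skipfooter drops the trailing "ignore" row
df = pd.read_excel(rpath, sheet_name='Sheet1', skiprows=header_idx, skipfooter=1)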
This is how I did it:
from openpyxl import load_workbook

wb = load_workbook(rpath)
ws = wb['Sheet1']

desc = []
sheet_schema = ['description', 'start_date', 'end_date', 'value1', 'value2', 'value3']
all_rows = list(ws.rows)

# Skip the report header rows and keep only rows that have a description
for row_cells in all_rows[10:]:
    if row_cells[0].value is not None:
        row = [row_cells[0].value, row_cells[3].value, row_cells[4].value,
               row_cells[5].value, row_cells[6].value, row_cells[7].value]
        desc.append(row)

df = spark.createDataFrame(data=desc, schema=sheet_schema)
df2 = df.na.drop(subset=["start_date", "end_date", "value1", "value2", "value3"])
display(df2)
One of the problems with this approach is that if there are hundreds of columns, creating the schema by hand is a big problem. I believe there should be a cleaner way of doing this?
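Two ways around hardcoding the schema come to mind (sketches, not definitive): take the column names straight from the header row of the sheet, or let pandas read the sheet and hand the pandas DataFrame to Spark, which infers column names and types. Assuming the header row sits just above the first data row, at all_rows[9] in the code above (adjust the index to match the sheet):

# Option 1: build the schema from the header row itself
header_cells = all_rows[9]   # the row holding the column headings
sheet_schema = [str(c.value) for c in header_cells if c.value is not None]

# Option 2: read via pandas and let Spark infer the schema
import pandas as pd

pdf = pd.read_excel(rpath, sheet_name='Sheet1', skiprows=8, skipfooter=1)
pdf.iloc[:, :5] = pdf.iloc[:, :5].ffill()   # fill the merged columns down
sdf = spark.createDataFrame(pdf)            # names and types come from pandas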