I am trying to read multiple CSV files with Spark, and I need to skip more than one header line in each file. I am able to achieve this with the code below:
rdd = df.rdd
schema = df.schema
# zipWithIndex pairs each row with its index; keep rows whose index is > skip_header
rdd_without_header = rdd.zipWithIndex().filter(lambda pair: pair[1] > skip_header).keys()
df = spark_session.createDataFrame(rdd_without_header, schema=schema)
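For reference, the zipWithIndex + filter step above amounts to the following plain-Python logic (a minimal sketch on a made-up list of rows, no Spark required; `rows` and `skip_header` are hypothetical example values):

```python
# Minimal sketch of the header-skipping logic in plain Python.
# `rows` and `skip_header` are made-up example values.
rows = ["header1", "header2", "header3", "data1", "data2"]
skip_header = 2  # keep rows with index > 2, i.e. drop the first three lines

# enumerate plays the role of rdd.zipWithIndex()
rows_without_header = [row for index, row in enumerate(rows) if index > skip_header]
# rows_without_header == ["data1", "data2"]
```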
This code works fine, but when the input consists of multiple gzip-compressed files, the operation takes a very long time to complete: roughly 10x slower than with uncompressed files.
Since I want to skip multiple header lines from every file, I cannot leverage Spark's built-in skip-header option, which drops only a single line:
option("header", "true")
What would be the best and most optimized way to handle this use case?