
I am trying to read multiple CSV files using Spark, and I need to skip more than one header line from each file. I am able to achieve this with the code below:

            rdd = df.rdd
            schema = df.schema
            # zipWithIndex pairs each row with its position; drop the header rows
            rdd_without_header = rdd.zipWithIndex().filter(lambda pair: pair[1] > skip_header).keys()
            df = spark_session.createDataFrame(rdd_without_header, schema=schema)

This code works fine, but when the input is multiple gzip-compressed (.gz) files the operation takes very long to complete: on the order of 10x slower than with uncompressed files.

Since I want to skip multiple header lines from every file, I cannot leverage Spark's built-in skip-header option:

option("header", "true")

What would be the best and most optimized way to handle this use case?

Manish Mehra
  • Does this look similar to your query - https://stackoverflow.com/questions/59066489/reading-a-text-file-with-multiple-headers-in-spark/59066751#comment104372594_59066751 – Kumar Rohit Jan 16 '20 at 13:29
  • if you know the schema of your file, you can use the classic `spark.read.csv` with `mode='DROPMALFORMED'` (works with the condition that there is at least one column which is not a string) – Steven Jan 16 '20 at 13:31
  • There is a small difference: my headers span multiple lines. For example, I want to read 5 CSV files from a given location, and all of them have the first 3 lines as headers, which I want to skip. @KumarRohit – Manish Mehra Jan 16 '20 at 13:31
  • Just give the path until the directory containing multiple CSVs. It'll work just fine then onwards. – Kumar Rohit Jan 16 '20 at 13:34
  • Does this answer your question? [How to skip more then one lines of header in RDD in Spark](https://stackoverflow.com/questions/32877326/how-to-skip-more-then-one-lines-of-header-in-rdd-in-spark). The most efficient option is `mapPartitionsWithIndex()`. – SergiyKolesnikov Jan 16 '20 at 15:29
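
A minimal sketch of the `mapPartitionsWithIndex()` approach mentioned in the last comment. The skipping helper works on any iterator, so it is shown stand-alone; the `SKIP_HEADER = 3` value, the file paths, and the `sc` SparkContext in the commented usage are assumptions, not part of the original question:

```python
SKIP_HEADER = 3  # assumed number of header lines per file

def drop_leading(index, iterator):
    # When each file is loaded as its own RDD, the header lines can only
    # appear in partition 0, so skip the first SKIP_HEADER records there
    # and pass everything else through untouched.
    if index == 0:
        for _ in range(SKIP_HEADER):
            next(iterator, None)
    return iterator

# Hypothetical Spark usage (one RDD per file so headers align with partition 0;
# a .gz file is a single partition anyway, since gzip is not splittable):
# rdds = [sc.textFile(path).mapPartitionsWithIndex(drop_leading)
#         for path in ["a.csv.gz", "b.csv.gz"]]
# data = sc.union(rdds)
```

Because the header check runs once per partition rather than pairing every record with an index, this avoids the extra pass that `zipWithIndex()` performs over each file.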

0 Answers