There is no easy way to drop rows by line number because Spark DataFrames do not, by default, have a concept of order. There is no "first" or "last" row; each row is treated as an independent block of structured data. This is fundamental to Spark and is what allows it to distribute and parallelize computation: each executor can pick up an arbitrary chunk of the data and process it.
Although your question asks how to drop the first and last rows, I assume what you actually want is to keep only the rows that follow the correct schema.
If you know the correct schema ahead of time, you can pass it into spark.read.csv and use mode="DROPMALFORMED":
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
schema = StructType(
    [
        StructField('a', IntegerType()),
        StructField('b', StringType()),
        StructField('c', StringType()),
        StructField('d', IntegerType())
    ]
)
df = spark.read.csv('sample.txt', sep="|", mode="DROPMALFORMED", schema=schema)
df.show()
#+---+------+--------+-------+
#| a| b| c| d|
#+---+------+--------+-------+
#|123|sample|customer| 3433|
#|786| ex| data|7474737|
#|987| solve| data| 6364|
#+---+------+--------+-------+
Notes:
- You can introduce order via a sort or with a Window function (see the first sketch after these notes). See: Pyspark add sequential and deterministic index to dataframe (and check out the posts linked in the question).
- If you truly wanted to drop the first and last rows, you could add line numbers to the RDD with zipWithIndex() and use them to filter out the smallest and largest line numbers (see the second sketch below).
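For the first note, here is a minimal sketch of adding a sequential index with a Window function, assuming the df and column names from the example above (the column name row_num is arbitrary). Be aware that a Window with no partitionBy pulls all rows into a single partition, so this only makes sense for small data:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Order by an existing column ('a' here) and assign consecutive row numbers.
# Without partitionBy, Spark will warn that all data moves to one partition.
w = Window.orderBy("a")
df_indexed = df.withColumn("row_num", F.row_number().over(w))
df_indexed.show()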
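For the second note, a minimal sketch of the zipWithIndex() approach, again assuming the sample.txt file and "|" separator from above; the column names passed to createDataFrame are just the ones used earlier:

# Number every line of the raw file, then drop the lines with the smallest
# and largest index (i.e. the first and last lines of the file).
rdd = spark.sparkContext.textFile('sample.txt').zipWithIndex()
max_index = rdd.map(lambda pair: pair[1]).max()

trimmed = (
    rdd.filter(lambda pair: pair[1] not in (0, max_index))
       .map(lambda pair: pair[0].split("|"))
)

# All columns come back as strings here; cast them afterwards if you need the
# types from the schema above.
df_trimmed = spark.createDataFrame(trimmed, ["a", "b", "c", "d"])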