There is no easy way to drop rows by line number because Spark DataFrames do not, by default, have a concept of order. There is no "first" or "last" row; each row is treated as an independent block of structured data. This is fundamental to Spark and is what allows it to distribute and parallelize computation: each executor can pick up an arbitrary chunk of the data and process it.
Although your question asks how to drop the first and last rows, I assume what you actually want is to keep only the rows that follow the correct schema.
If you know the correct schema ahead of time, you can pass it into spark.read.csv and use mode="DROPMALFORMED":
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
schema = StructType(
    [
        StructField('a', IntegerType()),
        StructField('b', StringType()),
        StructField('c', StringType()),
        StructField('d', IntegerType())
    ]
)
df = spark.read.csv('sample.txt', sep="|", mode="DROPMALFORMED", schema=schema)
df.show()
#+---+------+--------+-------+
#| a| b| c| d|
#+---+------+--------+-------+
#|123|sample|customer| 3433|
#|786| ex| data|7474737|
#|987| solve| data| 6364|
#+---+------+--------+-------+
Notes:
- You can introduce order via a sort or with a Window function (see the first sketch after these notes). See: Pyspark add sequential and deterministic index to dataframe (and check out the posts linked in the question).
- If you truly wanted to drop the first and last rows, you could add line numbers to the RDD with zipWithIndex() and use them to filter out the smallest and largest line numbers (see the second sketch below).
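For the first note, here is a minimal sketch of adding a sequential index with a Window function, assuming the df and column names from the example above (the column name row_num is arbitrary). Be aware that a Window with no partitionBy pulls all rows into a single partition, so this only makes sense for small data:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Order by an existing column ('a' here) and assign consecutive row numbers.
# Without partitionBy, Spark will warn that all data moves to one partition.
w = Window.orderBy("a")
df_indexed = df.withColumn("row_num", F.row_number().over(w))
df_indexed.show()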
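For the second note, a minimal sketch of the zipWithIndex() approach, again assuming the sample.txt file and "|" separator from above; the column names passed to createDataFrame are just the ones used earlier:

# Number every line of the raw file, then drop the lines with the smallest
# and largest index (i.e. the first and last lines of the file).
rdd = spark.sparkContext.textFile('sample.txt').zipWithIndex()
max_index = rdd.map(lambda pair: pair[1]).max()

trimmed = (
    rdd.filter(lambda pair: pair[1] not in (0, max_index))
       .map(lambda pair: pair[0].split("|"))
)

# All columns come back as strings here; cast them afterwards if you need the
# types from the schema above.
df_trimmed = spark.createDataFrame(trimmed, ["a", "b", "c", "d"])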