
How can I drop rows in PySpark based on their row number/row index value?

I am new to PySpark (and coding). I have tried writing something, but it is not working.

Shravan K
  • Spark DataFrames do not have row numbers or row index values in the way pandas DataFrames do. So the answer to your question as it's written is "you can not." If you're looking for a different answer, please first take some time to take the [tour] and read [ask]. Then [edit] your question to include a [small reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Apr 08 '19 at 17:49
  • Then how can I drop rows of a particular range? – Shravan K Apr 09 '19 at 05:39
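Since Spark rows carry no inherent index, one common workaround is to generate a row number explicitly and then filter on it. A minimal sketch (the DataFrame and the choice of ordering column are assumptions for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a',), ('b',), ('c',)], ['name'])

# row_number() requires an explicit ordering, since Spark rows are unordered;
# without partitionBy this moves all data to one partition (fine for small data)
w = Window.orderBy('name')
indexed = df.withColumn('rownumber', F.row_number().over(w))

indexed.filter(F.col('rownumber') != 2).show()  # "drop" row 2 by keeping the rest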

2 Answers


You can't drop specific rows by index, but you can keep just the ones you want, using filter or its alias, where.

Imagine you want to "drop" the rows where a person's age is lower than 3. You can simply keep the opposite rows, like this:

df.filter(df.age >= 3)
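For a self-contained sketch (the DataFrame and the age values here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('Tom', 2), ('Ann', 5)], ['name', 'age'])

df.filter(df.age >= 3).show()  # keeps only Ann
df.where(df.age >= 3).show()   # where is an alias for filter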
Manrique

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
# build a small DataFrame with an explicit row-number column
schema1 = StructType([StructField('rownumber', IntegerType(), True), StructField('name', StringType(), True)])
data1 = [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e')]
df1 = spark.createDataFrame(data1, schema1)
df1.show()
+---------+----+
|rownumber|name|
+---------+----+
|        1|   a|
|        2|   b|
|        3|   c|
|        4|   d|
|        5|   e|
+---------+----+
# between() is inclusive on both ends, so this keeps rows 2 through 4
df1.filter(F.col("rownumber").between(2, 4)).show()
+---------+----+
|rownumber|name|
+---------+----+
|        2|   b|
|        3|   c|
|        4|   d|
+---------+----+
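To drop that range instead of keeping it (which is what the comments ask for), negate the condition with ~:

df1.filter(~F.col("rownumber").between(2, 4)).show()
+---------+----+
|rownumber|name|
+---------+----+
|        1|   a|
|        5|   e|
+---------+----+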
Prathik Kini