
How can I drop rows in PySpark based on their row number/row index value?

I am new to PySpark (and coding). I have tried writing something, but it is not working.

Shravan K
  • Spark DataFrames do not have row numbers or row index values in the way pandas DataFrames do. So the answer to your question as it's written is "you can not." If you're looking for a different answer, please first take some time to take the [tour] and read [ask]. Then [edit] your question to include a [small reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Apr 08 '19 at 17:49
  • Then how can I drop rows of a particular range? – Shravan K Apr 09 '19 at 05:39
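Since Spark rows carry no inherent index, one common workaround is to generate a row number explicitly and then filter on it. A minimal sketch (the DataFrame and the choice of ordering column are assumptions for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a',), ('b',), ('c',)], ['name'])

# row_number() requires an explicit ordering, since Spark rows are unordered;
# without partitionBy this moves all data to one partition (fine for small data)
w = Window.orderBy('name')
indexed = df.withColumn('rownumber', F.row_number().over(w))

indexed.filter(F.col('rownumber') != 2).show()  # "drop" row 2 by keeping the rest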

2 Answers


You can't drop specific rows by index, but you can keep just the ones you want, using filter or its alias, where.

Imagine you want to "drop" the rows where a person's age is lower than 3. You can simply keep the opposite rows, like this:

df.filter(df.age >= 3)
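For a self-contained sketch (the DataFrame and the age values here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('Tom', 2), ('Ann', 5)], ['name', 'age'])

df.filter(df.age >= 3).show()  # keeps only Ann
df.where(df.age >= 3).show()   # where is an alias for filter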
Manrique

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
# build a small DataFrame with an explicit row-number column
schema1 = StructType([StructField('rownumber', IntegerType(), True), StructField('name', StringType(), True)])
data1 = [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e')]
df1 = spark.createDataFrame(data1, schema1)
df1.show()
+---------+----+
|rownumber|name|
+---------+----+
|        1|   a|
|        2|   b|
|        3|   c|
|        4|   d|
|        5|   e|
+---------+----+
# between() is inclusive on both ends, so this keeps rows 2 through 4
df1.filter(F.col("rownumber").between(2, 4)).show()
+---------+----+
|rownumber|name|
+---------+----+
|        2|   b|
|        3|   c|
|        4|   d|
+---------+----+
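To drop that range instead of keeping it (which is what the comments ask for), negate the condition with ~:

df1.filter(~F.col("rownumber").between(2, 4)).show()
+---------+----+
|rownumber|name|
+---------+----+
|        1|   a|
|        5|   e|
+---------+----+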
Prathik Kini