
I have a dataframe with 10,609 rows, and I want to convert 100 rows at a time to JSON and send them to a webservice.

I have tried using the LIMIT clause of SQL, like this:

temptable = spark.sql("select item_code_1 from join_table limit 100")

This returns the first 100 rows, but when I tried the following to get the next 100 rows, it did not work:

temptable = spark.sql("select item_code_1 from join_table limit 100, 200")

Error: Py4JJavaError: An error occurred while calling o22.sql. : org.apache.spark.sql.catalyst.parser.ParseException: mismatched input ',' expecting (line 1, pos 44)

== SQL ==
select item_code_1 from join_table limit 100, 200
    Possible duplicate of [Is there a way to slice dataframe based on index in pyspark?](https://stackoverflow.com/questions/52792762/is-there-a-way-to-slice-dataframe-based-on-index-in-pyspark) – pault Apr 15 '19 at 14:23

1 Answer


You have to create a row-number column that assigns a sequential number to each row, and then use that column to fetch ranges of rows with a filter.

# Build a five-row sample DataFrame with a single string column
df = spark.createDataFrame([('a',),
                            ('b',),
                            ('c',),
                            ('d',),
                            ('e',)
                            ],'item : string')
df.show()

#+----+
#|item|
#+----+
#|   a|
#|   b|
#|   c|
#|   d|
#|   e|
#+----+

I am using a dummy static column lit('a') to generate row_num. Please update the logic below (which generates row_num) based on your actual data:

partitionBy(lit('a')).orderBy(lit('a'))
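
If your actual data has a natural key to order by, use that instead of the dummy literal. As a sketch, under the assumption that no such column exists, you can manufacture an ordering column with monotonically_increasing_id():

from pyspark.sql.functions import col, lit, monotonically_increasing_id
from pyspark.sql.window import Window

# monotonically_increasing_id() produces increasing (not consecutive) ids,
# which is enough to give the window a stable ordering.
df_with_id = df.withColumn("_id", monotonically_increasing_id())

# Note: a single dummy partition funnels every row through one task.
# That is fine for ~10k rows but will not scale to very large data.
w = Window().partitionBy(lit('a')).orderBy(col("_id"))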

DataFrame example:

from pyspark.sql.functions import lit,row_number,col
from pyspark.sql.window import Window

w = Window().partitionBy(lit('a')).orderBy(lit('a'))

# Assign a sequential 1-based row number to every row
df1 = df.withColumn("row_num", row_number().over(w))

df1.filter(col("row_num").between(1,2)).show()     

#+----+-------+
#|item|row_num|
#+----+-------+
#|   a|      1|
#|   b|      2|
#+----+-------+

df1.filter(col("row_num").between(3,4)).show()

#+----+-------+
#|item|row_num|
#+----+-------+
#|   c|      3|
#|   d|      4|
#+----+-------+
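
Tying this back to the original goal, here is a minimal sketch of the 100-row batching loop. The webservice call is a placeholder (the endpoint URL and payload format are assumptions; adapt them to your API):

from pyspark.sql.functions import col

batch_size = 100
total = df1.count()

for start in range(1, total + 1, batch_size):
    end = start + batch_size - 1
    # Collect this slice to the driver as a list of JSON strings.
    payload = df1.filter(col("row_num").between(start, end)).toJSON().collect()
    # Placeholder for the actual call, e.g. with the requests library:
    # requests.post("https://example.com/endpoint", json=payload)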

Spark SQL example:

df1.createOrReplaceTempView("dfTable")

spark.sql("SELECT * FROM dfTable WHERE row_num between 1 and 2").show()

#+----+-------+
#|item|row_num|
#+----+-------+
#|   a|      1|
#|   b|      2|
#+----+-------+
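
The same batching can be driven from SQL by substituting the range bounds into the query (a sketch; the f-string interpolation assumes the bounds are trusted integers computed in the driver, not user input):

batch_size = 100
total = spark.sql("SELECT COUNT(*) AS c FROM dfTable").first()["c"]

for start in range(1, total + 1, batch_size):
    end = start + batch_size - 1
    batch = spark.sql(f"SELECT * FROM dfTable WHERE row_num BETWEEN {start} AND {end}")
    # batch.toJSON().collect() gives the JSON strings for this slice,
    # ready to hand off to the webservice as above.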