Use the following to get an index column of monotonically increasing, unique, and consecutive integers, which is not what monotonically_increasing_id() gives you. The indexes will ascend in the same order as colName
in your DataFrame.
import pyspark.sql.functions as F
from pyspark.sql.window import Window as W
window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
df = df\
.withColumn('int', F.lit(1))\
.withColumn('index', F.sum('int').over(window))\
.drop('int')
Use the following code to look at the tail, i.e. the last rownums rows,
of the DataFrame.
rownums = 10
df.where(F.col('index')>df.count()-rownums).show()
Use the following code to look at the rows strictly between start_row
and end_row
of the DataFrame.
start_row = 20
end_row = start_row + 10
df.where((F.col('index')>start_row) & (F.col('index')<end_row)).show()
zipWithIndex()
is an RDD method that does return monotonically increasing, unique, and consecutive integers, but it appears to be much slower once you account for the work of getting back to your original DataFrame amended with an id column.