I have a data set with 233,465 rows, growing by roughly 10,000 rows daily. I need to randomly select rows from the full data set for use in ML training, so I added an "id" column to serve as an index:
from pyspark.sql.functions import monotonically_increasing_id
spark_df = n_data.withColumn("id", monotonically_increasing_id())
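In case it helps with diagnosis, here is a quick sanity check (a minimal sketch, using nothing beyond the spark_df above): if the generated ids really formed a dense 0-based index, max(id) would equal count() - 1.

from pyspark.sql.functions import countDistinct
from pyspark.sql.functions import min as sql_min, max as sql_max

# If "id" were a dense 0-based index, max(id) would equal count() - 1
spark_df.select(sql_min("id"), sql_max("id"), countDistinct("id")).show()
print(spark_df.count())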
I then execute the following code, expecting 6 rows back, one for each id in the indices list, with a count of 6:
from pyspark.sql.functions import col

indices = [1000, 999, 45, 1001, 1823, 123476]
result = spark_df.filter(col("id").isin(indices))
result.show()
print(result.count())
Instead, I get only 3 rows back, for ids 45, 1000, and 1001; the values 999, 1823, and 123476 match nothing.
Any ideas on what might be wrong here? This seems like it should be pretty cut and dried.
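For what it's worth, one alternative I've seen suggested is building the index with RDD.zipWithIndex, which (as I understand it) does assign consecutive 0-based indices. A sketch of that approach, untested against my full data set:

from pyspark.sql.functions import col

# Sketch: append a consecutive 0-based index via zipWithIndex.
# Each Row is converted to a plain tuple so the index can be appended.
indexed_df = (
    n_data.rdd
    .zipWithIndex()
    .map(lambda pair: tuple(pair[0]) + (pair[1],))
    .toDF(n_data.columns + ["id"])
)
result = indexed_df.filter(col("id").isin(indices))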
Thanks!