
I've got a data set that currently has 233,465 rows and is growing by approximately 10,000 rows daily. I need to randomly select rows from the full data set for use in ML training. I've added an "id" column to serve as the "index".

from pyspark.sql.functions import monotonically_increasing_id
spark_df = n_data.withColumn("id", monotonically_increasing_id())

I execute the following code, expecting to see 5 rows returned whose ids match the "indices" list, with a count of 5.

from pyspark.sql.functions import col

indices = [1000, 999, 45, 1001, 1823, 123476]
result = spark_df.filter(col("id").isin(indices))
result.show()
print(result.count())

Instead, I get 3 rows: the ids 45, 1000 and 1001.

Any ideas on what might be wrong here? This seems pretty cut and dried.

Thanks!

David Crook
    [monotonically_increasing_id](http://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id) produces unique but not consecutive numbers. Rows with ids 123476, 1823 and 999 simply don't exist in your dataframe. – cronoik Aug 23 '19 at 18:18
    Check the comments of this [question](https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe) for examples to assign a sequential number. – cronoik Aug 23 '19 at 18:22
  • ah, lame. I guess that would explain the behavior I am getting. – David Crook Aug 23 '19 at 18:23
  • Any idea how to do a mydf.withColumn("id", spark.range(0, n_data.count(), 1))? I tried wrapping it in column, col, and .toDF() and got various errors across that gamut. – David Crook Aug 23 '19 at 18:30
  • That doesn't work because spark doesn't know how to assign the values. Just follow the comments (not the answers) of that [question](https://stackoverflow.com/questions/51200217/how-to-create-sequential-number-column-in-pyspark-dataframe). – cronoik Aug 23 '19 at 18:36
  • I ended up solving my problem differently using sample, which achieved my end goal. I'll keep this as an open question, because it would be cool to know how to do this anyway. – David Crook Aug 23 '19 at 19:12
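
For reference, a minimal sketch of the sample-based approach mentioned in the last comment, using the question's spark_df; the fraction and seed values are illustrative assumptions, not from the original post:

# Randomly sample roughly 5% of the rows without replacement for ML training.
# The fraction and seed are illustrative assumptions.
train_df = spark_df.sample(withReplacement=False, fraction=0.05, seed=42)
print(train_df.count())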

1 Answer


There's no direct function to assign unique, sequential IDs to each row, but there is a workaround using a window function.

df = spark.createDataFrame([(3,),(7,),(9,),(1,),(-3,),(5,)], ["values"])
df.show()

+------+
|values|
+------+
|     3|
|     7|
|     9|
|     1|
|    -3|
|     5|
+------+



from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Use monotonically_increasing_id only to fix an ordering, then number the
# rows consecutively with row_number over that ordering.
df = (df.withColumn('dummy', F.monotonically_increasing_id())
      .withColumn('ID', F.row_number().over(Window.orderBy('dummy')))
      .drop('dummy'))
df.show()

+------+---+
|values| ID|
+------+---+
|     3|  1|
|     7|  2|
|     9|  3|
|     1|  4|
|    -3|  5|
|     5|  6|
+------+---+
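
With a consecutive ID column like this, a filter such as the one in the question should behave as expected. A sketch applied to the question's spark_df (note that an un-partitioned window pulls all rows into a single partition, which can be slow on large data):

from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

# Replace the non-consecutive ids with consecutive ones (1, 2, 3, ...).
spark_df = (spark_df.withColumn('dummy', F.monotonically_increasing_id())
            .withColumn('id', F.row_number().over(Window.orderBy('dummy')))
            .drop('dummy'))

indices = [1000, 999, 45, 1001, 1823, 123476]
# All six ids should now be found, provided the row count exceeds 123,476.
print(spark_df.filter(col('id').isin(indices)).count())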

niuer