I am trying to randomly select 100 rows from my PySpark Dataframe. For that I would like to use the code as described in this post:

training_data= data.orderBy(F.rand()).limit(100)

However I get the error:

AttributeError: 'function' object has no attribute 'rand'

I imported rand() the following way:

from pyspark.sql.functions import rand as F

I tried to import rand the same way as described in the post, but I get the error:

ModuleNotFoundError: No module named 'org'

I also tried to use the function just as such:

training_data= data.orderBy(rand()).limit(100)

But then I get the following name error:

NameError: name 'rand' is not defined

Does anyone know how to fix it? I am new to PySpark and I think I am missing something obvious here. Note that I am working on Databricks.

Thank you


1 Answer

OK, I actually managed to achieve what I wanted by doing the following:

training_data, test_data = data.randomSplit([0.7, 0.3], seed = 100)