
I am trying to split my data, a Koalas dataframe, into train and test sets. However, when I run the code below, I get this error:

AttributeError: 'DataFrame' object has no attribute 'randomSplit'

Please find below the code I am using:

splits = Closed_new.randomSplit([0.7,0.3])

I also tried the usual approach of converting the Koalas dataframe to pandas and splitting it with scikit-learn, but that takes a long time to run in Synapse. Here is the code:

from sklearn.model_selection import train_test_split

state = 12
test_size = 0.30

X_train, X_val, y_train, y_val = train_test_split(
    Closed_new, labels, test_size=test_size, random_state=state)
Ric S
  • Koalas was merged into pyspark.pandas and won't be continued as a separate project. pyspark.pandas does have randomSplit https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.randomSplit.html – David דודו Markovitz Mar 17 '22 at 12:45

1 Answer


I'm afraid that, at the time of this question, PySpark's randomSplit does not yet have an equivalent in Koalas.

One trick is to convert the Koalas dataframe to a Spark dataframe, call randomSplit on it, and convert the two resulting subsets back to Koalas.

splits = Closed_new.to_spark().randomSplit([0.7, 0.3], seed=12)
df_train = splits[0].to_koalas()
df_test = splits[1].to_koalas()
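
Note that a split like this is approximate: with weights [0.7, 0.3] you get roughly, not exactly, 70% and 30% of the rows, and the seed only makes the result repeatable. To illustrate the idea, here is a minimal standard-library sketch of a weighted random split. It is a conceptual illustration only, not Spark's actual implementation, and the function name random_split and the sample data are made up for this example.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed=None):
    """Put each row into one bucket, chosen with probability proportional to its weight.

    Hypothetical helper for illustration; not part of Spark or Koalas.
    """
    total = sum(weights)
    # Cumulative upper bounds of the buckets on [0, 1], e.g. [0.7, 1.0].
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0.0, 1.0)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # stand-in for the rows of a dataframe
train, test = random_split(rows, [0.7, 0.3], seed=12)
print(len(train), len(test))  # close to 700 and 300, but usually not exact
```

Because every row is assigned independently, the subset sizes vary around the requested proportions. If you need an exact 70/30 split you would have to shuffle and cut at a fixed position instead.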
Ric S