
I have a complex data frame with 10,999 rows.

I am trying to run xgboost for machine learning.

I load in the data and attempt to split it as shown in tutorials and in solutions posted on Stack Overflow, such as: How do I create test and train samples from one dataframe with pandas?

X_train, X_test = train_test_split(df, test_size=0.2)

but this fails:

TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

But this doesn't make sense to me: how can I possibly put a dataframe into an array without losing lots of valuable information?

So I was advised to convert to pandas first:

pandasDF = df.toPandas
X_train, X_test = train_test_split(pandasDF, test_size=0.2)

but this also fails:

TypeError: Singleton array array(<bound method PandasConversionMixin.toPandas of DataFrame

How can I split this dataframe into training and test sets?

1 Answer


Call toPandas() as a method (note the parentheses), rather than referencing it as an attribute:

pandasDF = df.toPandas()

If the conversion is slow, set this configuration before converting:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
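
A minimal end-to-end sketch of this approach, assuming a SparkSession named spark and that df is small enough to collect onto the driver (on Spark 3.0+ the Arrow setting is named spark.sql.execution.arrow.pyspark.enabled instead):

from sklearn.model_selection import train_test_split

# Optional: speed up the Spark-to-pandas conversion with Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# toPandas() is a method, so it must be called with parentheses;
# this collects the whole Spark DataFrame into a pandas DataFrame on the driver
pandasDF = df.toPandas()

# A pandas DataFrame is array-like, so scikit-learn's splitter accepts it
X_train, X_test = train_test_split(pandasDF, test_size=0.2)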
    the key point was that I screwed up, and used the attribute instead of the method – con Feb 17 '21 at 03:50