
I have a complex data frame with 10,999 rows.

I am trying to run xgboost for machine learning.

I load in the data and attempt to split it as shown in tutorials and in solutions posted on Stack Overflow, such as: How do I create test and train samples from one dataframe with pandas?

X_train, X_test = train_test_split(df, test_size=0.2)

but this fails:

TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

But this doesn't make sense to me: how can I possibly put a dataframe into an array without losing lots of valuable information?

So I was advised to convert to pandas first:

pandasDF = df.toPandas
X_train, X_test = train_test_split(pandasDF, test_size=0.2)

but this also fails:

TypeError: Singleton array array(<bound method PandasConversionMixin.toPandas of DataFrame

How can I split this dataframe into training and test sets?

1 Answer


Call toPandas() as a method (note the parentheses), rather than referencing it as an attribute:

pandasDF = df.toPandas()

If the conversion is slow, set this configuration before converting:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
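
A minimal end-to-end sketch of this approach, assuming a SparkSession named spark and that df is small enough to collect onto the driver (on Spark 3.0+ the Arrow setting is named spark.sql.execution.arrow.pyspark.enabled instead):

from sklearn.model_selection import train_test_split

# Optional: speed up the Spark-to-pandas conversion with Arrow
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# toPandas() is a method, so it must be called with parentheses;
# this collects the whole Spark DataFrame into a pandas DataFrame on the driver
pandasDF = df.toPandas()

# A pandas DataFrame is array-like, so scikit-learn's splitter accepts it
X_train, X_test = train_test_split(pandasDF, test_size=0.2)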
    the key point was that I screwed up, and used the attribute instead of the method – con Feb 17 '21 at 03:50