I am trying to create a training/testing split of a DataFrame using PySpark:
from pyspark.sql import SparkSession
from sklearn.model_selection import train_test_split
spark = SparkSession.builder.appName("My App").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()
rev = spark.sql("select * from somedb.sometable")
data = rev.select('col1', 'col2').dropDuplicates().dropna(how='any')
train, test = train_test_split(data, random_state=42, test_size=0.33, shuffle=True)
but I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "some/path/local/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 2183, in train_test_split
arrays = indexable(*arrays)
File "some/path/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 260, in indexable
check_consistent_length(*result)
File "some/path/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 231, in check_consistent_length
lengths = [_num_samples(X) for X in arrays if X is not None]
File "some/path/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 138, in _num_samples
type(x))
TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>
I know that I can use .toPandas(), but that collects all the rows to the driver, which causes my Spark job to get stuck and eventually run out of memory. How can I use a pyspark.sql.dataframe.DataFrame with train_test_split?