
I am trying to create a training and testing split using PySpark:

from pyspark.sql import SparkSession
from sklearn.model_selection import train_test_split

spark = SparkSession.builder.appName("My App").config("spark.sql.warehouse.dir", warehouse_location).enableHiveSupport().getOrCreate()

rev = spark.sql("select * from somedb.sometable")

data = rev.select('col1', 'col2').dropDuplicates().dropna(how='any')

train, test = train_test_split(data, random_state=42, test_size=0.33, shuffle=True)

but I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "some/path/local/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 2183, in train_test_split
    arrays = indexable(*arrays)
  File "some/path/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 260, in indexable
    check_consistent_length(*result)
  File "some/path/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 231, in check_consistent_length
    lengths = [_num_samples(X) for X in arrays if X is not None]
  File "some/path/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 138, in _num_samples
    type(x))
TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

I know that I can use `.toPandas()`, but that makes my Spark job get stuck, since `toPandas()` collects all the rows to the driver and this causes a memory overflow.

How can I use a `pyspark.sql.dataframe.DataFrame` in `train_test_split`?

AbtPst
  • You can't use `sklearn`'s `train_test_split` directly on a Spark DataFrame. Use [`pyspark.sql.DataFrame.randomSplit`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit) instead. – pault Feb 11 '20 at 19:10
  • Thanks, but will I be able to use the result for fitting and training sklearn models? – AbtPst Feb 11 '20 at 19:46
  • Not unless you collect the results (i.e. use `toPandas()`). If you're looking to train an sklearn model, don't use Spark. – pault Feb 11 '20 at 19:58
  • OK, I see. How can I create ML models using PySpark? Can I not use sklearn? – AbtPst Feb 11 '20 at 21:00
  • https://spark.apache.org/docs/latest/ml-guide.html – pault Feb 11 '20 at 21:43

0 Answers