
Is it possible to split a DataFrame into training and testing sets by specifying the actual sizes I want instead of using a ratio? Most examples I see use randomSplit.

463715 samples for training

51630 samples for testing

In scikit-learn I was able to do this, for example:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 10000, random_state = 42)
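
There is no built-in PySpark equivalent of train_test_split that accepts absolute sizes, but one possible workaround is a sketch like the following: attach a seeded random column, number the rows in that shuffled order with row_number(), and cut at the exact counts. The DataFrame df here is just a placeholder built with spark.range().

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Placeholder DataFrame standing in for the real data to split.
df = spark.range(515345).toDF("id")

train_size = 463715   # desired number of training rows
test_size = 51630     # desired number of testing rows

# Shuffle deterministically by adding a seeded random column,
# then number the rows in that shuffled order.
shuffled = df.withColumn("_rand", F.rand(seed=42))
numbered = shuffled.withColumn(
    "_row", F.row_number().over(Window.orderBy("_rand"))
)

train_df = numbered.filter(F.col("_row") <= train_size).drop("_rand", "_row")
test_df = (numbered
           .filter((F.col("_row") > train_size) &
                   (F.col("_row") <= train_size + test_size))
           .drop("_rand", "_row"))

print(train_df.count(), test_df.count())  # 463715 51630
```

One caveat: a window with no partitionBy pulls every row into a single partition to assign the row numbers, so this can be slow or memory-heavy on large DataFrames.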
    https://stackoverflow.com/a/48887975/1938025 Is this what you want to do? – UtkarshSahu Aug 16 '20 at 11:39
  • @UtkarshSahu Yes, this is what I'm looking for. Thank you so much :) – James Omnipotent Aug 16 '20 at 16:31
  • Does this answer your question? [How to slice a pyspark dataframe in two row-wise](https://stackoverflow.com/questions/48884960/how-to-slice-a-pyspark-dataframe-in-two-row-wise) – UtkarshSahu Aug 16 '20 at 18:15
  • @UtkarshSahu Yes, but I ran into another problem... when I was using randomSplit() to split my data into train and test sets for a machine-learning model, I was able to train on my dataset, but when I used your method it somehow made my Jupyter notebook slow / unresponsive while training. – James Omnipotent Aug 16 '20 at 18:17
  • @UtkarshSahu It shows this in the Jupyter shell: "Stage 13 contains a task of very large size (23138 KiB). The maximum recommended task size is 1000 KiB." – James Omnipotent Aug 16 '20 at 18:18

0 Answers