
Is it possible to split a DataFrame into training and testing sets by specifying the actual sizes I want instead of using a ratio? Most examples I see use randomSplit.

463715 samples for training

51630 samples for testing

In scikit-learn I was able to do this, for example:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 10000, random_state = 42)
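
There is no built-in PySpark equivalent of train_test_split that accepts absolute sizes, but one possible workaround is a sketch like the following: attach a seeded random column, number the rows in that shuffled order with row_number(), and cut at the exact counts. The DataFrame df here is just a placeholder built with spark.range().

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Placeholder DataFrame standing in for the real data to split.
df = spark.range(515345).toDF("id")

train_size = 463715   # desired number of training rows
test_size = 51630     # desired number of testing rows

# Shuffle deterministically by adding a seeded random column,
# then number the rows in that shuffled order.
shuffled = df.withColumn("_rand", F.rand(seed=42))
numbered = shuffled.withColumn(
    "_row", F.row_number().over(Window.orderBy("_rand"))
)

train_df = numbered.filter(F.col("_row") <= train_size).drop("_rand", "_row")
test_df = (numbered
           .filter((F.col("_row") > train_size) &
                   (F.col("_row") <= train_size + test_size))
           .drop("_rand", "_row"))

print(train_df.count(), test_df.count())  # 463715 51630
```

One caveat: a window with no partitionBy pulls every row into a single partition to assign the row numbers, so this can be slow or memory-heavy on large DataFrames.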
    https://stackoverflow.com/a/48887975/1938025 Is this what you want to do? – UtkarshSahu Aug 16 '20 at 11:39
  • @UtkarshSahu Yes, this is what I'm looking for. Thank you so much :) – James Omnipotent Aug 16 '20 at 16:31
  • Does this answer your question? [How to slice a pyspark dataframe in two row-wise](https://stackoverflow.com/questions/48884960/how-to-slice-a-pyspark-dataframe-in-two-row-wise) – UtkarshSahu Aug 16 '20 at 18:15
  • @UtkarshSahu Yes, but I ran into another problem... when I was using randomSplit() to split my data into train and test sets for a machine-learning model, I was able to train on my dataset, but when I used your method it somehow made my Jupyter notebook slow / unresponsive while training. – James Omnipotent Aug 16 '20 at 18:17
  • @UtkarshSahu It shows this in the Jupyter shell: "Stage 13 contains a task of very large size (23138 KiB). The maximum recommended task size is 1000 KiB." – James Omnipotent Aug 16 '20 at 18:18

0 Answers