
I want to do a train/test split on a sorted PySpark dataframe based on time. Say the first 300 rows go in the train set and the next 200 rows in the test set.

I can select the first 300 rows with -

train = df.limit(300)

but how can I select the last 200 rows from Pyspark dataframe?

pault
Aritra Sen
  • Related posts: [How to slice a pyspark dataframe in two row-wise](https://stackoverflow.com/questions/48884960/how-to-slice-a-pyspark-dataframe-in-two-row-wise/48888185#48888185), [Split Time Series pySpark data frame into test & train without using random split](https://stackoverflow.com/questions/51772908/split-time-series-pyspark-data-frame-into-test-train-without-using-random-spli/51773836#51773836), and [Is there a way to slice dataframe based on index in pyspark?](https://stackoverflow.com/questions/52792762/is-there-a-way-to-slice-dataframe-based-on-index-in-pyspark/52819758#52819758) – pault Mar 13 '19 at 15:31

1 Answer


Let's say that you have a dataframe df with 500 rows, sorted by the time column.

A simple way to go about it is to use limit for the training set, and to do the same on the reversed dataframe for the test set.

from pyspark.sql.functions import desc

# First 300 rows (earliest times) form the training set
train = df.limit(300)
# Sort descending, keep the 200 latest rows, then restore ascending time order
test = df.orderBy(desc("time")).limit(200).orderBy("time")
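
As a quick sanity check (a sketch using the hypothetical 500-row example above, and assuming time is a timestamp or numeric column), you can verify the split sizes and that the two halves don't overlap in time:

from pyspark.sql.functions import max as spark_max, min as spark_min

print(train.count())  # 300
print(test.count())   # 200

# The latest training timestamp should not exceed the earliest test timestamp
train_max = train.agg(spark_max("time")).first()[0]
test_min = test.agg(spark_min("time")).first()[0]
print(train_max <= test_min)  # True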
Oli
  • What if we don't know the size of the dataframe and don't want to use count or window functions, since a count or a window function without partitions would degrade performance? – drp Jun 17 '20 at 20:56
  • Compared to what? – Oli Aug 31 '20 at 18:03
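
For the scenario raised in the comments, where the row count isn't known up front, one workaround (a sketch, not part of the original answer) is to derive the split sizes from a fraction. It still pays for a single count() job, but that is typically cheaper than row_number() over a global window, which pulls all rows into one partition:

from pyspark.sql.functions import desc

# One full pass to get the size; avoids a window over an unpartitioned frame
n = df.count()
train_n = int(n * 0.6)  # hypothetical 60/40 split

train = df.limit(train_n)
test = df.orderBy(desc("time")).limit(n - train_n).orderBy("time")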