
I have a spark dataframe which is like

id  start_time   feature
1   01-01-2018   3.567
1   01-02-2018   4.454
1   01-03-2018   6.455
2   01-02-2018   343.4
2   01-08-2018   45.4
3   02-04-2018   43.56
3   02-07-2018   34.56
3   03-07-2018   23.6

I want to be able to split this into two dataframes based on the id column. So I should group by the id column, sort by start_time within each group, and take the first 70% of the rows into one dataframe and the remaining 30% into another dataframe, preserving the order. The result should look like:

Dataframe1:
id  start_time   feature
1   01-01-2018   3.567
1   01-02-2018   4.454
2   01-02-2018   343.4
3   02-04-2018   43.56
3   02-07-2018   34.56

Dataframe2:
id  start_time   feature
1   01-03-2018   6.455
2   01-08-2018   45.4
3   03-07-2018   23.6

I am using Spark 2.0 with Python. What is the best way to implement this?
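
For reference, the sample dataframe above can be built with something like the following (a minimal sketch; the schema is assumed from the table, with start_time kept as a plain string, which happens to sort correctly for this particular example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the tables above; start_time stays a string here
df = spark.createDataFrame(
    [(1, "01-01-2018", 3.567),
     (1, "01-02-2018", 4.454),
     (1, "01-03-2018", 6.455),
     (2, "01-02-2018", 343.4),
     (2, "01-08-2018", 45.4),
     (3, "02-04-2018", 43.56),
     (3, "02-07-2018", 34.56),
     (3, "03-07-2018", 23.6)],
    ["id", "start_time", "feature"])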

Gayatri
  • Possible duplicate of [Stratified sampling in Spark](https://stackoverflow.com/questions/32238727/stratified-sampling-in-spark) and [Stratified sampling with pyspark](https://stackoverflow.com/q/47637760/10465355) – 10465355 Oct 23 '18 at 21:45
  • I have checked that answer, but it does not address how to preserve the order while doing the split – Gayatri Oct 24 '18 at 01:43
  • From the example you have posted, it looks like you are taking the latest date in each group to be part of one of the split dfs; is this a requirement? If not, I would say just sort by id, then start_time, then take random splits – sramalingam24 Oct 24 '18 at 06:20

1 Answer


The way I had to do it was to create two windows:

from pyspark.sql import Window
from pyspark.sql import functions as F

# w1 orders the rows within each id; w2 is used to count rows per id
w1 = Window.partitionBy(df.id).orderBy(df.start_time)
w2 = Window.partitionBy(df.id)

# percent = position of the row within its id group, as a fraction of the group size
df = df.withColumn("row_number", F.row_number().over(w1)) \
       .withColumn("count", F.count("id").over(w2)) \
       .withColumn("percent", F.col("row_number") / F.col("count"))

train = df.filter(df.percent <= 0.70)
test = df.filter(df.percent > 0.70)
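
With the sample data from the question, this yields the split shown above (2/1 rows for ids 1 and 3, 1/1 for id 2). A quick check:

train.select("id", "start_time", "feature").orderBy("id", "start_time").show()
test.select("id", "start_time", "feature").orderBy("id", "start_time").show()

Note that F.percent_rank().over(w1) could stand in for the second window, but it computes (rank - 1) / (rows - 1), so its boundary values differ slightly from the row_number/count ratio used here.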
Gayatri