
I have a DataFrame like the one below, with three columns:

+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|     Non Hf|24-SEP-2017|
|  1|     Non Hf|23-SEP-2017|
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|     Non Hf|24-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+

I want to group this DataFrame on id, sort each group on the in_date column, and keep only the rows from the first occurrence of Hf onwards. The output should look like below: the first 2 rows are dropped for id = 1 and the first row is dropped for id = 2.

+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+

How can I achieve this in Spark with Scala?
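
For reference, a minimal sketch that builds this sample DataFrame (assuming the dates are plain strings and a SparkSession named spark is in scope):

    import spark.implicits._

    // Hypothetical reconstruction of the sample data shown above
    val df = Seq(
      (1, "Non Hf", "24-SEP-2017"),
      (1, "Non Hf", "23-SEP-2017"),
      (1, "Hf",     "27-SEP-2017"),
      (1, "Non Hf", "28-SEP-2017"),
      (2, "Non Hf", "24-SEP-2017"),
      (2, "Hf",     "25-SEP-2017")
    ).toDF("id", "visit_class", "in_date")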


1 Answer


Steps:

1) Create the WindowSpec, partitioned by id and ordered by the parsed date.

2) Create a cumulative sum as an indicator of whether Hf has appeared, and then filter based on that condition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Window per id, ordered by the parsed date
val w = Window.partitionBy("id").orderBy(to_date($"in_date", "dd-MMM-yyyy"))

// Running count of Hf rows per id; rows before the first Hf have rn == 0
(df.withColumn("rn", sum(when($"visit_class" === "Hf", 1).otherwise(0)).over(w))
   .filter($"rn" >= 1).drop("rn").show)

+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+    
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+

Note: to_date with a format argument is new in Spark 2.2.0.

If you are using Spark < 2.2.0, you can use unix_timestamp in place of to_date:

val w = Window.partitionBy("id").orderBy(unix_timestamp($"in_date", "dd-MMM-yyyy"))
– Psidom
  • After applying the above logic I am getting the below error: `org.apache.spark.sql.AnalysisException: Could not resolve window function 'sum'. Note that, using window functions currently requires a HiveContext;` Can you please also help me here? – Divas Nikhra Sep 16 '17 at 10:47
  • 1
  • Maybe you are using Spark 1.x, in which case you need `HiveContext` for the [window function `sum`](https://stackoverflow.com/questions/40319126/spark-window-functions-requires-hivecontext). – Psidom Sep 17 '17 at 13:09
  • Can we do this without a window function? Because I am able to run this in spark-shell, but when I create a jar in Eclipse it shows a "table not found" error. – Divas Nikhra Sep 27 '17 at 08:02
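
On the HiveContext error in the comments above: in Spark 1.x, window functions are only available through a HiveContext. A minimal sketch of that setup, assuming an existing SparkContext named sc:

    // Spark 1.x only: window functions require a HiveContext, not a plain SQLContext
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc) // sc is the existing SparkContext
    import sqlContext.implicits._        // brings the $"column" syntax into scope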
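
On the last comment, one possible way to avoid window functions entirely (a sketch, not from the original answer): compute each id's first Hf date with a groupBy, join it back, and filter. The first_hf_date column name is made up for illustration:

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // First Hf date per id (only ids with at least one Hf row survive)
    val firstHf = df.filter($"visit_class" === "Hf")
      .groupBy("id")
      .agg(min(to_date($"in_date", "dd-MMM-yyyy")).as("first_hf_date"))

    // Keep rows on or after that date; the inner join drops ids with no Hf at all
    df.join(firstHf, Seq("id"))
      .filter(to_date($"in_date", "dd-MMM-yyyy") >= $"first_hf_date")
      .drop("first_hf_date")
      .show()

As with the answer above, swap in unix_timestamp for to_date on Spark < 2.2.0.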