
I have two dataframes in Scala:

df1 =

ID  Field1
1   AAA
2   BBB
4   CCC

and

df2 =

PK  start_date_time
1   2016-10-11 11:55:23
2   2016-10-12 12:25:00
3   2016-10-12 16:20:00

I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.

I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.

The result should be this one:

df1 =

ID  Field1  check
1   AAA     1
2   BBB     0
4   CCC     0

In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case that won't work. My initial idea was to use a udf, but I'm not sure how to make it work for this case.


1 Answer


You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use the when.otherwise syntax to set the check column:

import org.apache.spark.sql.functions.{lit, to_date, when}
import sqlContext.implicits._  // provides the $"..." column syntax (use spark.implicits._ in Spark 2.x)

val df2_date = df2.withColumn("date", to_date(df2("start_date_time")))
                  .withColumn("check", lit(1))
                  .select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left")
   .withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0))
   .drop("date")
   .show

+---+------+-----+
| ID|Field1|check|
+---+------+-----+
|  1|   AAA|    1|
|  2|   BBB|    0|
|  4|   CCC|    0|
+---+------+-----+
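
If you don't want to hardcode the date, you can pass the start_date variable from the question instead; a minimal variant, assuming start_date is a plain yyyy-MM-dd string (which Spark compares against the date column directly, as in the hardcoded version above):

val start_date = "2016-10-11"  // the yyyy-MM-dd string from the question

df1.join(df2_date, Seq("ID"), "left")
   .withColumn("check", when($"date" === lit(start_date), $"check").otherwise(0))
   .drop("date")
   .show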

Alternatively, you can first filter df2 and then join it back with df1 on the ID column:

val df2_date = df2.withColumn("date", to_date(df2("start_date_time")))
                  .filter($"date" === "2016-10-11")
                  .withColumn("check", lit(1))
                  .select($"PK".as("ID"), $"date", $"check")

df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show

+---+------+-----+
| ID|Field1|check|
+---+------+-----+
|  1|   AAA|    1|
|  2|   BBB|    0|
|  4|   CCC|    0|
+---+------+-----+
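
Here na.fill(0) replaces the nulls that the left join produces for IDs with no match in df2_date. If df1 had other nullable columns you didn't want touched, a small sketch using the column-scoped overload of na.fill would restrict the fill to check:

df1.join(df2_date, Seq("ID"), "left")
   .drop("date")
   .na.fill(0, Seq("check"))  // only replace nulls in the check column
   .show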

In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:

val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
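
The resulting java.sql.Date can then be used directly in the column comparison; a sketch, assuming the unfiltered df2_date from the first option above:

df1.join(df2_date, Seq("ID"), "left")
   .withColumn("check", when($"date" === lit(date), $"check").otherwise(0))  // lit accepts a java.sql.Date
   .drop("date")
   .show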
  • Does `to_date` automatically convert the values in `start_date_time` to `yyyy-MM-dd`? What if I have `yyyy-MMM-dd` in `start_date`? How will `start_date_time` then be compared to `start_date`? – user7379562 Jan 17 '17 at 17:57
  • In that case, you should be able to use `start_date_time` directly without converting it using `to_date`. – Psidom Jan 17 '17 at 17:58
  • I mean, let's say `start_date_time` is equal to `2016-10-11 11:55:23` and it should be compared to `start_date` equal to `2016-OCT-11` (`yyyy-MMM-dd`). So, where do I define the formatting of `start_date_time`? – user7379562 Jan 17 '17 at 18:00
  • I cannot compile the `$` symbol. Which library should I import? I am using Spark 1.6.2. – user7379562 Jan 17 '17 at 18:30
  • `$` is Spark column syntax, not from another library. Try `import org.apache.spark.sql.functions.lit`. – Psidom Jan 17 '17 at 18:35
  • It still says `error: value $ is not a member of StringContext` – user7379562 Jan 17 '17 at 18:37
  • what spark version are you using? I am not getting error message with spark 2.0.2 – Psidom Jan 17 '17 at 18:40
  • I am using Spark 1.6.2 and I cannot switch to Spark 2 as it's the production environment. – user7379562 Jan 17 '17 at 18:42
  • You can try splitting the chained transformation commands above and using the `dfxx("xxx")` syntax to replace `$"xxx"`, where `dfxx` is the dataframe name and `xxx` is the column name. Not sure if that will solve the problem though. Also see [here](http://stackoverflow.com/questions/30445476/spark-sbt-package-value-is-not-a-member-of-stringcontext-missing-sca) – Psidom Jan 17 '17 at 18:44
  • Try `import sqlContext.implicits._` – Psidom Jan 17 '17 at 18:49
  • `import sqlContext.implicits._` solved the problem. Thanks. – user7379562 Jan 17 '17 at 18:51
  • I published a very similar question http://stackoverflow.com/questions/41708265/join-two-dataframes-by-id If you are interested, you may take a look. – user7379562 Jan 17 '17 at 22:47