
I have two dataframes.

One has a range of dates, with every hour of the day assigned to each date:

+----------+----+
|      date|hour|
+----------+----+
|2020-12-20|   0|
|2020-12-20|   1|
|2020-12-20|   2|
|2020-12-20|   3|
|2020-12-20|   4|
|2020-12-20|   5|
|2020-12-20|   6|
|2020-12-20|   7|
|2020-12-20|   8|
|2020-12-20|   9|

The second one has users with dates and hours, but each user covers only a few days and a few hours, not all of them:

+----------------+----------+----+------+
|date            |   user_id|hour|   cnt|
+----------------+----------+----+------+
|      2020-12-20|1234567890|  18|    21|
|      2020-12-20|    123456|   7|     4|
|      2020-12-20|    123456|  11|     1|
|      2020-12-20|1234567890|  14|    19|

I need a dataframe with all dates and all hours assigned to every user. If there's no info for a user at a certain hour and/or day, I still want to show that user_id, but with cnt = 0. How can I achieve this? With a left join I get nulls.
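
For reference, a minimal reproducible setup for the two dataframes shown above (a sketch assuming string dates and integer columns, since the question doesn't include the construction code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Grid of dates and hours (one date shown; the real data has a range)
df1 = spark.createDataFrame(
    [('2020-12-20', h) for h in range(24)],
    ['date', 'hour'])

# Sparse per-user counts
df2 = spark.createDataFrame(
    [('2020-12-20', 1234567890, 18, 21),
     ('2020-12-20', 123456, 7, 4),
     ('2020-12-20', 123456, 11, 1),
     ('2020-12-20', 1234567890, 14, 19)],
    ['date', 'user_id', 'hour', 'cnt'])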

  • "_With a left join I get nulls._" Please show us the code you've tried. And add a [complete reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – blackbishop Jan 19 '21 at 19:57

1 Answer


You can do an inner join based on date only, then set cnt to 0 where the hours don't match. Because a user can have several hours on the same date, aggregate at the end so each (date, user_id, hour) combination appears exactly once:

import pyspark.sql.functions as F

# Attach every hour of each date to the user rows for that date, then
# zero out cnt wherever the user's hour differs from the grid hour.
result = (df1.join(df2.withColumnRenamed('hour', 'hour2'), 'date')
             .withColumn('cnt', F.when(F.col('hour') == F.col('hour2'),
                                       F.col('cnt')).otherwise(0))
             # a user with several hours on one date yields duplicate rows
             # per grid hour; keep the non-zero count where one exists
             .groupBy('date', 'user_id', 'hour')
             .agg(F.max('cnt').alias('cnt')))
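
Note that the date-only join can only produce rows for dates on which a user has at least one record. If a user can be missing whole dates as well, the left join the asker mentioned also works once the nulls are filled; a minimal sketch, assuming df1 and df2 as above:

# Full date x hour x user grid via an explicit cross join
users = df2.select('user_id').distinct()

# Left-join the observed counts onto the grid, then replace the
# resulting nulls with 0
result = (df1.crossJoin(users)
             .join(df2, ['date', 'hour', 'user_id'], 'left')
             .na.fill(0, subset=['cnt']))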