It would indeed be good if you could provide an example and the expected output. It is not clear why you use countDistinct if you want to check for occurrences of a value; for that you should rather use count in a groupBy statement.
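As a rough sketch of that difference (the DataFrame df here is only a placeholder with the same id and time columns as in the example below):
from pyspark.sql import functions as F

# count: how many rows share each time value, i.e. the number of occurrences of that value
df.groupBy("time").count().show()

# countDistinct: how many different time values exist per id, i.e. unique values rather than occurrences
df.groupBy("id").agg(F.countDistinct("time")).show()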
The following snippet might nevertheless help you:
import datetime  # needed for the example timestamps; an active SparkSession named spark is assumed

df_new = spark.createDataFrame([
    (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)),
    (1, datetime.datetime(2018, 9, 2, 13)), (1, datetime.datetime(2018, 9, 2, 13)), (1, datetime.datetime(2018, 9, 2, 13)), (2, datetime.datetime(2018, 9, 1, 13)), (2, datetime.datetime(2018, 9, 1, 13)), (2, datetime.datetime(2018, 9, 1, 13))
], ("id", "time"))

# count how often each timestamp occurs
occurences_df = df_new.groupBy("time").count().withColumnRenamed("time", "count_time")

# attach the occurrence count to every row via a left join on the timestamp
df_new.join(occurences_df, df_new["time"] == occurences_df["count_time"], how="left").show()
Output:
+---+-------------------+-------------------+-----+
| id| time| count_time|count|
+---+-------------------+-------------------+-----+
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 2|2018-09-01 13:00:00|2018-09-01 13:00:00| 3|
| 2|2018-09-01 13:00:00|2018-09-01 13:00:00| 3|
| 2|2018-09-01 13:00:00|2018-09-01 13:00:00| 3|
| 1|2018-09-02 13:00:00|2018-09-02 13:00:00| 3|
| 1|2018-09-02 13:00:00|2018-09-02 13:00:00| 3|
| 1|2018-09-02 13:00:00|2018-09-02 13:00:00| 3|
+---+-------------------+-------------------+-----+
Then you can filter on the count column to keep only the rows with the desired number of occurrences.
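For example, a minimal sketch of such a filter (the threshold of 3 is just a placeholder):
from pyspark.sql import functions as F

joined_df = df_new.join(occurences_df, df_new["time"] == occurences_df["count_time"], how="left")
# keep only rows whose timestamp occurs at least 3 times
joined_df.filter(F.col("count") >= 3).show()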