It would indeed be good if you could provide an example and the expected output. It is not clear why you use countDistinct if you want to check for occurrences of a value; for that you should rather use count in a groupBy statement.
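As a rough sketch of that difference (the DataFrame df here is only a placeholder with the same id and time columns as in the example below):
from pyspark.sql import functions as F

# count: how many rows share each time value, i.e. the number of occurrences of that value
df.groupBy("time").count().show()

# countDistinct: how many different time values exist per id, i.e. unique values rather than occurrences
df.groupBy("id").agg(F.countDistinct("time")).show()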
The following snippet might nevertheless help you:
import datetime  # needed for the example timestamps; an active SparkSession named spark is assumed

df_new = spark.createDataFrame([
    (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)), (1, datetime.datetime(2018, 9, 1, 12)),
    (1, datetime.datetime(2018, 9, 2, 13)), (1, datetime.datetime(2018, 9, 2, 13)), (1, datetime.datetime(2018, 9, 2, 13)), (2, datetime.datetime(2018, 9, 1, 13)), (2, datetime.datetime(2018, 9, 1, 13)), (2, datetime.datetime(2018, 9, 1, 13))
], ("id", "time"))

# count how often each timestamp occurs
occurences_df = df_new.groupBy("time").count().withColumnRenamed("time", "count_time")

# attach the occurrence count to every row via a left join on the timestamp
df_new.join(occurences_df, df_new["time"] == occurences_df["count_time"], how="left").show()
Output:
+---+-------------------+-------------------+-----+
| id| time| count_time|count|
+---+-------------------+-------------------+-----+
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 1|2018-09-01 12:00:00|2018-09-01 12:00:00| 4|
| 2|2018-09-01 13:00:00|2018-09-01 13:00:00| 3|
| 2|2018-09-01 13:00:00|2018-09-01 13:00:00| 3|
| 2|2018-09-01 13:00:00|2018-09-01 13:00:00| 3|
| 1|2018-09-02 13:00:00|2018-09-02 13:00:00| 3|
| 1|2018-09-02 13:00:00|2018-09-02 13:00:00| 3|
| 1|2018-09-02 13:00:00|2018-09-02 13:00:00| 3|
+---+-------------------+-------------------+-----+
Then you can filter on the count column to keep only the rows with the desired number of occurrences.
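For example, a minimal sketch of such a filter (the threshold of 3 is just a placeholder):
from pyspark.sql import functions as F

joined_df = df_new.join(occurences_df, df_new["time"] == occurences_df["count_time"], how="left")
# keep only rows whose timestamp occurs at least 3 times
joined_df.filter(F.col("count") >= 3).show()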