You can do this with a couple of groupBy statements. To start with, you have a dataframe like this:
+-------------------+---------+---------------+
| date| flat| list|
+-------------------+---------+---------------+
|2014-01-01 00:00:00| A;A;B| [A, A, B]|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]|
|2014-01-03 00:00:00| H;X| [H, X]|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]|
|2014-01-05 00:00:00| S;T;U| [S, T, U]|
|2014-01-06 00:00:00| G;C;G| [G, C, G]|
+-------------------+---------+---------------+
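For completeness, a frame like this can be reproduced as follows (a minimal sketch: the spark session is assumed to exist, and splitting flat on ';' to derive list is an assumption about how the data was prepared):

from pyspark.sql import functions as F

new_frame = (
    spark.createDataFrame(
        [('2014-01-01 00:00:00', 'A;A;B'),
         ('2014-01-02 00:00:00', 'D;P;E;P;P'),
         ('2014-01-03 00:00:00', 'H;X'),
         ('2014-01-04 00:00:00', 'P;Q;G'),
         ('2014-01-05 00:00:00', 'S;T;U'),
         ('2014-01-06 00:00:00', 'G;C;G')],
        ['date', 'flat'])
    # cast the string to a real timestamp and split flat into an array
    .withColumn('date', F.col('date').cast('timestamp'))
    .withColumn('list', F.split('flat', ';'))
)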
Explode the list column using F.explode, like this:
new_frame_exp = new_frame.withColumn("exp", F.explode('list'))
Then your dataframe will look like this:
+-------------------+---------+---------------+---+
| date| flat| list|exp|
+-------------------+---------+---------------+---+
|2014-01-01 00:00:00| A;A;B| [A, A, B]| A|
|2014-01-01 00:00:00| A;A;B| [A, A, B]| A|
|2014-01-01 00:00:00| A;A;B| [A, A, B]| B|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| D|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| P|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| E|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| P|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| P|
|2014-01-03 00:00:00| H;X| [H, X]| H|
|2014-01-03 00:00:00| H;X| [H, X]| X|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| P|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| Q|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| G|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| S|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| T|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| U|
|2014-01-06 00:00:00| G;C;G| [G, C, G]| G|
|2014-01-06 00:00:00| G;C;G| [G, C, G]| C|
|2014-01-06 00:00:00| G;C;G| [G, C, G]| G|
+-------------------+---------+---------------+---+
On this dataframe, do a groupBy like this:
new_frame_exp_agg = new_frame_exp.groupBy('date', 'flat', 'list', 'exp').count()
Then you will have a dataframe like this:
+-------------------+---------+---------------+---+-----+
| date| flat| list|exp|count|
+-------------------+---------+---------------+---+-----+
|2014-01-03 00:00:00| H;X| [H, X]| H| 1|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| G| 1|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| U| 1|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| T| 1|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| P| 1|
|2014-01-03 00:00:00| H;X| [H, X]| X| 1|
|2014-01-06 00:00:00| G;C;G| [G, C, G]| G| 2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| E| 1|
|2014-01-06 00:00:00| G;C;G| [G, C, G]| C| 1|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| S| 1|
|2014-01-01 00:00:00| A;A;B| [A, A, B]| B| 1|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| D| 1|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| Q| 1|
|2014-01-01 00:00:00| A;A;B| [A, A, B]| A| 2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| P| 3|
+-------------------+---------+---------------+---+-----+
On this dataframe, apply one more level of aggregation to collect the counts into a list and find the max, like this:
res = new_frame_exp_agg.groupBy('date', 'flat', 'list').agg(
    F.collect_list('count').alias('occurances'),
    F.max('count').alias('max'))

res.orderBy('date').show()
+-------------------+---------+---------------+----------+---+
| date| flat| list|occurances|max|
+-------------------+---------+---------------+----------+---+
|2014-01-01 00:00:00| A;A;B| [A, A, B]| [2, 1]| 2|
|2014-01-02 00:00:00|D;P;E;P;P|[D, P, E, P, P]| [1, 1, 3]| 3|
|2014-01-03 00:00:00| H;X| [H, X]| [1, 1]| 1|
|2014-01-04 00:00:00| P;Q;G| [P, Q, G]| [1, 1, 1]| 1|
|2014-01-05 00:00:00| S;T;U| [S, T, U]| [1, 1, 1]| 1|
|2014-01-06 00:00:00| G;C;G| [G, C, G]| [1, 2]| 2|
+-------------------+---------+---------------+----------+---+
If you want the occurances column sorted, you can use F.array_sort over the column if you are on Spark 2.4+; otherwise you have to write a udf for that. Both options are sketched below.
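The names sort_counts and res_sorted are illustrative; ArrayType(LongType()) matches the bigint values that count() produces:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType

# Spark 2.4+: sort the collected counts natively
res_sorted = res.withColumn('occurances', F.array_sort('occurances'))

# Pre-2.4 fallback: a small udf that sorts the array in Python
sort_counts = F.udf(lambda xs: sorted(xs), ArrayType(LongType()))
res_sorted = res.withColumn('occurances', sort_counts('occurances'))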