Suppose I have the following DataFrame.

import pyspark.sql.functions as f
from pyspark.sql.window import Window

l = [(9, 1, 'A'),
     (9, 2, 'B'),
     (9, 3, 'C'),
     (9, 4, 'D'),
     (10, 1, 'A'),
     (10, 2, 'B')]
df = spark.createDataFrame(l, ['prod', 'rank', 'value'])
df.show()

+----+----+-----+
|prod|rank|value|
+----+----+-----+
|   9|   1|    A|
|   9|   2|    B|
|   9|   3|    C|
|   9|   4|    D|
|  10|   1|    A|
|  10|   2|    B|
+----+----+-----+

How can I create a new DataFrame with one row per prod, holding an array of that prod's value entries sorted by rank?

Desired Output:

l = [(9, ['A', 'B', 'C', 'D']),
     (10, ['A', 'B'])]

l = spark.createDataFrame(l, ['prod', 'conc'])

+----+------------+
|prod|        conc|
+----+------------+
|   9|[A, B, C, D]|
|  10|      [A, B]|
+----+------------+
pault
lolo
  • Possible duplicate of [collect\_list by preserving order based on another variable](https://stackoverflow.com/questions/46580253/collect-list-by-preserving-order-based-on-another-variable). – pault Dec 17 '18 at 21:08
  • Specifically look at [this answer](https://stackoverflow.com/a/50668635/5858851) for a non-udf solution. – pault Dec 17 '18 at 21:15

2 Answers

# reduceByKey does not guarantee element order, so carry rank along and sort per key
df = (df.rdd.map(lambda r: (r.prod, [(r.rank, r.value)]))
        .reduceByKey(lambda x, y: x + y)
        .mapValues(lambda pairs: [v for _, v in sorted(pairs)])
        .toDF(['prod', 'conc']))
df.show()
+----+------------+
|prod|        conc|
+----+------------+
|   9|[A, B, C, D]|
|  10|      [A, B]|
+----+------------+
cph_sto

Here's a quick solution based on what you've specified. Hope it helps.

# Ordered window: collect_list yields growing prefixes per prod; max() keeps the full list
w = Window.partitionBy('prod').orderBy('rank')
desiredDF = (df.withColumn('values_list', f.collect_list('value').over(w))
               .groupBy('prod')
               .agg(f.max('values_list').alias('conc')))
desiredDF.show()

+----+------------+
|prod|        conc|
+----+------------+
|   9|[A, B, C, D]|
|  10|      [A, B]|
+----+------------+
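
Why taking `max` over the windowed lists returns the complete list can be illustrated in plain Python, without Spark. This is only an analogy, under the assumption that Spark compares arrays element by element with a shorter prefix sorting first, which is how Python lists compare. The names `values_by_rank` and `running_lists` are made up for the illustration.

```python
from itertools import accumulate

# Values of prod 9, already ordered by rank.
values_by_rank = ['A', 'B', 'C', 'D']

# An ordered window with the default frame lets each row see everything from the
# start of its partition up to itself, so the collected lists are growing prefixes.
running_lists = list(accumulate(([v] for v in values_by_rank), lambda acc, v: acc + v))
for lst in running_lists:
    print(lst)

# Lists compare lexicographically and a proper prefix sorts before the longer list,
# so the maximum is the full, rank-ordered list.
conc = max(running_lists)
print(conc)
```

Every shorter list here is a prefix of the longest one, which is why the maximum is always the last, complete list.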
Abhinandan Dubey