0

How can I preserve the order of column when using collect_list? I have a date column (col1) and order is not preserved when I call collect_list function on it. Here is my code for it with sample input and output.

Input DataFrame:


df = sqlContext.createDataFrame([('1', 201001,3400,1600,65,320,400,), ('1', 201002,5200,1600,65,320,400,), ('1', 201003,65,1550,32,320,400,), ('2', 201505,3200,1800,12,1,40,), ('2', 201508,3200,3200,12,1,40,), ('3', 201412,40,40,12,1,3,)], 
                                  ['ColA', 'Col1','Col2','Col3','Col4','Col5','Col6',])

+----+------+----+----+----+----+----+
|ColA|  Col1|Col2|Col3|Col4|Col5|Col6|
+----+------+----+----+----+----+----+
|   1|201001|3400|1600|  65| 320| 400|
|   1|201002|5200|1600|  65| 320| 400|
|   1|201003|  65|1550|  32| 320| 400|
|   2|201505|3200|1800|  12|   1|  40|
|   2|201508|3200|3200|  12|   1|  40|
|   3|201412|  40|  40|  12|   1|   3|
+----+------+----+----+----+----+----+

Expected Ouput:

df = sqlContext.createDataFrame([(1,['201001', '201002', '201003'],[3400, 5200, 65],[1600, 1600, 1550],[65,32],[320],[400],), (2,['201505', '201508'],[3200, 3200],[1800, 3200],[12],[1],[40],),
(3,['201412'],[40],[40],[12],[1],[3],)], ['ColA', 'Col1','Col2','Col3','Col4','Col5','Col6',])
df.show()

+----+--------------------+----------------+------------------+--------+-----+-----+
|ColA|                Col1|            Col2|              Col3|    Col4| Col5| Col6|
+----+--------------------+----------------+------------------+--------+-----+-----+
|   1|[201001, 201002, ...|[3400, 5200, 65]|[1600, 1600, 1550]|[65, 32]|[320]|[400]|
|   2|    [201505, 201508]|    [3200, 3200]|      [1800, 3200]|    [12]|  [1]| [40]|
|   3|            [201412]|            [40]|              [40]|    [12]|  [1]|  [3]|
+----+--------------------+----------------+------------------+--------+-----+-----+

This is the code that works but doesn't store the order of col1:


def aggregation(df, groupby_column, cols_to_list, cols_to_set):
  exprs = [F.collect_list(F.col(c)).alias(c) for c in cols_to_list]\
          + [F.collect_set(F.col(c)).alias(c) for c in cols_to_set]
  return df.groupby(*groupby_column).agg(*exprs)

groupby_column = ['ColA']

cols_to_list = ['Col1', 'Col2', 'Col3',]
cols_to_set = ['Col4', 'Col5', 'Col6',]

df = aggregation(df, groupby_column, cols_to_list, cols_to_set)
Rob
  • 468
  • 3
  • 15
  • 6
    Possible duplicate of [collect\_list by preserving order based on another variable](https://stackoverflow.com/questions/46580253/collect-list-by-preserving-order-based-on-another-variable), specifically [this answer](https://stackoverflow.com/a/50668635/5858851). – pault Jun 27 '19 at 20:14
  • @pault how did you get the link to specific "answer" "https://stackoverflow.com/a/50668635/5858851" in https://stackoverflow.com/questions/46580253/collect-list-by-preserving-order-based-on-another-variable? – C.S.Reddy Gadipally Jun 27 '19 at 20:55
  • 2
    @C.S.ReddyGadipally Underneath the answer, on the left, you will see "share edit flag". Click "share" to create that link. – TheWandererLee Jun 27 '19 at 22:18

1 Answers1

0

Thanks to @pault, I was able to understand what the issue was. The solution posted on the other page is complicated especially if you have too many columns to work with and plan to use both collect_list and collect_set functions. I was able to get around it by performing an orderBy paired with repartition so that all of my data is on a single partition instead of multiple partitions which was causing the issues to begin with. Remember, repartition is an expensive operation, so be careful with the usage.

Here is the code for anyone list:


def aggregation(df, groupby_column, cols_to_list, cols_to_set):
  df = df.orderBy(colA).repartition(1)
  exprs = [F.collect_list(F.col(c)).alias(c) for c in cols_to_list]\
          + [F.collect_set(F.col(c)).alias(c) for c in cols_to_set]
  return df.groupby(*groupby_column).agg(*exprs)

groupby_column = ['ColA']

cols_to_list = ['Col1', 'Col2', 'Col3',]
cols_to_set = ['Col4', 'Col5', 'Col6',]

df = aggregation(df, groupby_column, cols_to_list, cols_to_set)
Rob
  • 468
  • 3
  • 15