df = spark.createDataFrame(
    [
        ['A', '1', '3'],
        ['A', '2', '7'],
        ['A', '3', '1'],
        ['A', '1', '5'],
        ['A', '3', '4'],
        ['A', '5', '2'],
        ['B', '1', '8'],
        ['B', '2', '4'],
        ['B', '4', '2'],
        ['B', '6', '8']
    ],
    ['col1', 'col2', 'col3']
)
df.show()
I want to group by col1 and use the value of col2 as the condition for splitting each group: a new collected row should start whenever col2 stops increasing. The expected output is:
+----+------------+------------+
|col1| col2| col3|
+----+------------+------------+
| A| [1, 2, 3]| [3, 7, 1]|
| A| [1, 3, 5]| [5, 4, 2]|
| B|[1, 2, 4, 6]|[8, 4, 2, 8]|
+----+------------+------------+
Edit: I have changed the question and added one column (col4) to order the rows. Where values in col4 are duplicated, the relative order of those rows does not matter:
df = spark.createDataFrame(
    [
        ['A', '1', '3', '2'],
        ['A', '2', '7', '2'],
        ['A', '3', '1', '2'],
        ['A', '1', '5', '3'],
        ['A', '3', '4', '3'],
        ['A', '5', '2', '4'],
        ['B', '1', '8', '4'],
        ['B', '2', '4', '5'],
        ['B', '4', '2', '6'],
        ['B', '6', '8', '7']
    ],
    ['col1', 'col2', 'col3', 'col4']
)
df.show()