df = spark.createDataFrame(
    [
        ['A', '1', '3'],
        ['A', '2', '7'],
        ['A', '3', '1'],
        ['A', '1', '5'],
        ['A', '3', '4'],
        ['A', '5', '2'],
        ['B', '1', '8'],
        ['B', '2', '4'],
        ['B', '4', '2'],
        ['B', '6', '8']
    ],
    ['col1', 'col2', 'col3']
)
df.show()
I want to group by col1 and use the value of col2 as the condition for splitting each group: a new collected row should start whenever col2 stops increasing. The expected output is:
+----+------------+------------+
|col1| col2| col3|
+----+------------+------------+
| A| [1, 2, 3]| [3, 7, 1]|
| A| [1, 3, 5]| [5, 4, 2]|
| B|[1, 2, 4, 6]|[8, 4, 2, 8]|
+----+------------+------------+
Edit: I have changed the question and added one column (col4) to order the rows. Where values in col4 are duplicated, the relative order of those rows does not matter:
df = spark.createDataFrame(
    [
        ['A', '1', '3', '2'],
        ['A', '2', '7', '2'],
        ['A', '3', '1', '2'],
        ['A', '1', '5', '3'],
        ['A', '3', '4', '3'],
        ['A', '5', '2', '4'],
        ['B', '1', '8', '4'],
        ['B', '2', '4', '5'],
        ['B', '4', '2', '6'],
        ['B', '6', '8', '7']
    ],
    ['col1', 'col2', 'col3', 'col4']
)
df.show()