Convert distinct values in a Dataframe in Pyspark to a list

Question

I'm trying to get the distinct values of a column in a PySpark dataframe and then save them in a list. At the moment the list contains entries like "Row(no_children=0)", but I need only the values, as I will use them in another part of my code.

So, ideally only all_values=[0,1,2,3,4]

all_values=sorted(list(df1.select('no_children').distinct().collect()))
all_values


[Row(no_children=0),
 Row(no_children=1),
 Row(no_children=2),
 Row(no_children=3),
 Row(no_children=4)]

This takes around 15 seconds to run. Is that normal?

Thank you very much!

score 12 · Accepted Answer · answered Aug 08 '17 at 06:50

You can use collect_set from the pyspark.sql.functions module to get a column's distinct values. For example:

>>> from pyspark.sql import functions as F
>>> df1.show()
+-----------+
|no_children|
+-----------+
|          0|
|          3|
|          2|
|          4|
|          1|
|          4|
+-----------+

>>> df1.select(F.collect_set('no_children').alias('no_children')).first()['no_children']
[0, 1, 2, 3, 4]

answered Aug 08 '17 at 06:50

Suresh


Fantastic, this option is quicker. Although the command line prints WARN TaskSetManager: Stage 849 contains a task of very large size (165 KB). The maximum recommended task size is 100 KB. – VMEscoli Aug 08 '17 at 12:38
This usually occurs either when a huge list is transferred from the driver to the executors or because of how the data is partitioned. Please check this: https://stackoverflow.com/questions/28878654/spark-using-python-how-to-resolve-stage-x-contains-a-task-of-very-large-size-x – Suresh Aug 08 '17 at 17:12
Anyway, I hope this answer helped you. If you are happy with it, could you accept it? – Suresh Aug 08 '17 at 17:13
How can I accept? I had seen that post but didn't really understand what to do, but thanks! – VMEscoli Aug 08 '17 at 18:14

score 1 · Answer 2 · answered Aug 08 '17 at 01:24

You could do something like this to get only the values (using a name other than list, so the built-in is not shadowed):

values = [r.no_children for r in all_values]

values
[0, 1, 2, 3, 4]
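
The comprehension above works because each collected Row exposes its fields as attributes (and by position). Here is a minimal sketch of the same unwrapping that runs without a Spark session. The namedtuple is only a stand-in for pyspark.sql.Row, and the collected list is hypothetical sample data, not output from the original dataframe.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row (an assumption, so the sketch runs without Spark).
# A real Row is also a tuple subclass with attribute access to its fields.
Row = namedtuple("Row", ["no_children"])

# Hypothetical result of df1.select('no_children').distinct().collect()
collected = [Row(3), Row(0), Row(2), Row(4), Row(1)]

# Unwrap each Row first, then sort the plain values
all_values = sorted(r.no_children for r in collected)

# Positional access to the single field gives the same values
by_position = sorted(r[0] for r in collected)

print(all_values)
```

Sorting the unwrapped values rather than the Row objects leaves a plain list of integers, which is what the question asks for.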

answered Aug 08 '17 at 01:24

Ankush Singh


score 0 · Answer 3 · answered Oct 11 '22 at 12:55

Try this:

all_values = df1.select('no_children').distinct().rdd.flatMap(list).collect()
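
To see why flatMap(list) yields bare values: list(row) turns a one-field Row into a one-element list, and flatMap concatenates those lists. The following is a rough pure-Python analogy rather than Spark code. The namedtuple stands in for pyspark.sql.Row and the rows are hypothetical sample data.

```python
from collections import namedtuple
from itertools import chain

# Stand-in for pyspark.sql.Row (an assumption, so the sketch runs without Spark)
Row = namedtuple("Row", ["no_children"])

# Hypothetical distinct rows, as they might arrive from the RDD
rows = [Row(0), Row(1), Row(2), Row(3), Row(4)]

# flatMap(f) applies f to every element and concatenates the results;
# chain.from_iterable plays that role for an ordinary Python list
flattened = list(chain.from_iterable(list(r) for r in rows))

print(flattened)
```

In Spark itself the distinct rows are not guaranteed to come back in any particular order, so wrapping the collected result in sorted(...) may still be needed.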

answered Oct 11 '22 at 12:55

snark

