
Spark 3.0

I ran the code df.select("Name").collect() and received the output below. I want to put the result into a plain Python list. I tried adding [0] to the end, but that didn't work.

Row(Name='Andy')
Row(Name='Brandon')
Row(Name='Carl')

expected outcome = ['Andy','Brandon','Carl']
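
For reference, collect() returns a list of Row objects, so adding [0] only selects the first Row instead of unwrapping the values. A minimal sketch of the behaviour (names are just illustrative):

rows = df.select("Name").collect()
rows        #[Row(Name='Andy'), Row(Name='Brandon'), Row(Name='Carl')]
rows[0]     #Row(Name='Andy'), still a Row rather than the string 'Andy'
rows[0][0]  #'Andy', only the first value, not the whole list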
user147271

2 Answers


You can use the DataFrame's underlying rdd and map each Row to its value.

df.select('Name').rdd.map(lambda x: x[0]).collect()

['Andy', 'Brandon', 'Carl']
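
Since a Row is a tuple subclass, a flatMap variant should give the same result without the positional index (a small sketch building on the line above):

#flatMap unpacks each single-column Row directly into its value
df.select('Name').rdd.flatMap(lambda x: x).collect()
#['Andy', 'Brandon', 'Carl']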
Lamanus

Use collect_list to aggregate the column into a single array, then pull the list out by indexing into the collected result and assign it to a variable.

Example:

df.show()
#+-------+
#|   Name|
#+-------+
#|   Andy|
#|Brandon|
#|   Carl|
#+-------+

from pyspark.sql.functions import col, collect_list

output = df.agg(collect_list(col("Name"))).collect()[0][0]

output
#['Andy', 'Brandon', 'Carl']
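
If pandas is available on the driver, going through toPandas is another common option (a sketch, not part of the example above):

#Bring the single column to the driver as a pandas Series, then to a plain list
output = df.select("Name").toPandas()["Name"].tolist()

output
#['Andy', 'Brandon', 'Carl']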

Another way is to use a list comprehension:

ss = df.select("Name").collect()

output = [i[0] for i in ss]

output
#['Andy', 'Brandon', 'Carl']
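
Row objects also support attribute access, so the comprehension can reference the column by name; this should produce the same list:

output = [row.Name for row in ss]

output
#['Andy', 'Brandon', 'Carl']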
notNull